Update README.md
This is an <a href="https://github.com/mobiusml/hqq/">HQQ</a> all 4-bit (group-size=64) quantized <a href="https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct"> Llama3.1-8B-Instruct</a> model.

We provide two versions:
* Calibration-free version: https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq/
* Calibrated version: https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib/

![image/png](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/i0vpy66jdz3IlGQcbKqHe.png)

## Model Size
| Models | fp16 | HQQ 4-bit/gs-64 | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4">AWQ 4-bit</a> | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4">GPTQ 4-bit</a> |
|:-----------------------:|:----:|:---------------:|:----------------:|:----------------:|
| Bitrate (Linear layers) | 16   | 4.5             | 4.25             | 4.25             |
| VRAM (GB)               | 15.7 | 6.1             | 6.3              | 5.7              |

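As a sanity check on the bitrates above: the per-layer figure is the weight bits plus the per-group metadata amortized over the group size. The sketch below assumes each group stores a 16-bit scale and a 16-bit zero-point on top of the 4-bit weights, and that the linked AWQ/GPTQ checkpoints use group-size 128; these storage details are assumptions that happen to match the 4.5 / 4.25 figures, not an official breakdown.

```Python
# Back-of-the-envelope effective bitrate for group-wise 4-bit quantization.
# Assumption: each group carries a 16-bit scale and a 16-bit zero-point.
def effective_bitrate(weight_bits=4, group_size=64, scale_bits=16, zero_bits=16):
    return weight_bits + (scale_bits + zero_bits) / group_size

print(effective_bitrate(group_size=64))   # 4.5  -> HQQ 4-bit / gs-64
print(effective_bitrate(group_size=128))  # 4.25 -> typical 4-bit / gs-128 setups
```
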
## Model Decoding Speed
| Models | fp16 | HQQ 4-bit/gs-64 | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4">AWQ 4-bit</a> | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4">GPTQ 4-bit</a> |
|:----------------------------------:|:----:|:----------:|:----:|:----:|
| Decoding* - short seq (tokens/sec) | 53   | <b>125</b> | 67   | 3.7  |
| Decoding* - long seq (tokens/sec)  | 50   | <b>97</b>  | 65   | 21   |

*: Measured on an RTX 3090.

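The exact benchmark script behind these numbers is not part of the card. A rough way to measure decode throughput yourself, once `model` and `tokenizer` are loaded as in the Usage section below, is sketched here; note that plain `model.generate` (without the compiled `HFGenerator` path used further down) will typically run slower than the HQQ numbers in the table.

```Python
# Rough decode-throughput measurement (a sketch, not the exact benchmark above).
# Assumes `model` and `tokenizer` are already loaded as in the Usage section below.
import time
import torch

prompt = "Write an essay about large language models."  #arbitrary example prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

torch.cuda.synchronize()
start = time.time()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
```
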
## Performance
| Models | fp16 | HQQ 4-bit/gs-64 (no calib) | HQQ 4-bit/gs-64 (calib) | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4">AWQ 4-bit</a> | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4">GPTQ 4-bit</a> |
|:-------------------:|:-----:|:-----:|:------------:|:-----:|:-----:|
| ARC (25-shot)       | 60.49 | 60.32 | 60.92        | 57.85 | 61.18 |
| HellaSwag (10-shot) | 80.16 | 79.21 | 79.52        | 79.28 | 77.82 |
| MMLU (5-shot)       | 68.98 | 67.07 | 67.74        | 67.14 | 67.93 |
| TruthfulQA-MC2      | 54.03 | 53.89 | 54.11        | 51.87 | 53.58 |
| Winogrande (5-shot) | 77.98 | 76.24 | 76.48        | 76.4  | 76.64 |
| GSM8K (5-shot)      | 75.44 | 71.27 | 75.36        | 73.47 | 72.25 |
| Average             | 69.51 | 68.00 | <b>69.02</b> | 67.67 | 68.23 |

You can reproduce the results above with `lm-eval` (install it via `pip install lm-eval==0.4.3`).

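The card does not spell out the exact harness invocation. One way to score the already-loaded quantized model (see the Usage section below) is to wrap it in the harness's HF adapter; the task, few-shot setting, and batch size here are illustrative assumptions.

```Python
# Sketch: evaluate the loaded model with lm-evaluation-harness 0.4.3.
# `model` and `tokenizer` are assumed to be loaded as in the Usage section below;
# the task list, few-shot setting, and batch size are illustrative only.
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM

lm = HFLM(pretrained=model, tokenizer=tokenizer, batch_size=8)
results = evaluator.simple_evaluate(model=lm, tasks=["arc_challenge"], num_fewshot=25)
print(results["results"]["arc_challenge"])
```
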
## Usage
First, install the dependencies, then load the model and run generation as follows:

```Python
import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import HQQLinear, HQQBackend
from hqq.utils.patching import prepare_for_inference
from hqq.utils.generation_hf import HFGenerator

#Load the model
###################################################
model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq'        #no calib version
#model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib' #calibrated version

compute_dtype = torch.float16 #bfloat16 for torchao, float16 for bitblas
cache_dir     = '.'
model     = AutoHQQHFModel.from_quantized(model_id, cache_dir=cache_dir, compute_dtype=compute_dtype)
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)

#Use an optimized inference backend
###################################################
HQQLinear.set_backend(HQQBackend.PYTORCH)
#prepare_for_inference(model, backend="torchao_int4")
prepare_for_inference(model, backend="bitblas") #takes a while to init...

#Generate
###################################################
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup() #Warm-up takes a while
```
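
Once the generator is warmed up, you can stream a completion; the prompt below is just an example.

```Python
#Stream a response with the warmed-up generator (the prompt is an arbitrary example)
gen.generate("Write an essay about large language models.", print_tokens=True)
```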