Update README.md
This is an <a href="https://github.com/mobiusml/hqq/">HQQ</a> all 4-bit (group-size=64) quantized <a href="https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct"> Llama3.1-8B-Instruct</a> model.

We provide two versions:
* Calibration-free version: https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq/
* Calibrated version: https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib/

![image/png](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/i0vpy66jdz3IlGQcbKqHe.png)

## Model Size
| Models | fp16 | HQQ 4-bit/gs-64 | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4">AWQ 4-bit</a> | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4">GPTQ 4-bit</a> |
|:-----------------------:|:----:|:---------------:|:----------------:|:----------------:|
| Bitrate (Linear layers) | 16   | 4.5             | 4.25             | 4.25             |
| VRAM (GB)               | 15.7 | 6.1             | 6.3              | 5.7              |

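As a sanity check on the bitrates above: the per-layer figure is the weight bits plus the per-group metadata amortized over the group size. The sketch below assumes each group stores a 16-bit scale and a 16-bit zero-point on top of the 4-bit weights, and that the linked AWQ/GPTQ checkpoints use group-size 128; these storage details are assumptions that happen to match the 4.5 / 4.25 figures, not an official breakdown.

```Python
# Back-of-the-envelope effective bitrate for group-wise 4-bit quantization.
# Assumption: each group carries a 16-bit scale and a 16-bit zero-point.
def effective_bitrate(weight_bits=4, group_size=64, scale_bits=16, zero_bits=16):
    return weight_bits + (scale_bits + zero_bits) / group_size

print(effective_bitrate(group_size=64))   # 4.5  -> HQQ 4-bit / gs-64
print(effective_bitrate(group_size=128))  # 4.25 -> typical 4-bit / gs-128 setups
```
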
## Model Decoding Speed
| Models | fp16 | HQQ 4-bit/gs-64 | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4">AWQ 4-bit</a> | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4">GPTQ 4-bit</a> |
|:----------------------------------:|:----:|:----------:|:----:|:----:|
| Decoding* - short seq (tokens/sec) | 53   | <b>125</b> | 67   | 3.7  |
| Decoding* - long seq (tokens/sec)  | 50   | <b>97</b>  | 65   | 21   |

*: Measured on an RTX 3090.

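The exact benchmark script behind these numbers is not part of the card. A rough way to measure decode throughput yourself, once `model` and `tokenizer` are loaded as in the Usage section below, is sketched here; note that plain `model.generate` (without the compiled `HFGenerator` path used further down) will typically run slower than the HQQ numbers in the table.

```Python
# Rough decode-throughput measurement (a sketch, not the exact benchmark above).
# Assumes `model` and `tokenizer` are already loaded as in the Usage section below.
import time
import torch

prompt = "Write an essay about large language models."  #arbitrary example prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

torch.cuda.synchronize()
start = time.time()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
```
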
## Performance
| Models | fp16 | HQQ 4-bit/gs-64 (no calib) | HQQ 4-bit/gs-64 (calib) | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4">AWQ 4-bit</a> | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4">GPTQ 4-bit</a> |
|:-------------------:|:-----:|:-----:|:------------:|:-----:|:-----:|
| ARC (25-shot)       | 60.49 | 60.32 | 60.92        | 57.85 | 61.18 |
| HellaSwag (10-shot) | 80.16 | 79.21 | 79.52        | 79.28 | 77.82 |
| MMLU (5-shot)       | 68.98 | 67.07 | 67.74        | 67.14 | 67.93 |
| TruthfulQA-MC2      | 54.03 | 53.89 | 54.11        | 51.87 | 53.58 |
| Winogrande (5-shot) | 77.98 | 76.24 | 76.48        | 76.4  | 76.64 |
| GSM8K (5-shot)      | 75.44 | 71.27 | 75.36        | 73.47 | 72.25 |
| Average             | 69.51 | 68.00 | <b>69.02</b> | 67.67 | 68.23 |

You can reproduce the results above with `lm-eval` (install it via `pip install lm-eval==0.4.3`).

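The card does not spell out the exact harness invocation. One way to score the already-loaded quantized model (see the Usage section below) is to wrap it in the harness's HF adapter; the task, few-shot setting, and batch size here are illustrative assumptions.

```Python
# Sketch: evaluate the loaded model with lm-evaluation-harness 0.4.3.
# `model` and `tokenizer` are assumed to be loaded as in the Usage section below;
# the task list, few-shot setting, and batch size are illustrative only.
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM

lm = HFLM(pretrained=model, tokenizer=tokenizer, batch_size=8)
results = evaluator.simple_evaluate(model=lm, tasks=["arc_challenge"], num_fewshot=25)
print(results["results"]["arc_challenge"])
```
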
## Usage
First, install the dependencies, then load the model and run generation as follows:

```Python
import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import HQQLinear, HQQBackend
from hqq.utils.patching import prepare_for_inference
from hqq.utils.generation_hf import HFGenerator

#Load the model
###################################################
model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq'        #no calib version
#model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib' #calibrated version

compute_dtype = torch.float16 #bfloat16 for torchao, float16 for bitblas
cache_dir     = '.'
model     = AutoHQQHFModel.from_quantized(model_id, cache_dir=cache_dir, compute_dtype=compute_dtype)
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)

#Use an optimized inference backend
###################################################
HQQLinear.set_backend(HQQBackend.PYTORCH)
#prepare_for_inference(model, backend="torchao_int4")
prepare_for_inference(model, backend="bitblas") #takes a while to init...

#Generate
###################################################
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup() #Warm-up takes a while
```
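
Once the generator is warmed up, you can stream a completion; the prompt below is just an example.

```Python
#Stream a response with the warmed-up generator (the prompt is an arbitrary example)
gen.generate("Write an essay about large language models.", print_tokens=True)
```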