mobicham committed (verified)
Commit f6ddfb5 · 1 Parent(s): 66b3e18

Update README.md

Files changed (1): README.md (+24 -19)

README.md CHANGED
@@ -5,6 +5,9 @@ inference: false
 pipeline_tag: text-generation
 ---
 This is an <a href="https://github.com/mobiusml/hqq/">HQQ</a> all 4-bit (group-size=64) quantized <a href="https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct"> Llama3.1-8B-Instruct</a> model.
+We provide two versions:
+* Calibration-free version: https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq/
+* Calibrated version: https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib/
 
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/i0vpy66jdz3IlGQcbKqHe.png)
 
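For reference, the "4-bit, group-size 64" setting in the name maps directly onto hqq's quantization config. A minimal sketch of how such a checkpoint is produced, assuming hqq's documented `BaseQuantizeConfig`/`AutoHQQHFModel` API (the calibrated variant involves an extra calibration step not shown here):

```python
import torch
from transformers import AutoModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import BaseQuantizeConfig

# Load the fp16 base model
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3.1-8B-Instruct', torch_dtype=torch.float16)

# 4-bit weights with one quantization group per 64 weights -> "4bitgs64"
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Quantize all linear layers in place
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=torch.float16, device='cuda')
```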
@@ -12,31 +15,32 @@ This is an <a href="https://github.com/mobiusml/hqq/">HQQ</a> all 4-bit (group-s
 
 
 ## Model Size
-| Models | fp16| HQQ 4-bit/gs-64 | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"> AWQ 4-bit </a>|
-|:-------------------:|:--------:|:----------------:|:----------------:|
-| Bitrate (Linear layers) | 16 | 4.5 | 4.25 |
-| VRAM | 15.7 (GB) | <b>6.1 (GB)</b> | 6.3 (GB) |
+| Models | fp16| HQQ 4-bit/gs-64 | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"> AWQ 4-bit </a>| <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"> GPTQ 4-bit </a> |
+|:-------------------:|:--------:|:----------------:|:----------------:|:----------------:|
+| Bitrate (Linear layers) | 16 | 4.5 | 4.25 | 4.25 |
+| VRAM (GB) | 15.7 | 6.1 | 6.3 | 5.7 |
 
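The bitrate row can be checked by hand: with 4-bit weights, anything above 4.0 comes from per-group metadata. A back-of-the-envelope sketch, assuming each group stores a 16-bit scale and a 16-bit zero point (the metadata layout and the AWQ/GPTQ group size are assumptions here, not stated on the card):

```python
# Back-of-the-envelope bits-per-weight for the table above.
# Assumption (not stated on the card): 16-bit scale + 16-bit zero point per group.
def bits_per_weight(nbits, group_size, meta_bits=16 + 16):
    return nbits + meta_bits / group_size

print(bits_per_weight(4, 64))   # 4.5  -> HQQ 4-bit / group-size 64
print(bits_per_weight(4, 128))  # 4.25 -> one way to arrive at the AWQ/GPTQ figure
```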
 ## Model Decoding Speed
-| Models | fp16| HQQ 4-bit/gs-64| <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"> AWQ 4-bit </a>|
-|:-------------------:|:--------:|:----------------:|:----------------:|
-| Decoding* - short seq (tokens/sec)| 53 | <b>125</b> | 67 |
-| Decoding* - long seq (tokens/sec)| 50 | <b>97</b> | 65 |
+| Models | fp16| HQQ 4-bit/gs-64| <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"> AWQ 4-bit </a>| <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"> GPTQ 4-bit </a> |
+|:-------------------:|:--------:|:----------------:|:----------------:|:----------------:|
+| Decoding* - short seq (tokens/sec)| 53 | <b>125</b> | 67 | 3.7 |
+| Decoding* - long seq (tokens/sec)| 50 | <b>97</b> | 65 | 21 |
 
 *: RTX 3090
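These speeds are hardware-dependent (measured on an RTX 3090), and the card does not publish its benchmarking script. A minimal sketch of how tokens/sec can be measured with the plain transformers generation path, for a rough comparison:

```python
import time
import torch

@torch.inference_mode()
def tokens_per_sec(model, tokenizer, prompt, max_new_tokens=256):
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    torch.cuda.synchronize()
    t0 = time.time()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    # Count only the newly generated tokens, not the prompt
    new_tokens = out.shape[-1] - inputs['input_ids'].shape[-1]
    return new_tokens / (time.time() - t0)
```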
 
 ## Performance
 
-| Models | fp16 | HQQ 4-bit/gs-64 | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"> AWQ 4-bit </a> |
-|:-------------------:|:--------:|:----------------:|:----------------:|
-| ARC (25-shot) | 60.49 | 60.32 | 57.85 |
-| HellaSwag (10-shot)| 80.16 | 79.21 | 79.28 |
-| MMLU (5-shot) | 68.98 | | 67.14 |
-| TruthfulQA-MC2 | 54.03 | 53.89 | 51.87 |
-| Winogrande (5-shot)| 77.98 | 76.24 | 76.4 |
-| GSM8K (5-shot) | 75.44 | | 73.47 |
-| Average | 69.51 | | 67.67 |
+| Models | fp16 | HQQ 4-bit/gs-64 (no calib) | HQQ 4-bit/gs-64 (calib) | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"> AWQ 4-bit </a> | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"> GPTQ 4-bit </a> |
+|:-------------------:|:--------:|:----------------:|:----------------:|:----------------:|:----------------:|
+| ARC (25-shot) | 60.49 | 60.32 | 60.92 | 57.85 | 61.18 |
+| HellaSwag (10-shot)| 80.16 | 79.21 | 79.52 | 79.28 | 77.82 |
+| MMLU (5-shot) | 68.98 | 67.07 | 67.74 | 67.14 | 67.93 |
+| TruthfulQA-MC2 | 54.03 | 53.89 | 54.11 | 51.87 | 53.58 |
+| Winogrande (5-shot)| 77.98 | 76.24 | 76.48 | 76.4 | 76.64 |
+| GSM8K (5-shot) | 75.44 | 71.27 | 75.36 | 73.47 | 72.25 |
+| Average | 69.51 | 68.00 | <b>69.02</b> | 67.67 | 68.23 |
 
+You can reproduce the results above with lm-eval (`pip install lm-eval==0.4.3`).
 
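The exact harness invocation is not spelled out on the card. A hedged sketch using lm-eval's Python API, with task names following lm-eval conventions (the `model` and `tokenizer` objects come from the Usage section below):

```python
import lm_eval
from lm_eval.models.huggingface import HFLM

# Wrap the already-loaded model/tokenizer in lm-eval's HF adapter
lm = HFLM(pretrained=model, tokenizer=tokenizer)

# One task shown; the table uses a different few-shot count per task
# (e.g. 25-shot for ARC), so run each task with its own num_fewshot.
results = lm_eval.simple_evaluate(model=lm, tasks=['arc_challenge'], num_fewshot=25)
print(results['results'])
```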
 ## Usage
 First, install the dependencies:
@@ -57,7 +61,9 @@ from hqq.utils.generation_hf import HFGenerator
 
 #Load the model
 ###################################################
-model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq'
+model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq' #no calib version
+#model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib' #calibrated version
+
 compute_dtype = torch.float16 #bfloat16 for torchao, float16 for bitblas
 cache_dir = '.'
 model = AutoHQQHFModel.from_quantized(model_id, cache_dir=cache_dir, compute_dtype=compute_dtype)
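Once the model is loaded, the ~6.1 GB figure from the Model Size table can be sanity-checked (actual usage grows with the KV cache and context length):

```python
import torch

# Rough check against the Model Size table; excludes KV-cache growth at runtime
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.1f} GB")
```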
@@ -73,7 +79,6 @@ HQQLinear.set_backend(HQQBackend.PYTORCH)
 #prepare_for_inference(model, backend="torchao_int4")
 prepare_for_inference(model, backend="bitblas") #takes a while to init...
 
-
 #Generate
 ###################################################
 gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup() #Warm-up takes a while
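The warmed-up generator can then be called directly; a usage sketch, assuming the `HFGenerator.generate(prompt, print_tokens=...)` interface from hqq's examples:

```python
# Stream a completion with the warmed-up generator (API per hqq's examples)
out = gen.generate("Write an essay about large language models", print_tokens=True)
```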
 