update int8 quantization info

- README.md +30 -17
- quantize_config.json +1 -1

README.md CHANGED
@@ -37,7 +37,7 @@ For more details about the open-source model of Qwen-14B, please refer to the [G
 ## 要求(Requirements)
 
 * python 3.8及以上版本
-* pytorch 2.0
+* pytorch 2.0及以上版本
 * 建议使用CUDA 11.4及以上(GPU用户、flash-attention用户等需考虑此选项)
 * python 3.8 and above
 * pytorch 2.0 and above, 2.0 and above are recommended
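A quick way to sanity-check these requirements in a target environment (a minimal sketch; the version thresholds come from the README above, the snippet itself is not part of the repo):

```python
import sys
import torch

# Compare the local environment against the requirements listed above.
print("python :", sys.version.split()[0])    # needs 3.8+
print("pytorch:", torch.__version__)         # needs 2.0+
print("cuda   :", torch.version.cuda)        # 11.4+ recommended for GPU / flash-attention users
print("gpu    :", torch.cuda.is_available())
```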
@@ -104,40 +104,53 @@ For more information, please refer to our [GitHub repo](https://github.com/QwenL
 
 ### 效果评测
 
-我们对BF16和Int4模型在基准评测上做了测试(使用zero-shot设置),发现量化模型效果损失较小,结果如下所示:
+我们对BF16,Int8和Int4模型在基准评测上做了测试(使用zero-shot设置),发现量化模型效果损失较小,结果如下所示:
 
-We illustrate the zero-shot performance of both BF16 and Int4 models on the benchmark, and we find that the quantized model does not suffer from significant performance degradation. Results are shown below:
+We illustrate the zero-shot performance of the BF16, Int8 and Int4 models on the benchmark, and find that the quantized models do not suffer from significant performance degradation. Results are shown below:
 
 | Quantization | MMLU | CEval (val) | GSM8K | Humaneval |
 |--------------|:----:|:-----------:|:-----:|:---------:|
-| BF16         | 64.6 | 69.8        | …
+| BF16         | 64.6 | 69.8        | 60.1  | 43.9      |
+| Int8         | 63.6 | 68.6        | 60.0  | 48.2      |
 | Int4         | 63.3 | 69.0        | 59.8  | 45.7      |
 
 ### 推理速度 (Inference Speed)
 
-We measured the average inference speed of generating 2048 and 8192 tokens …
+我们测算了不同精度模型以及不同FlashAttn库版本下模型生成2048和8192个token的平均推理速度。如图所示:
+
+We measured the average inference speed (tokens/s) of generating 2048 and 8192 tokens under different quantization levels and different versions of flash-attention.
+
+| Quantization | FlashAttn | Speed (2048 tokens) | Speed (8192 tokens) |
+| ------------ | :-------: | :-----------------: | :-----------------: |
+| BF16         | v2        | 32.88               | 24.87               |
+| Int8         | v2        | 29.28               | 24.22               |
+| Int4         | v2        | 38.72               | 27.33               |
+| BF16         | v1        | 32.76               | 28.89               |
+| Int8         | v1        | 28.31               | 23.87               |
+| Int4         | v1        | 37.81               | 26.46               |
+| BF16         | Disabled  | 29.32               | 22.91               |
+| Int8         | Disabled  | 31.12               | 24.60               |
+| Int4         | Disabled  | 37.65               | 26.00               |
 
-具体而言,我们记录在长度为1的上下文的条件下生成8192个token的性能。评测运行于单张A100-SXM4-80G GPU,使用PyTorch 2.0.1和CUDA 11.…
+具体而言,我们记录在长度为1的上下文的条件下生成8192个token的性能。评测运行于单张A100-SXM4-80G GPU,使用PyTorch 2.0.1和CUDA 11.8。推理速度是生成8192个token的速度均值。
 
-In detail, the setting of profiling is generating 8192 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.…
+In detail, the profiling setting is generating 8192 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8. The inference speed is averaged over the 8192 generated tokens.
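For reference, a minimal sketch of this kind of speed measurement (one context token, a fixed number of new tokens, average tokens/s). The repo id and generation settings here are illustrative assumptions, not the authors' profile.py:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-14B-Chat-Int8"  # assumed repo id for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda:0", trust_remote_code=True
).eval()

# Start from a (near) single-token context and force a fixed number of new tokens.
n_new = 8192
input_ids = tokenizer("你", return_tensors="pt").input_ids.to(model.device)

torch.cuda.synchronize()
start = time.time()
out = model.generate(input_ids, max_new_tokens=n_new, min_new_tokens=n_new, do_sample=True)
torch.cuda.synchronize()
elapsed = time.time() - start

generated = out.shape[1] - input_ids.shape[1]
print(f"{generated / elapsed:.2f} tokens/s")
```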
+
+注意:以上Int4/Int8模型生成速度使用autogptq库给出,当前``AutoModelForCausalLM.from_pretrained``载入的模型生成速度会慢大约20%。我们已经将该问题汇报给HuggingFace团队,若有解决方案将即时更新。
+
+Note: The generation speed of the Int4/Int8 models above is measured with the autogptq library. A model loaded with ``AutoModelForCausalLM.from_pretrained`` currently generates roughly 20% more slowly. We have reported this issue to the HuggingFace team and will update here promptly if a solution becomes available.
 
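The two loading paths compared in that note might look roughly like this (a sketch; the repo id and arguments are assumptions, and auto_gptq must be installed):

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoModelForCausalLM

model_id = "Qwen/Qwen-14B-Chat-Int8"  # assumed repo id for illustration

# Path 1: load the GPTQ checkpoint directly with autogptq (the speed figures above).
model_gptq = AutoGPTQForCausalLM.from_quantized(
    model_id, device="cuda:0", trust_remote_code=True, use_safetensors=True
).eval()

# Path 2: load through transformers; per the note above this currently generates ~20% slower.
model_hf = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda:0", trust_remote_code=True
).eval()
```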
 ### 显存使用 (GPU Memory Usage)
 
+我们还测算了不同模型精度编码2048个token及生成8192个token的峰值显存占用情况。(显存消耗在是否使用FlashAttn的情况下均类似。)结果如下所示:
+
-We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating single token) and generating 8192 tokens (with single token as context) under …
+We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating a single token) and for generating 8192 tokens (with a single token as context) under different quantization levels. (The GPU memory usage is similar with and without flash-attention.) The results are shown below.
 
 | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
-| BF16 | …
+| ------------------ | :---------------------------------: | :-----------------------------------: |
+| BF16               | 30.15GB                             | 38.94GB                                |
+| Int8               | 18.81GB                             | 27.54GB                                |
+| Int4               | 13.01GB                             | 21.79GB                                |
 
 上述性能测算使用[此脚本](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py)完成。
 
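The peak-memory figures can be reproduced in spirit with torch's CUDA memory statistics; a minimal sketch (not the linked profile.py, and the random 2048-token context is an assumption about how the prompt is built):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-14B-Chat-Int8"  # assumed repo id for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda:0", trust_remote_code=True
).eval()

# Peak usage for encoding a 2048-token context and generating a single token.
context = torch.randint(0, tokenizer.vocab_size, (1, 2048), device=model.device)
torch.cuda.reset_peak_memory_stats()
model.generate(context, max_new_tokens=1)
# max_memory_allocated reports peak tensor allocations, not total reserved memory.
print(f"encode 2048: {torch.cuda.max_memory_allocated() / 2**30:.2f} GB")
```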
quantize_config.json CHANGED

@@ -7,5 +7,5 @@
   "sym": true,
   "true_sequential": true,
   "model_name_or_path": null,
-  "model_file_base_name":
+  "model_file_base_name": "model"
 }
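For context, `model_file_base_name` is the basename GPTQ loaders use to locate the quantized weight file, so the value set here points them at e.g. `model.safetensors`. A trivial sketch for inspecting the shipped config (local path assumed):

```python
import json

# Read the quantization config shipped with the checkpoint.
with open("quantize_config.json") as f:
    cfg = json.load(f)

print(cfg["model_file_base_name"])      # "model" -> loader looks for model.safetensors / model.bin
print(cfg["sym"], cfg["true_sequential"])
```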