Update README.md
README.md
CHANGED
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** INT8
  - **Weight quantization:** INT8
- **Intended Use Cases:** Intended for commercial and research use in English. Similarly to [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct), this model is intended for assistant-like chat.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.

### Model Optimizations

This model was obtained by quantizing the weights of [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) to the INT8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension.
Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations.
The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
GPTQ was applied with a 10% damping factor and 256 sequences taken from Neural Magic's [LLM compression calibration dataset](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration).
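
For intuition, below is a minimal PyTorch sketch of the two schemes described above (symmetric static per-channel weights, symmetric dynamic per-token activations); it is an illustration with random tensors, not the llm-compressor implementation.

```python
import torch

def quantize_weights_per_channel(w: torch.Tensor):
    # Symmetric static per-channel scheme: one scale per output channel (row of w),
    # fixed ahead of time so that the channel's largest magnitude maps to 127.
    scales = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(w / scales), -127, 127).to(torch.int8)
    return q, scales

def quantize_activations_per_token(x: torch.Tensor):
    # Symmetric dynamic per-token scheme: one scale per token (row of x),
    # computed at runtime from that token's largest magnitude.
    scales = x.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(x / scales), -127, 127).to(torch.int8)
    return q, scales

# Toy linear layer: INT8 weights and activations, INT32 accumulation, float rescale.
w = torch.randn(512, 512)  # [out_features, in_features]
x = torch.randn(8, 512)    # 8 tokens
qw, w_scales = quantize_weights_per_channel(w)
qx, x_scales = quantize_activations_per_token(x)
y = (qx.to(torch.int32) @ qw.to(torch.int32).t()).float() * x_scales * w_scales.t()
print((y - x @ w.t()).abs().max())  # quantization error is small relative to the outputs
```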

## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/stable/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a8"
number_gpus = 2

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
```
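
A minimal sketch of how the snippet above can be completed into a full generation call; the chat prompt is illustrative, and passing `number_gpus` as vLLM's `tensor_parallel_size` is an assumption.

```python
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Render an example chat prompt with the model's chat template.
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Instantiate the engine with tensor parallelism across the available GPUs.
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0].text)
```

vLLM also supports OpenAI-compatible serving; see the [vLLM documentation](https://docs.vllm.ai/en/stable/) for details.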

### Use with transformers

The following example demonstrates how the model can be used with Transformers via the `generate()` function.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a8"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # assumed loading arguments for the 70B model
)
```
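Continuing the example above, a minimal sketch of prompt construction and generation; the chat prompt and generation settings are illustrative assumptions.

```python
messages = [{"role": "user", "content": "Who are you?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
```
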
## Creation

This model was created using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library, as presented in the code snippet below.

```python
from transformers import AutoTokenizer
from datasets import load_dataset
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"

num_samples = 256
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_id)

# preprocess_fn is assumed to render each calibration example as text (its definition is omitted here).
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle().select(range(num_samples))
ds = ds.map(preprocess_fn)

recipe = GPTQModifier(
    targets="Linear",
    scheme="W8A8",
    ignore=["lm_head"],
    dampening_frac=0.1,
)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

model.save_pretrained("Meta-Llama-3-70B-Instruct-quantized.w8a8")
```

## Evaluation

The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/383bbd54bc621086e05aa1b030d8d4d5635b25e6) (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command (with 2 GPUs):
</td>
<td><strong>Meta-Llama-3-70B-Instruct</strong>
</td>
<td><strong>Meta-Llama-3-70B-Instruct-quantized.w8a8 (this model)</strong>
</td>
<td><strong>Recovery</strong>
</td>