alexmarques committed on
Commit 7081ae4 · verified · 1 Parent(s): f978f64

Update README.md

Files changed (1)
  1. README.md +32 -25
README.md CHANGED
@@ -13,6 +13,7 @@ license_link: https://llama.meta.com/llama3/license/
  - **Input:** Text
  - **Output:** Text
  - **Model Optimizations:**
  - **Weight quantization:** INT8
  - **Intended Use Cases:** Intended for commercial and research use in English. Similarly to [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct), this model is intended for assistant-like chat.
  - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
@@ -27,10 +28,14 @@ It achieves an average score of 79.18 on the [OpenLLM](https://huggingface.co/sp
  ### Model Optimizations

  This model was obtained by quantizing the weights of [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) to INT8 data type.
- This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.

- Only the weights of the linear operators within transformers blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the INT8 and floating point representations of the quantized weights.
- [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) is used for quantization with 10% damping factor and 128 sequences taken from Neural Magic's [LLM compression calibration dataset](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration).

  ## Deployment
@@ -43,7 +48,7 @@ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/
  from vllm import LLM, SamplingParams
  from transformers import AutoTokenizer

- model_id = "neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16"
  number_gpus = 2

  sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
@@ -69,13 +74,12 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do

  ### Use with transformers

- This model is supported by Transformers leveraging the integration with the [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) data format.
- The following example contemplates how the model can be used using the `generate()` function.

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM

- model_id = "neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16"

  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(
@@ -114,17 +118,17 @@ print(tokenizer.decode(response, skip_special_tokens=True))

  ## Creation

- This model was created by applying the [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) library as presented in the code snipet below.
- Although AutoGPTQ was used for this particular model, Neural Magic is transitioning to using [llm-compressor](https://github.com/vllm-project/llm-compressor) which supports several quantization schemes and models not supported by AutoGPTQ.

  ```python
  from transformers import AutoTokenizer
- from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
  from datasets import load_dataset

- model_id = "meta-llama/Meta-Llama-3-70B-Instruct"

- num_samples = 128
  max_seq_len = 8192

  tokenizer = AutoTokenizer.from_pretrained(model_id)
@@ -136,28 +140,31 @@ ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
  ds = ds.shuffle().select(range(num_samples))
  ds = ds.map(preprocess_fn)

- examples = [tokenizer(example["text"], padding=False, max_length=max_seq_len, truncation=True) for example in ds]
-
- quantize_config = BaseQuantizeConfig(
-     bits=8,
-     group_size=-1,
-     desc_act=False,
-     model_file_base_name="model",
-     damp_percent=0.1,
  )

- model = AutoGPTQForCausalLM.from_pretrained(
      model_id,
-     quantize_config,
      device_map="auto",
  )

- model.quantize(examples)
  model.save_pretrained("Meta-Llama-3-70B-Instruct-quantized.w8a8")
  ```

-
  ## Evaluation

  The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/383bbd54bc621086e05aa1b030d8d4d5635b25e6) (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command (using 2 GPUs):
@@ -178,7 +185,7 @@ lm_eval \
  </td>
  <td><strong>Meta-Llama-3-70B-Instruct </strong>
  </td>
- <td><strong>Meta-Llama-3-70B-Instruct-quantized.w8a16 (this model)</strong>
  </td>
  <td><strong>Recovery</strong>
  </td>

  - **Input:** Text
  - **Output:** Text
  - **Model Optimizations:**
+ - **Activation quantization:** INT8
  - **Weight quantization:** INT8
  - **Intended Use Cases:** Intended for commercial and research use in English. Similarly to [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct), this model is intended for assistant-like chat.
  - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.

  ### Model Optimizations

  This model was obtained by quantizing the weights of [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) to INT8 data type.
+ This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
+ Weight quantization also reduces disk size requirements by approximately 50%.

+ Only weights and activations of the linear operators within transformers blocks are quantized.
+ Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension.
+ Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations.
+ The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
+ GPTQ used a 10% damping factor and 256 sequences taken from Neural Magic's [LLM compression calibration dataset](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration).
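
As a concrete illustration of the scaling described above, a minimal sketch of symmetric per-channel weight scales and dynamic per-token activation scales is shown below. The helper names are hypothetical and this is not the llm-compressor implementation.

```python
import torch

def per_channel_weight_scales(weight: torch.Tensor) -> torch.Tensor:
    # weight: [out_features, in_features]. One fixed (static) symmetric scale per
    # output channel, chosen so that channel's largest magnitude maps to 127.
    return weight.abs().amax(dim=1, keepdim=True) / 127.0

def per_token_activation_scales(activations: torch.Tensor) -> torch.Tensor:
    # activations: [num_tokens, hidden_size]. One symmetric scale per token,
    # computed dynamically at runtime from that token's largest magnitude.
    return activations.abs().amax(dim=1, keepdim=True) / 127.0

def quantize_dequantize(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Map values onto the INT8 grid and back, illustrating the linear scaling
    # between INT8 and floating point representations.
    return torch.clamp(torch.round(x / scale), -128, 127) * scale
```

Weight scales are fixed once at quantization time, while activation scales are recomputed per token at inference, matching the static/dynamic distinction above.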
  ## Deployment
 
  from vllm import LLM, SamplingParams
  from transformers import AutoTokenizer

+ model_id = "neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a8"
  number_gpus = 2

  sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
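
The hunk above shows only the changed line of the deployment snippet. A minimal sketch of how the rest of the vLLM example typically proceeds, assuming the standard `LLM`/`generate` API and a placeholder prompt, is:

```python
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Format a chat-style prompt with the model's chat template (placeholder message).
messages = [{"role": "user", "content": "Who are you?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Shard the model across the available GPUs and generate.
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0].text)
```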
 

  ### Use with transformers

+ The following example shows how the model can be deployed in Transformers using the `generate()` function.

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM

+ model_id = "neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a8"

  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(
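      # The rest of this example lies outside the diff context; the completion
      # below is an illustrative sketch assuming standard transformers
      # chat-template usage, and is not part of this commit.
      torch_dtype="auto",
      device_map="auto",
  )
  messages = [{"role": "user", "content": "Who are you?"}]
  input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
  outputs = model.generate(input_ids, max_new_tokens=256)
  response = outputs[0][input_ids.shape[-1]:]
  print(tokenizer.decode(response, skip_special_tokens=True))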
 

  ## Creation

+ This model was created using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library, as presented in the code snippet below.

  ```python
  from transformers import AutoTokenizer
  from datasets import load_dataset
+ from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
+ from llmcompressor.modifiers.quantization import GPTQModifier

+ model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

+ num_samples = 256
  max_seq_len = 8192

  tokenizer = AutoTokenizer.from_pretrained(model_id)
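  # preprocess_fn and the preceding load_dataset call fall outside this hunk's
  # context; the definition below is a hypothetical sketch (the "messages" column
  # name is an assumption), not part of this commit.
  def preprocess_fn(example):
      return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}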
 
  ds = ds.shuffle().select(range(num_samples))
  ds = ds.map(preprocess_fn)

+ recipe = GPTQModifier(
+     targets="Linear",
+     scheme="W8A8",
+     ignore=["lm_head"],
+     dampening_frac=0.1,
  )

+ model = SparseAutoModelForCausalLM.from_pretrained(
      model_id,
      device_map="auto",
+     trust_remote_code=True,
+ )
+
+ oneshot(
+     model=model,
+     dataset=ds,
+     recipe=recipe,
+     max_seq_length=max_seq_len,
+     num_calibration_samples=num_samples,
  )

  model.save_pretrained("Meta-Llama-3-70B-Instruct-quantized.w8a8")
  ```

  ## Evaluation

  The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/383bbd54bc621086e05aa1b030d8d4d5635b25e6) (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command (using 2 GPUs):
  </td>
  <td><strong>Meta-Llama-3-70B-Instruct </strong>
  </td>
+ <td><strong>Meta-Llama-3-70B-Instruct-quantized.w8a8 (this model)</strong>
  </td>
  <td><strong>Recovery</strong>
  </td>