alexmarques committed
Commit d635de5
1 Parent(s): 69291a3

Update README.md

Files changed (1)
  1. README.md +39 -15
README.md CHANGED
@@ -125,10 +125,11 @@ model.save_pretrained("Meta-Llama-3.1-8B-Instruct-quantized.w8a16")
  
  The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
  Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
- This version of the lm-evaluation-harness includes versions of ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
+ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
  
  ### Accuracy
  
+ #### Open LLM Leaderboard evaluation scores
  <table>
  <tr>
  <td><strong>Benchmark</strong>
@@ -143,29 +144,39 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge and
  <tr>
  <td>MMLU (5-shot)
  </td>
- <td>67.94
+ <td>69.43
  </td>
- <td>68.09
+ <td>69.37
  </td>
- <td>100.2%
+ <td>99.9%
  </td>
  </tr>
  <tr>
- <td>ARC Challenge (0-shot)
+ <td>MMLU (CoT, 0-shot)
  </td>
- <td>83.19
+ <td>72.56
  </td>
- <td>82.68
+ <td>72.14
  </td>
  <td>99.4%
  </td>
  </tr>
+ <tr>
+ <td>ARC Challenge (0-shot)
+ </td>
+ <td>81.57
+ </td>
+ <td>81.48
+ </td>
+ <td>99.9%
+ </td>
+ </tr>
  <tr>
  <td>GSM-8K (CoT, 8-shot, strict-match)
  </td>
  <td>82.79
  </td>
- <td>82.64
+ <td>81.64
  </td>
  <td>99.8%
  </td>
@@ -175,7 +186,7 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge and
  </td>
  <td>80.01
  </td>
- <td>80.21
+ <td>80.1
  </td>
  <td>100.3%
  </td>
@@ -203,9 +214,9 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge and
  <tr>
  <td><strong>Average</strong>
  </td>
- <td><strong>74.31</strong>
+ <td><strong>74.04</strong>
  </td>
- <td><strong>74.17</strong>
+ <td><strong>73.89</strong>
  </td>
  <td><strong>99.8%</strong>
  </td>
@@ -220,17 +231,30 @@ The results were obtained using the following commands:
  ```
  lm_eval \
    --model vllm \
-   --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
-   --tasks mmlu \
+   --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
+   --tasks mmlu_llama_3.1_instruct \
+   --fewshot_as_multiturn \
+   --apply_chat_template \
    --num_fewshot 5 \
    --batch_size auto
  ```
  
+ #### MMLU-CoT
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
+   --tasks mmlu_cot_0shot_llama_3.1_instruct \
+   --apply_chat_template \
+   --num_fewshot 0 \
+   --batch_size auto
+ ```
+ 
  #### ARC-Challenge
  ```
  lm_eval \
    --model vllm \
-   --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+   --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
    --tasks arc_challenge_llama_3.1_instruct \
    --apply_chat_template \
    --num_fewshot 0 \
@@ -241,7 +265,7 @@ lm_eval \
  ```
  lm_eval \
    --model vllm \
-   --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+   --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
    --tasks gsm8k_cot_llama_3.1_instruct \
    --fewshot_as_multiturn \
    --apply_chat_template \
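
For readers comparing the two score columns with the Recovery column: Recovery appears to be the quantized model's score as a percentage of the unquantized baseline's, rounded to one decimal place. A minimal sketch of that arithmetic, checked against the updated MMLU rows (the `recovery` helper is illustrative, not part of the commit):

```
# Recovery as it appears to be computed in the table above:
# quantized score / unquantized score, shown as a percentage.
# The helper name is ours, for illustration only.
def recovery(unquantized: float, quantized: float) -> str:
    return f"{100 * quantized / unquantized:.1f}%"

print(recovery(69.43, 69.37))  # 99.9% -- matches the MMLU (5-shot) row
print(recovery(72.56, 72.14))  # 99.4% -- matches the MMLU (CoT, 0-shot) row
```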
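The updated lm_eval commands can likely also be driven from Python, assuming the Neural Magic fork keeps upstream lm-evaluation-harness 0.4's `simple_evaluate` entry point; a hedged sketch of the new 5-shot MMLU invocation (argument names mirror the CLI flags above and are not verified against the fork):

```
import lm_eval

# Sketch: the updated MMLU command as a programmatic call.
# Assumes the fork exposes upstream lm-eval 0.4's simple_evaluate().
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16,"
        "dtype=auto,add_bos_token=True,max_model_len=3850,"
        "max_gen_toks=10,tensor_parallel_size=1"
    ),
    tasks=["mmlu_llama_3.1_instruct"],
    num_fewshot=5,
    fewshot_as_multiturn=True,
    apply_chat_template=True,
    batch_size="auto",
)
print(results["results"])
```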