puneeshkhanna committed
Commit 06d0360 · verified · 1 Parent(s): 2843dfe

Update eval results with fewshot_as_multiturn

Files changed (1): README.md (+30, -30)
README.md CHANGED
@@ -184,7 +184,7 @@ print(response)
  ## Benchmarks
  We report in the following table our internal pipeline benchmarks.
  - We use [lm-evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness).
- - We report **raw scores** obtained by applying chat template **without fewshot_as_multiturn** (unlike Llama3.1).
+ - We report **raw scores** obtained by applying chat template and fewshot_as_multiturn.
  - We use same batch-size across all models.
 
 
@@ -210,51 +210,51 @@ We report in the following table our internal pipeline benchmarks.
  <tr>
  <td rowspan="3">General</td>
  <td>MMLU (5-shot)</td>
- <td>70</td>
- <td>65.9</td>
- <td><b>71.6</td>
+ <td>68.8</td>
+ <td>66.0</td>
+ <td><b>73.9</b></td>
  </tr>
  <tr>
  <td>MMLU-PRO (5-shot)</td>
- <td>39.6</td>
- <td>32.7</td>
- <td><b>44</td>
+ <td>38.8</td>
+ <td>34.3</td>
+ <td><b>44</b></td>
  </tr>
  <tr>
  <td>IFEval</td>
  <td>57.6</td>
  <td>63.4</td>
- <td><b>78</td>
+ <td><b>78</b></td>
  </tr>
  <tr>
  <td rowspan="3">Math</td>
  <td>GSM8K (5-shot)</td>
- <td>76.6</td>
- <td>73.8</td>
- <td><b>83.1</td>
+ <td>77.1</td>
+ <td>77.6</td>
+ <td><b>84.9</b></td>
  </tr>
  <tr>
  <td>GSM8K (8-shot, COT)</td>
  <td>78.5</td>
  <td>73.6</td>
- <td><b>81.3</td>
+ <td><b>81.3</b></td>
  </tr>
  <tr>
  <td>MATH Lvl-5 (4-shot)</td>
- <td>8.8</td>
- <td>0.4</td>
- <td><b>22.1</td>
+ <td>3.3</td>
+ <td>5.9</td>
+ <td><b>22.1</b></td>
  </tr>
  <tr>
  <td rowspan="5">Reasoning</td>
  <td>Arc Challenge (25-shot)</td>
- <td>51.9</td>
- <td>61.6</td>
- <td><b>64.5</td>
+ <td>58.3</td>
+ <td>63.4</td>
+ <td><b>66.2</b></td>
  </tr>
  <tr>
  <td>GPQA (0-shot)</td>
- <td><b>35.4</td>
+ <td><b>35.6</b></td>
  <td>33.2</td>
  <td>33.5</td>
  </tr>
@@ -262,32 +262,32 @@ We report in the following table our internal pipeline benchmarks.
  <td>GPQA (0-shot, COT)</td>
  <td>16</td>
  <td>12.7</td>
- <td><b>32.6</td>
+ <td><b>32.6</b></td>
  </tr>
  <tr>
  <td>MUSR (0-shot)</td>
- <td><b>41.9</td>
+ <td><b>41.9</b></td>
  <td>38.1</td>
  <td>41.1</td>
  </tr>
  <tr>
  <td>BBH (3-shot)</td>
- <td>49.2</td>
- <td>43.6</td>
- <td><b>58.4</td>
+ <td>50.6</td>
+ <td>47.5</td>
+ <td><b>58.4</b></td>
  </tr>
  <tr>
  <td rowspan="4">CommonSense Understanding</td>
  <td>PIQA (0-shot)</td>
  <td>76.4</td>
  <td>78.2</td>
- <td><b>78.4</td>
+ <td><b>78.4</b></td>
  </tr>
  <tr>
  <td>SciQ (0-shot)</td>
  <td>61.7</td>
  <td>76.4</td>
- <td><b>90.4</td>
+ <td><b>90.4</b></td>
  </tr>
  <tr>
  <td>Winogrande (0-shot)</td>
@@ -299,19 +299,19 @@ We report in the following table our internal pipeline benchmarks.
  <td>OpenbookQA (0-shot)</td>
  <td>43.2</td>
  <td>47.4</td>
- <td><b>48.2</td>
+ <td><b>48.2</b></td>
  </tr>
  <tr>
  <td rowspan="2">Instructions following</td>
  <td>MT-Bench (avg)</td>
  <td>8.28</td>
- <td><b>8.6</td>
+ <td><b>8.6</b></td>
  <td>8.17</td>
  </tr>
  <tr>
  <td>Alpaca (WC)</td>
  <td>25.81</td>
- <td><b>45.44</td>
+ <td><b>45.44</b></td>
  <td>24.7</td>
  </tr>
  <tr>
@@ -319,7 +319,7 @@ We report in the following table our internal pipeline benchmarks.
  <td>BFCL AST (avg)</td>
  <td>48.4</td>
  <td>74.2</td>
- <td><b>86.3</td>
+ <td><b>86.3</b></td>
  </tr>
  <tr>
  <td rowspan="2">Code</td>
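
For context on the updated methodology note: the new scores are reported with the chat template applied and few-shot examples presented as multi-turn conversation in lm-evaluation-harness. Below is a minimal, unofficial sketch of such a run, assuming a recent harness release that exposes `apply_chat_template` and `fewshot_as_multiturn` in `simple_evaluate`; the model id, task, and batch size are placeholders, not the exact settings behind this commit.

```python
# Unofficial sketch: chat-template + fewshot_as_multiturn evaluation with
# lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness).
# MODEL_ID, the task list, and batch_size are placeholders for illustration.
import lm_eval

MODEL_ID = "org/model-name"  # placeholder: substitute the repo id of the evaluated model

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=f"pretrained={MODEL_ID},dtype=bfloat16",
    tasks=["mmlu"],             # e.g. MMLU (5-shot), as in the table above
    num_fewshot=5,
    batch_size=8,               # kept identical across all compared models
    apply_chat_template=True,   # wrap prompts with the model's chat template
    fewshot_as_multiturn=True,  # present few-shot examples as prior chat turns
)

print(results["results"]["mmlu"])
```

The equivalent command-line switches in recent harness releases are `--apply_chat_template` and `--fewshot_as_multiturn`.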