xingzhang committed
Commit ae343e1
Parent: 31c8136

update readme

Files changed (1)
  1. README.md +57 -61
README.md CHANGED
@@ -256,25 +256,22 @@ Note: Due to rounding errors caused by hardware and framework, differences in re
 
 #### C-Eval
 
-在[C-Eval](https://arxiv.org/abs/2305.08322)验证集上,我们评价了Qwen-1.8B-Chat模型的zero-shot准确率
-
-We demonstrate the zero-shot accuracy of Qwen-1.8B-Chat on C-Eval validation set
-
-| Model | Avg. Acc. |
-|:------------------------:|:---------:|
-| **Qwen-7B-Chat** | 54.2 |
-| InternLM-7B-Chat | 53.2 |
-| **Qwen-1.8B-Chat** | 55.6 |
-| ChatGLM2-6B-Chat | 50.7 |
-| Baichuan-13B-Chat | 50.4 |
-| Chinese-Alpaca-Plus-13B | 43.3 |
-| Chinese-Alpaca-2-7B | 41.3 |
-| LLaMA2-13B-Chat | 40.6 |
-| LLaMA2-7B-Chat | 31.9 |
-| OpenLLaMA-Chinese-3B | 24.4 |
-| Firefly-Bloom-1B4 | 23.6 |
-| OpenBuddy-3B | 23.5 |
-| RedPajama-INCITE-Chat-3B | 18.3 |
+在[C-Eval](https://arxiv.org/abs/2305.08322)验证集上,我们评价了Qwen-1.8B-Chat模型的准确率
+
+We demonstrate the accuracy of Qwen-1.8B-Chat on C-Eval validation set
+
+| Model | Acc. |
+|:--------------------------------:|:---------:|
+| RedPajama-INCITE-Chat-3B | 18.3 |
+| OpenBuddy-3B | 23.5 |
+| Firefly-Bloom-1B4 | 23.6 |
+| OpenLLaMA-Chinese-3B | 24.4 |
+| LLaMA2-7B-Chat | 31.9 |
+| ChatGLM2-6B-Chat | 52.6 |
+| InternLM-7B-Chat | 53.6 |
+| **Qwen-1.8B-Chat (0-shot)** | 55.6 |
+| **Qwen-7B-Chat (0-shot)** | 59.7 |
+| **Qwen-7B-Chat (5-shot)** | 59.3 |
 
 C-Eval测试集上,Qwen-1.8B-Chat模型的zero-shot准确率结果如下:
 
@@ -282,35 +279,35 @@ The zero-shot accuracy of Qwen-1.8B-Chat on C-Eval testing set is provided below
 
 | Model | Avg. | STEM | Social Sciences | Humanities | Others |
 | :---------------------: | :------: | :--: | :-------------: | :--------: | :----: |
-| **Qwen-7B-Chat** | 54.6 | 47.8 | 67.6 | 59.3 | 50.6 |
-| Baichuan-13B-Chat | 51.5 | 43.7 | 64.6 | 56.2 | 49.2 |
-| ChatGLM2-6B-Chat | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 |
-| **Qwen-1.8B-Chat** | 53.8 | 48.4 | 68.0 | 56.5 | 48.3 |
 | Chinese-Alpaca-Plus-13B | 41.5 | 36.6 | 49.7 | 43.1 | 41.2 |
 | Chinese-Alpaca-2-7B | 40.3 | - | - | - | - |
+| ChatGLM2-6B-Chat | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 |
+| Baichuan-13B-Chat | 51.5 | 43.7 | 64.6 | 56.2 | 49.2 |
+| **Qwen-1.8B-Chat** | 53.8 | 48.4 | 68.0 | 56.5 | 48.3 |
+| **Qwen-7B-Chat** | 58.6 | 53.3 | 72.1 | 62.8 | 52.0 |
 
 ### 英文评测(English Evaluation)
 
 #### MMLU
 
-[MMLU](https://arxiv.org/abs/2009.03300)评测集上,Qwen-1.8B-Chat模型的zero-shot准确率如下,效果同样在同类对齐模型中同样表现较优。
+[MMLU](https://arxiv.org/abs/2009.03300)评测集上,Qwen-1.8B-Chat模型的准确率如下,效果同样在同类对齐模型中同样表现较优。
 
-The zero-shot accuracy of Qwen-1.8B-Chat on MMLU is provided below.
+The accuracy of Qwen-1.8B-Chat on MMLU is provided below.
 The performance of Qwen-1.8B-Chat still on the top between other human-aligned models with comparable size.
 
-| Model | Avg. Acc. |
-|:------------------------:|:---------:|
-| **Qwen-7B-Chat** | 53.9 |
-| ChatGLM2-12B-Chat | 52.1 |
-| Baichuan-13B-Chat | 52.1 |
-| InternLM-7B-Chat | 50.8 |
-| LLaMA2-7B-Chat | 47.0 |
-| ChatGLM2-6B-Chat | 45.5 |
-| **Qwen-1.8B-Chat** | 43.3 |
-| OpenLLaMA-Chinese-3B | 25.7 |
-| OpenBuddy-3B | 25.5 |
-| RedPajama-INCITE-Chat-3B | 25.5 |
-| Firefly-Bloom-1B4 | 23.8 |
+| Model | Acc. |
+|:--------------------------------:|:---------:|
+| Firefly-Bloom-1B4 | 23.8 |
+| OpenBuddy-3B | 25.5 |
+| RedPajama-INCITE-Chat-3B | 25.5 |
+| OpenLLaMA-Chinese-3B | 25.7 |
+| ChatGLM2-6B-Chat | 46.0 |
+| LLaMA2-7B-Chat | 46.2 |
+| InternLM-7B-Chat | 51.1 |
+| Baichuan2-7B-Chat | 52.9 |
+| **Qwen-1.8B-Chat (0-shot)** | 43.3 |
+| **Qwen-7B-Chat (0-shot)** | 55.8 |
+| **Qwen-7B-Chat (5-shot)** | 57.0 |
 
 ### 代码评测(Coding Evaluation)
 
@@ -320,16 +317,16 @@ The zero-shot Pass@1 of Qwen-1.8B-Chat on [HumanEval](https://github.com/openai/
 
 | Model | Pass@1 |
 |:------------------------:|:------:|
-| **Qwen-7B-Chat** | 24.4 |
-| LLaMA2-13B-Chat | 18.9 |
-| Baichuan-13B-Chat | 16.5 |
-| InternLM-7B-Chat | 14.0 |
-| LLaMA2-7B-Chat | 12.2 |
-| **Qwen-1.8B-Chat** | 26.2 |
-| OpenBuddy-3B | 10.4 |
-| RedPajama-INCITE-Chat-3B | 6.1 |
-| OpenLLaMA-Chinese-3B | 4.9 |
 | Firefly-Bloom-1B4 | 0.6 |
+| OpenLLaMA-Chinese-3B | 4.9 |
+| RedPajama-INCITE-Chat-3B | 6.1 |
+| OpenBuddy-3B | 10.4 |
+| ChatGLM2-6B-Chat | 11.0 |
+| LLaMA2-7B-Chat | 12.2 |
+| Baichuan2-7B-Chat | 13.4 |
+| InternLM-7B-Chat | 14.6 |
+| **Qwen-1.8B-Chat** | 26.2 |
+| **Qwen-7B-Chat** | 37.2 |
 
 ### 数学评测(Mathematics Evaluation)
 
@@ -337,20 +334,19 @@ The zero-shot Pass@1 of Qwen-1.8B-Chat on [HumanEval](https://github.com/openai/
 
 The accuracy of Qwen-1.8B-Chat on GSM8K is shown below
 
-| Model | Zero-shot Acc. | 4-shot Acc. |
-|:------------------------:|:--------------:|:-----------:|
-| **Qwen-7B-Chat** | 41.1 | 43.5 |
-| ChatGLM2-12B-Chat | - | 38.1 |
-| Baichuan-13B-Chat | - | 36.3 |
-| InternLM-7B-Chat | 32.6 | 34.5 |
-| LLaMA2-13B-Chat | 29.4 | 36.7 |
-| **Qwen-1.8B-Chat** | 33.7 | 30.2 |
-| LLaMA2-7B-Chat | 20.4 | 28.2 |
-| ChatGLM2-6B-Chat | - | 28.0 |
-| OpenBuddy-3B | 10.6 | 12.6 |
-| OpenLLaMA-Chinese-3B | 2.6 | 3.0 |
-| RedPajama-INCITE-Chat-3B | 2.5 | 2.5 |
-| Firefly-Bloom-1B4 | 2.4 | 1.8 |
+| Model | Acc. |
+|:------------------------------------:|:--------:|
+| Firefly-Bloom-1B4 | 2.4 |
+| RedPajama-INCITE-Chat-3B | 2.5 |
+| OpenLLaMA-Chinese-3B | 3.0 |
+| OpenBuddy-3B | 12.6 |
+| LLaMA2-7B-Chat | 26.3 |
+| ChatGLM2-6B-Chat | 28.8 |
+| Baichuan2-7B-Chat | 32.8 |
+| InternLM-7B-Chat | 33.0 |
+| **Qwen-1.8B-Chat (0-shot)** | 33.7 |
+| **Qwen-7B-Chat (0-shot)** | 50.3 |
+| **Qwen-7B-Chat (8-shot)** | 54.1 |
 
 ## 评测复现(Reproduction)
 
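For context on the coding tables in this diff: HumanEval results are reported as Pass@1. The standard unbiased pass@k estimator comes from the HumanEval paper (Chen et al., 2021); the sketch below is illustrative and not part of this repository — the function name is ours.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    probability that at least one of k samples, drawn without
    replacement from n generations of which c pass the unit
    tests, is correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer failing samples than k draws: some draw must succeed.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# With a single greedy sample per problem (n = 1, k = 1), Pass@1
# reduces to the fraction of problems whose one sample passes.
print(pass_at_k(10, 3, 1))  # → 0.3
```

A benchmark score like 26.2 is then the mean of this quantity over all 164 HumanEval problems, scaled to a percentage.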