natolambert committed · verified
Commit 2f0b4fd · Parent(s): 307ea47

Update README.md

Files changed (1): README.md (+16 −15)
README.md CHANGED
@@ -104,21 +104,22 @@ See the Falcon 180B model card for an example of this.
 
 ## Performance
 
- | Benchmark (eval) | Tülu 3 SFT 8B | Tülu 3 DPO 8B | Tülu 3 8B | Llama 3.1 8B Instruct | Qwen 2.5 7B Instruct | Magpie 8B | Gemma 2 9B Instruct | Ministral 8B Instruct |
- |---------------------------------|----------------|----------------|------------|------------------------|----------------------|-----------|---------------------|-----------------------|
- | **Avg.** | 60.4 | 64.4 | **64.8** | 62.2 | 57.8 | 44.7 | 55.2 | 58.3 |
- | **MMLU (0 shot, CoT)** | 65.9 | 68.7 | 68.2 | 71.2 | **76.6** | 62.0 | 74.6 | 68.5 |
- | **PopQA (15 shot)** | **29.3** | 29.3 | 29.1 | 20.2 | 18.1 | 22.5 | 28.3 | 20.2 |
- | **TruthfulQA (6 shot)** | 46.8 | 56.1 | 55.0 | 55.1 | **63.1** | 57.0 | 61.4 | 55.5 |
- | **BigBenchHard (3 shot, CoT)** | **67.9** | 65.8 | 66.0 | 62.8 | 21.7 | 0.9 | 2.5 | 56.2 |
- | **DROP (3 shot)** | 61.3 | 62.5 | **62.6** | 61.5 | 54.4 | 49.4 | 58.8 | 56.2 |
- | **MATH (4 shot CoT, Flex)** | 31.5 | 42.0 | **43.7** | 42.5 | 14.8 | 5.1 | 29.8 | 40.0 |
- | **GSM8K (8 shot, CoT)** | 76.2 | 84.3 | **87.6** | 83.4 | 83.8 | 61.2 | 79.7 | 80.0 |
- | **HumanEval (pass@10)** | 86.2 | 83.9 | 83.9 | 86.3 | **93.1** | 75.4 | 71.7 | 91.0 |
- | **HumanEval+ (pass@10)** | 81.4 | 78.6 | 79.2 | 82.9 | **89.7** | 69.1 | 67.0 | 88.5 |
- | **IFEval (prompt loose)** | 72.8 | 81.1 | **82.4** | 80.6 | 74.7 | 38.8 | 69.9 | 56.4 |
- | **AlpacaEval 2 (LC % win)** | 12.4 | 33.5 | 34.5 | 24.2 | 29.0 | **49.0** | 43.7 | 31.4 |
- | **Safety (6 task avg.)** | **93.1** | 87.2 | 85.5 | 75.2 | 75.0 | 46.4 | 75.5 | 56.2 |
+ | Benchmark (eval) | Tülu 3 SFT 8B | Tülu 3 DPO 8B | Tülu 3 8B | **Tülu 3.1 8B (NEW)** | Llama 3.1 8B Instruct | Qwen 2.5 7B Instruct | Magpie 8B | Gemma 2 9B Instruct | Ministral 8B Instruct |
+ |---------------------------------|--------------|--------------|-----------|------------|------------------------|----------------------|-----------|---------------------|-----------------------|
+ | **Avg.** | 60.4 | 64.4 | 64.8 | 66.3 | 62.2 | **66.5** | 44.7 | 55.2 | 58.3 |
+ | **MMLU (0 shot, CoT)** | 65.9 | 68.7 | 68.2 | 69.5 | 71.2 | **76.6** | 62.0 | 74.6 | 68.5 |
+ | **PopQA (15 shot)** | 29.3 | 29.3 | 29.1 | **30.2** | 20.2 | 18.1 | 22.5 | 28.3 | 20.2 |
+ | **TruthfulQA (6 shot)** | 46.8 | 56.1 | 55.0 | 59.9 | 55.1 | **63.1** | 57.0 | 61.4 | 55.5 |
+ | **BigBenchHard (3 shot, CoT)** | 67.9 | 65.8 | 66.0 | 68.9 | 62.8 | **70.2** | 0.9 | 2.5 | 56.2 |
+ | **DROP (3 shot)** | 61.3 | 62.5 | 62.6 | **63.9** | 61.5 | 54.4 | 49.4 | 58.8 | 56.2 |
+ | **MATH (4 shot CoT, Flex)** | 31.5 | 42.0 | 43.7 | 47.8 | 42.5 | **69.9** | 5.1 | 29.8 | 40.0 |
+ | **GSM8K (8 shot, CoT)** | 76.2 | 84.3 | 87.6 | **90.0** | 83.4 | 83.8 | 61.2 | 79.7 | 80.0 |
+ | **HumanEval (pass@10)** | 86.2 | 83.9 | 83.9 | 84.8 | 86.3 | **93.1** | 75.4 | 71.7 | 91.0 |
+ | **HumanEval+ (pass@10)** | 81.4 | 78.6 | 79.2 | 80.4 | 82.9 | **89.7** | 69.1 | 67.0 | 88.5 |
+ | **IFEval (prompt loose)** | 72.8 | 81.1 | 82.4 | **83.9** | 80.6 | 74.7 | 38.8 | 69.9 | 56.4 |
+ | **AlpacaEval 2 (LC % win)** | 12.4 | 33.5 | 34.5 | 34.9 | 24.2 | 29.0 | **49.0** | 43.7 | 31.4 |
+ | **Safety (6 task avg.)** | **93.1** | 87.2 | 85.5 | 81.2 | 75.2 | 75.0 | 46.4 | 75.5 | 56.2 |
+
 
 
 *Note, see the updated version of the paper for the latest, fixed evaluations that improve scores for models such as Qwen 2.5 Instruct.*
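
As a quick sanity check on the updated table, the **Avg.** column appears to be the unweighted mean of the twelve benchmark rows beneath it (this is an assumption; the diff does not state the aggregation). A minimal Python sketch, with the scores copied from the new Tülu 3.1 8B column and a purely illustrative dictionary name, reproduces the reported average:

```python
# Sanity check (assumption): "Avg." is the unweighted mean of the 12 benchmarks.
# Scores copied from the Tülu 3.1 8B (NEW) column of the updated table.
tulu_3_1_8b = {
    "MMLU": 69.5, "PopQA": 30.2, "TruthfulQA": 59.9, "BigBenchHard": 68.9,
    "DROP": 63.9, "MATH": 47.8, "GSM8K": 90.0, "HumanEval": 84.8,
    "HumanEval+": 80.4, "IFEval": 83.9, "AlpacaEval 2": 34.9, "Safety": 81.2,
}

avg = sum(tulu_3_1_8b.values()) / len(tulu_3_1_8b)
print(f"Avg. = {avg:.1f}")  # -> Avg. = 66.3, matching the table
```

The same arithmetic reproduces the unchanged columns as well (e.g., the Tülu 3 8B rows average to 64.8), which supports reading **Avg.** as a plain mean rather than a weighted score.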