Update README.md
README.md CHANGED
@@ -104,21 +104,22 @@ See the Falcon 180B model card for an example of this.
 
 ## Performance
 
-| Benchmark (eval) | Tülu 3 SFT 8B | Tülu 3 DPO 8B | Tülu 3 8B | Llama 3.1 8B Instruct | Qwen 2.5 7B Instruct | Magpie 8B | Gemma 2 9B Instruct | Ministral 8B Instruct |
-|---------------------------------|--------------|--------------|-----------|------------------------|----------------------|-----------|---------------------|-----------------------|
-| **Avg.** | 60.4 | 64.4 | 64.8 | 62.2 | **66.5** | 44.7 | 55.2 | 58.3 |
-| **MMLU (0 shot, CoT)** | 65.9 | 68.7 | 68.2 | 71.2 | **76.6** | 62.0 | 74.6 | 68.5 |
-| **PopQA (15 shot)** | **29.3** | 29.3 | 29.1 | 20.2 | 18.1 | 22.5 | 28.3 | 20.2 |
-| **TruthfulQA (6 shot)** | 46.8 | 56.1 | 55.0 | 55.1 | **63.1** | 57.0 | 61.4 | 55.5 |
-| **BigBenchHard (3 shot, CoT)** | **67.9** | 65.8 | 66.0 | 62.8 | 70.2 | 0.9 | 2.5 | 56.2 |
-| **DROP (3 shot)** | 61.3 | 62.5 | **62.6** | 61.5 | 54.4 | 49.4 | 58.8 | 56.2 |
-| **MATH (4 shot CoT, Flex)** | 31.5 | 42.0 | 43.7 | 42.5 | **69.9** | 5.1 | 29.8 | 40.0 |
-| **GSM8K (8 shot, CoT)** | 76.2 | 84.3 | **87.6** | 83.4 | 83.8 | 61.2 | 79.7 | 80.0 |
-| **HumanEval (pass@10)** | 86.2 | 83.9 | 83.9 | 86.3 | **93.1** | 75.4 | 71.7 | 91.0 |
-| **HumanEval+ (pass@10)** | 81.4 | 78.6 | 79.2 | 82.9 | **89.7** | 69.1 | 67.0 | 88.5 |
-| **IFEval (prompt loose)** | 72.8 | 81.1 | **82.4** | 80.6 | 74.7 | 38.8 | 69.9 | 56.4 |
-| **AlpacaEval 2 (LC % win)** | 12.4 | 33.5 | 34.5 | 24.2 | 29.0 | **49.0** | 43.7 | 31.4 |
-| **Safety (6 task avg.)** | **93.1** | 87.2 | 85.5 | 75.2 | 75.0 | 46.4 | 75.5 | 56.2 |
+| Benchmark (eval) | Tülu 3 SFT 8B | Tülu 3 DPO 8B | Tülu 3 8B | **Tülu 3.1 8B (NEW)** | Llama 3.1 8B Instruct | Qwen 2.5 7B Instruct | Magpie 8B | Gemma 2 9B Instruct | Ministral 8B Instruct |
+|---------------------------------|--------------|--------------|-----------|------------|------------------------|----------------------|-----------|---------------------|-----------------------|
+| **Avg.** | 60.4 | 64.4 | 64.8 | 66.3 | 62.2 | **66.5** | 44.7 | 55.2 | 58.3 |
+| **MMLU (0 shot, CoT)** | 65.9 | 68.7 | 68.2 | 69.5 | 71.2 | **76.6** | 62.0 | 74.6 | 68.5 |
+| **PopQA (15 shot)** | **29.3** | 29.3 | 29.1 | 30.2 | 20.2 | 18.1 | 22.5 | 28.3 | 20.2 |
+| **TruthfulQA (6 shot)** | 46.8 | 56.1 | 55.0 | 59.9 | 55.1 | **63.1** | 57.0 | 61.4 | 55.5 |
+| **BigBenchHard (3 shot, CoT)** | **67.9** | 65.8 | 66.0 | 68.9 | 62.8 | 70.2 | 0.9 | 2.5 | 56.2 |
+| **DROP (3 shot)** | 61.3 | 62.5 | 62.6 | **63.9** | 61.5 | 54.4 | 49.4 | 58.8 | 56.2 |
+| **MATH (4 shot CoT, Flex)** | 31.5 | 42.0 | 43.7 | 47.8 | 42.5 | **69.9** | 5.1 | 29.8 | 40.0 |
+| **GSM8K (8 shot, CoT)** | 76.2 | 84.3 | 87.6 | **90.0** | 83.4 | 83.8 | 61.2 | 79.7 | 80.0 |
+| **HumanEval (pass@10)** | 86.2 | 83.9 | 83.9 | 84.8 | 86.3 | **93.1** | 75.4 | 71.7 | 91.0 |
+| **HumanEval+ (pass@10)** | 81.4 | 78.6 | 79.2 | 80.4 | 82.9 | **89.7** | 69.1 | 67.0 | 88.5 |
+| **IFEval (prompt loose)** | 72.8 | 81.1 | 82.4 | **83.9** | 80.6 | 74.7 | 38.8 | 69.9 | 56.4 |
+| **AlpacaEval 2 (LC % win)** | 12.4 | 33.5 | 34.5 | 34.9 | 24.2 | 29.0 | **49.0** | 43.7 | 31.4 |
+| **Safety (6 task avg.)** | **93.1** | 87.2 | 85.5 | 81.2 | 75.2 | 75.0 | 46.4 | 75.5 | 56.2 |
+
 
 *Note, see the updated version of the paper for the latest, fixed evaluations that improve scores for models such as Qwen 2.5 Instruct.*
 
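As a quick sanity check on the updated table, the **Avg.** column matches a plain unweighted mean of the twelve benchmark rows. A minimal sketch, assuming exactly that averaging scheme (the scores are copied from the new Tülu 3.1 8B column above):

```python
# Scores copied from the Tülu 3.1 8B column of the table above.
# Assumption: the "Avg." row is the unweighted mean of these twelve rows.
scores = {
    "MMLU": 69.5, "PopQA": 30.2, "TruthfulQA": 59.9,
    "BigBenchHard": 68.9, "DROP": 63.9, "MATH": 47.8,
    "GSM8K": 90.0, "HumanEval": 84.8, "HumanEval+": 80.4,
    "IFEval": 83.9, "AlpacaEval 2": 34.9, "Safety": 81.2,
}

avg = sum(scores.values()) / len(scores)
print(f"Avg. = {avg:.1f}")  # -> Avg. = 66.3, matching the table
```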
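The HumanEval rows report pass@10. The standard way to compute pass@k is the unbiased estimator from the original HumanEval paper (Chen et al., 2021); whether this model's eval suite uses this exact estimator is an assumption, so treat the sketch below as an illustration of the metric rather than the project's implementation:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k samples, drawn without replacement from n generations
    of which c are correct, passes the unit tests."""
    if n - c < k:  # every size-k draw must contain a correct sample
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical numbers: 20 generations for one problem, 5 of them correct.
print(round(pass_at_k(n=20, c=5, k=10), 3))  # -> 0.984
```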