Update README.md
README.md CHANGED
@@ -104,21 +104,22 @@ See the Falcon 180B model card for an example of this.
 
 ## Performance
 
-| Benchmark (eval) | Tülu 3 SFT 8B | Tülu 3 DPO 8B | Tülu 3 8B | Llama 3.1 8B Instruct | Qwen 2.5 7B Instruct | Magpie 8B | Gemma 2 9B Instruct | Ministral 8B Instruct |
-|---------------------------------|--------------|--------------|-----------|------------------------|----------------------|-----------|---------------------|-----------------------|
-| **Avg.** | 60.4 | 64.4 | 64.8 | 62.2 | **66.5** | 44.7 | 55.2 | 58.3 |
-| **MMLU (0 shot, CoT)** | 65.9 | 68.7 | 68.2 | 71.2 | **76.6** | 62.0 | 74.6 | 68.5 |
-| **PopQA (15 shot)** | **29.3** | 29.3 | 29.1 | 20.2 | 18.1 | 22.5 | 28.3 | 20.2 |
-| **TruthfulQA (6 shot)** | 46.8 | 56.1 | 55.0 | 55.1 | **63.1** | 57.0 | 61.4 | 55.5 |
-| **BigBenchHard (3 shot, CoT)** | **67.9** | 65.8 | 66.0 | 62.8 | 70.2 | 0.9 | 2.5 | 56.2 |
-| **DROP (3 shot)** | 61.3 | 62.5 | **62.6** | 61.5 | 54.4 | 49.4 | 58.8 | 56.2 |
-| **MATH (4 shot CoT, Flex)** | 31.5 | 42.0 | 43.7 | 42.5 | **69.9** | 5.1 | 29.8 | 40.0 |
-| **GSM8K (8 shot, CoT)** | 76.2 | 84.3 | **87.6** | 83.4 | 83.8 | 61.2 | 79.7 | 80.0 |
-| **HumanEval (pass@10)** | 86.2 | 83.9 | 83.9 | 86.3 | **93.1** | 75.4 | 71.7 | 91.0 |
-| **HumanEval+ (pass@10)** | 81.4 | 78.6 | 79.2 | 82.9 | **89.7** | 69.1 | 67.0 | 88.5 |
-| **IFEval (prompt loose)** | 72.8 | 81.1 | **82.4** | 80.6 | 74.7 | 38.8 | 69.9 | 56.4 |
-| **AlpacaEval 2 (LC % win)** | 12.4 | 33.5 | 34.5 | 24.2 | 29.0 | **49.0** | 43.7 | 31.4 |
-| **Safety (6 task avg.)** | **93.1** | 87.2 | 85.5 | 75.2 | 75.0 | 46.4 | 75.5 | 56.2 |
+| Benchmark (eval) | Tülu 3 SFT 8B | Tülu 3 DPO 8B | Tülu 3 8B | **Tülu 3.1 8B (NEW)** | Llama 3.1 8B Instruct | Qwen 2.5 7B Instruct | Magpie 8B | Gemma 2 9B Instruct | Ministral 8B Instruct |
+|---------------------------------|--------------|--------------|-----------|------------|------------------------|----------------------|-----------|---------------------|-----------------------|
+| **Avg.** | 60.4 | 64.4 | 64.8 | 66.3 | 62.2 | **66.5** | 44.7 | 55.2 | 58.3 |
+| **MMLU (0 shot, CoT)** | 65.9 | 68.7 | 68.2 | 69.5 | 71.2 | **76.6** | 62.0 | 74.6 | 68.5 |
+| **PopQA (15 shot)** | **29.3** | 29.3 | 29.1 | 30.2 | 20.2 | 18.1 | 22.5 | 28.3 | 20.2 |
+| **TruthfulQA (6 shot)** | 46.8 | 56.1 | 55.0 | 59.9 | 55.1 | **63.1** | 57.0 | 61.4 | 55.5 |
+| **BigBenchHard (3 shot, CoT)** | **67.9** | 65.8 | 66.0 | 68.9 | 62.8 | 70.2 | 0.9 | 2.5 | 56.2 |
+| **DROP (3 shot)** | 61.3 | 62.5 | 62.6 | **63.9** | 61.5 | 54.4 | 49.4 | 58.8 | 56.2 |
+| **MATH (4 shot CoT, Flex)** | 31.5 | 42.0 | 43.7 | 47.8 | 42.5 | **69.9** | 5.1 | 29.8 | 40.0 |
+| **GSM8K (8 shot, CoT)** | 76.2 | 84.3 | 87.6 | **90.0** | 83.4 | 83.8 | 61.2 | 79.7 | 80.0 |
+| **HumanEval (pass@10)** | 86.2 | 83.9 | 83.9 | 84.8 | 86.3 | **93.1** | 75.4 | 71.7 | 91.0 |
+| **HumanEval+ (pass@10)** | 81.4 | 78.6 | 79.2 | 80.4 | 82.9 | **89.7** | 69.1 | 67.0 | 88.5 |
+| **IFEval (prompt loose)** | 72.8 | 81.1 | 82.4 | **83.9** | 80.6 | 74.7 | 38.8 | 69.9 | 56.4 |
+| **AlpacaEval 2 (LC % win)** | 12.4 | 33.5 | 34.5 | 34.9 | 24.2 | 29.0 | **49.0** | 43.7 | 31.4 |
+| **Safety (6 task avg.)** | **93.1** | 87.2 | 85.5 | 81.2 | 75.2 | 75.0 | 46.4 | 75.5 | 56.2 |
+
 
 *Note, see the updated version of the paper for the latest, fixed evaluations that improve scores for models such as Qwen 2.5 Instruct.*
 
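As a quick sanity check on the updated table, the **Avg.** column matches a plain unweighted mean of the twelve benchmark rows. A minimal sketch, assuming exactly that averaging scheme (the scores are copied from the new Tülu 3.1 8B column above):

```python
# Scores copied from the Tülu 3.1 8B column of the table above.
# Assumption: the "Avg." row is the unweighted mean of these twelve rows.
scores = {
    "MMLU": 69.5, "PopQA": 30.2, "TruthfulQA": 59.9,
    "BigBenchHard": 68.9, "DROP": 63.9, "MATH": 47.8,
    "GSM8K": 90.0, "HumanEval": 84.8, "HumanEval+": 80.4,
    "IFEval": 83.9, "AlpacaEval 2": 34.9, "Safety": 81.2,
}

avg = sum(scores.values()) / len(scores)
print(f"Avg. = {avg:.1f}")  # -> Avg. = 66.3, matching the table
```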
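The HumanEval rows report pass@10. The standard way to compute pass@k is the unbiased estimator from the original HumanEval paper (Chen et al., 2021); whether this model's eval suite uses this exact estimator is an assumption, so treat the sketch below as an illustration of the metric rather than the project's implementation:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k samples, drawn without replacement from n generations
    of which c are correct, passes the unit tests."""
    if n - c < k:  # every size-k draw must contain a correct sample
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical numbers: 20 generations for one problem, 5 of them correct.
print(round(pass_at_k(n=20, c=5, k=10), 3))  # -> 0.984
```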