Update README.md
README.md
CHANGED
@@ -29,18 +29,16 @@ For more information about AceMath, check our [website](https://research.nvidia.
## Benchmark Results

-| AceMath-7B-Instruct (Ours) | 93.71 | 83.14 | 51.11 | 68.05 | 42.22 | 56.64 | 75.32 | 67.17 |
-| AceMath-72B-Instruct (Ours) | 96.44 | 86.10 | 56.99 | 72.21 | 48.44 | 57.24 | 85.44 | 71.84 |
+| | GPT-4o (2024-0806) | Claude-3.5 Sonnet (2024-1022) | Llama3.1-405B-Instruct | Qwen2.5-Math-1.5B-Instruct | Qwen2.5-Math-7B-Instruct | Qwen2.5-Math-72B-Instruct | AceMath-1.5B-Instruct | AceMath-7B-Instruct | AceMath-72B-Instruct |
+| -------------- |:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
+| GSM8K | 92.90 | 96.40 | 96.80 | 84.80 | 95.20 | 95.90 | 86.95 | 93.71 | 96.44 |
+| MATH | 81.10 | 75.90 | 73.80 | 75.80 | 83.60 | 85.90 | 76.84 | 83.14 | 86.10 |
+| Minerva Math | 50.74 | 48.16 | 54.04 | 29.40 | 37.10 | 44.10 | 41.54 | 51.11 | 56.99 |
+| GaoKao 2023En | 67.50 | 64.94 | 62.08 | 65.50 | 66.80 | 71.90 | 64.42 | 68.05 | 72.21 |
+| Olympiad Bench | 43.30 | 37.93 | 34.81 | 38.10 | 41.60 | 49.00 | 33.78 | 42.22 | 48.44 |
+| College Math | 48.50 | 48.47 | 49.25 | 47.70 | 46.80 | 49.50 | 54.36 | 56.64 | 57.24 |
+| MMLU STEM | 87.99 | 85.06 | 83.10 | 57.50 | 71.90 | 80.80 | 62.04 | 75.32 | 85.44 |
+| Average | 67.43 | 65.27 | 64.84 | 56.97 | 63.29 | 68.16 | 59.99 | 67.17 | 71.84 |
Greedy decoding (pass@1) results on a variety of math reasoning benchmarks. AceMath-7B-Instruct significantly outperforms the previous best-in-class Qwen2.5-Math-7B-Instruct (67.2 vs. 62.9) and comes close to the performance of the 10× larger Qwen2.5-Math-72B-Instruct (67.2 vs. 68.2). Notably, our AceMath-72B-Instruct outperforms the state-of-the-art Qwen2.5-Math-72B-Instruct (71.8 vs. 68.2), GPT-4o (67.4) and Claude 3.5 Sonnet (65.6) by a clear margin.
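As a quick sanity check on the new table, the Average column is consistent with an unweighted mean of the seven benchmark scores. A minimal Python sketch (benchmark values copied from the AceMath rows above; the unweighted averaging scheme is an assumption, not stated in this diff):

```python
# Assumed: "Average" is the unweighted mean over the seven benchmarks
# (GSM8K, MATH, Minerva Math, GaoKao 2023En, Olympiad Bench, College Math, MMLU STEM).
scores = {
    "AceMath-7B-Instruct":  [93.71, 83.14, 51.11, 68.05, 42.22, 56.64, 75.32],
    "AceMath-72B-Instruct": [96.44, 86.10, 56.99, 72.21, 48.44, 57.24, 85.44],
}
for model, vals in scores.items():
    # -> 67.17 and 71.84, matching the Average row in the table
    print(model, round(sum(vals) / len(vals), 2))
```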
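The caption describes greedy (pass@1) evaluation: one deterministic completion per problem, scored once. A rough sketch of that setup with Hugging Face Transformers; the Hub model id, chat-template usage, example question, and generation length are assumptions rather than details taken from this README:

```python
# Rough sketch of greedy (pass@1) generation with Transformers.
# Assumptions: the model is published as "nvidia/AceMath-7B-Instruct" on the Hub,
# it ships a chat template, and 1024 new tokens suffice for a full solution.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/AceMath-7B-Instruct"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

question = "Compute 15% of 240."  # placeholder problem, not from any benchmark
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# do_sample=False => greedy decoding; a single attempt per problem is scored (pass@1).
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```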