Update README.md
README.md
CHANGED
@@ -29,18 +29,16 @@ For more information about AceMath, check our [website](https://research.nvidia.
## Benchmark Results

-| AceMath-7B-Instruct (Ours) | 93.71 | 83.14 | 51.11 | 68.05 | 42.22 | 56.64 | 75.32 | 67.17 |
-| AceMath-72B-Instruct (Ours) | 96.44 | 86.10 | 56.99 | 72.21 | 48.44 | 57.24 | 85.44 | 71.84 |
+| | GPT-4o (2024-0806) | Claude-3.5 Sonnet (2024-1022) | Llama3.1-405B-Instruct | Qwen2.5-Math-1.5B-Instruct | Qwen2.5-Math-7B-Instruct | Qwen2.5-Math-72B-Instruct | AceMath-1.5B-Instruct | AceMath-7B-Instruct | AceMath-72B-Instruct |
+| -------------- |:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
+| GSM8K | 92.90 | 96.40 | 96.80 | 84.80 | 95.20 | 95.90 | 86.95 | 93.71 | 96.44 |
+| MATH | 81.10 | 75.90 | 73.80 | 75.80 | 83.60 | 85.90 | 76.84 | 83.14 | 86.10 |
+| Minerva Math | 50.74 | 48.16 | 54.04 | 29.40 | 37.10 | 44.10 | 41.54 | 51.11 | 56.99 |
+| GaoKao 2023En | 67.50 | 64.94 | 62.08 | 65.50 | 66.80 | 71.90 | 64.42 | 68.05 | 72.21 |
+| Olympiad Bench | 43.30 | 37.93 | 34.81 | 38.10 | 41.60 | 49.00 | 33.78 | 42.22 | 48.44 |
+| College Math | 48.50 | 48.47 | 49.25 | 47.70 | 46.80 | 49.50 | 54.36 | 56.64 | 57.24 |
+| MMLU STEM | 87.99 | 85.06 | 83.10 | 57.50 | 71.90 | 80.80 | 62.04 | 75.32 | 85.44 |
+| Average | 67.43 | 65.27 | 64.84 | 56.97 | 63.29 | 68.16 | 59.99 | 67.17 | 71.84 |
Greedy decoding (pass@1) results on a variety of math reasoning benchmarks. AceMath-7B-Instruct significantly outperforms the previous best-in-class Qwen2.5-Math-7B-Instruct (67.2 vs. 62.9) and comes close to the performance of the 10× larger Qwen2.5-Math-72B-Instruct (67.2 vs. 68.2). Notably, our AceMath-72B-Instruct outperforms the state-of-the-art Qwen2.5-Math-72B-Instruct (71.8 vs. 68.2), GPT-4o (67.4) and Claude 3.5 Sonnet (65.6) by a clear margin.
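As a quick sanity check on the new table, the Average column is consistent with an unweighted mean of the seven benchmark scores. A minimal Python sketch (benchmark values copied from the AceMath rows above; the unweighted averaging scheme is an assumption, not stated in this diff):

```python
# Assumed: "Average" is the unweighted mean over the seven benchmarks
# (GSM8K, MATH, Minerva Math, GaoKao 2023En, Olympiad Bench, College Math, MMLU STEM).
scores = {
    "AceMath-7B-Instruct":  [93.71, 83.14, 51.11, 68.05, 42.22, 56.64, 75.32],
    "AceMath-72B-Instruct": [96.44, 86.10, 56.99, 72.21, 48.44, 57.24, 85.44],
}
for model, vals in scores.items():
    # -> 67.17 and 71.84, matching the Average row in the table
    print(model, round(sum(vals) / len(vals), 2))
```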
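The caption describes greedy (pass@1) evaluation: one deterministic completion per problem, scored once. A rough sketch of that setup with Hugging Face Transformers; the Hub model id, chat-template usage, example question, and generation length are assumptions rather than details taken from this README:

```python
# Rough sketch of greedy (pass@1) generation with Transformers.
# Assumptions: the model is published as "nvidia/AceMath-7B-Instruct" on the Hub,
# it ships a chat template, and 1024 new tokens suffice for a full solution.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/AceMath-7B-Instruct"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

question = "Compute 15% of 240."  # placeholder problem, not from any benchmark
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# do_sample=False => greedy decoding; a single attempt per problem is scored (pass@1).
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```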