zihanliu committed · Commit 5cb7d87 · verified · Parent(s): b36d952

Update README.md

Files changed (1): README.md (+13, -15)
README.md CHANGED
@@ -17,31 +17,30 @@ The AceMath-1.5B/7B/72B-Instruct models excel at solving English mathematical pr
 
 The AceMath-1.5B/7B/72B-Instruct models are developed from the Qwen2.5-Math-1.5B/7B/72B-Base models, leveraging a multi-stage supervised fine-tuning (SFT) process: first with general-purpose SFT data, followed by math-specific SFT data. We are releasing all training data to support further research in this field.
 
+We recommend using the AceMath models only for solving math problems. To support other tasks, we also release AceInstruct-1.5B/7B/72B, a series of general-purpose SFT models designed to handle code, math, and general-knowledge tasks. These models are built upon the Qwen2.5-1.5B/7B/72B-Base models.
+
 For more information about AceMath, check our [website](https://research.nvidia.com/labs/adlr/acemath/) and [paper](https://arxiv.org/abs/2412.15084).
 
+
 ## All Resources
 [AceMath-1.5B-Instruct](https://huggingface.co/nvidia/AceMath-1.5B-Instruct)   [AceMath-7B-Instruct](https://huggingface.co/nvidia/AceMath-7B-Instruct)   [AceMath-72B-Instruct](https://huggingface.co/nvidia/AceMath-72B-Instruct)
 
 [AceMath-7B-RM](https://huggingface.co/nvidia/AceMath-7B-RM)   [AceMath-72B-RM](https://huggingface.co/nvidia/AceMath-72B-RM)
 
-[AceMath-Instruct Training Data](https://huggingface.co/datasets/nvidia/AceMath-Instruct-Training-Data)   [AceMath-RM Training Data](https://huggingface.co/datasets/nvidia/AceMath-RM-Training-Data)
-
-[AceMath-RewardBench](https://huggingface.co/datasets/nvidia/AceMath-RewardBench)   [AceMath Evaluation Script](https://huggingface.co/datasets/nvidia/AceMath-RewardBench/tree/main/scripts)
+[AceMath-Instruct Training Data](https://huggingface.co/datasets/nvidia/AceMath-Instruct-Training-Data)   [AceMath-RM Training Data](https://huggingface.co/datasets/nvidia/AceMath-RM-Training-Data)
+
+[AceMath-RewardBench](https://huggingface.co/datasets/nvidia/AceMath-RewardBench)   [AceMath-Instruct Evaluation Script](https://huggingface.co/datasets/nvidia/AceMath-Evaluation-Script)
+
+[AceInstruct-1.5B](https://huggingface.co/nvidia/AceInstruct-1.5B)   [AceInstruct-7B](https://huggingface.co/nvidia/AceInstruct-7B)   [AceInstruct-72B](https://huggingface.co/nvidia/AceInstruct-72B)
 
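With the released training data linked above, the multi-stage SFT recipe described in the introduction (a general-purpose stage, then a math-specific stage continuing from the same weights) could look roughly as follows. This is a minimal sketch using the standard Hugging Face `Trainer`; the dataset paths, the `text` field, and all hyperparameters are illustrative assumptions, not the actual AceMath training setup.

```python
# Sketch of two-stage SFT: general-purpose SFT first, then math-specific SFT
# continuing from the same weights. Paths, fields, and hyperparameters are
# placeholders, not taken from the AceMath paper or training data.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE = "Qwen/Qwen2.5-Math-7B"  # AceMath-7B starts from the Qwen2.5-Math-7B base
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

def tokenize(batch):
    # Assumes each example carries a pre-formatted "text" field.
    return tokenizer(batch["text"], truncation=True, max_length=4096)

def run_sft_stage(dataset_path: str, output_dir: str) -> None:
    ds = load_dataset(dataset_path, split="train").map(tokenize, batched=True)
    trainer = Trainer(
        model=model,  # the same model object is fine-tuned in both stages
        args=TrainingArguments(output_dir=output_dir, num_train_epochs=1,
                               per_device_train_batch_size=1, bf16=True),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

# Stage 1: general-purpose SFT; Stage 2: math-specific SFT on the same model.
run_sft_stage("path/to/general_sft_data", "ckpt/stage1")
run_sft_stage("path/to/math_sft_data", "ckpt/stage2")
```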
 ## Benchmark Results
 
-|                | GPT-4o (2024-0806) | Claude-3.5 Sonnet (2024-1022) | Llama3.1-405B-Instruct | Qwen2.5-Math-1.5B-Instruct | Qwen2.5-Math-7B-Instruct | Qwen2.5-Math-72B-Instruct | AceMath-1.5B-Instruct | AceMath-7B-Instruct | AceMath-72B-Instruct |
-| -------------- |:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
-| GSM8K          | 92.90 | 96.40 | 96.80 | 84.80 | 95.20 | 95.90 | 86.95 | 93.71 | 96.44 |
-| MATH           | 81.10 | 75.90 | 73.80 | 75.80 | 83.60 | 85.90 | 76.84 | 83.14 | 86.10 |
-| Minerva Math   | 50.74 | 48.16 | 54.04 | 29.40 | 37.10 | 44.10 | 41.54 | 51.11 | 56.99 |
-| GaoKao 2023En  | 67.50 | 64.94 | 62.08 | 65.50 | 66.80 | 71.90 | 64.42 | 68.05 | 72.21 |
-| Olympiad Bench | 43.30 | 37.93 | 34.81 | 38.10 | 41.60 | 49.00 | 33.78 | 42.22 | 48.44 |
-| College Math   | 48.50 | 48.47 | 49.25 | 47.70 | 46.80 | 49.50 | 54.36 | 56.64 | 57.24 |
-| MMLU STEM      | 87.99 | 85.06 | 83.10 | 57.50 | 71.90 | 80.80 | 62.04 | 75.32 | 85.44 |
-| Average        | 67.43 | 65.27 | 64.84 | 56.97 | 63.29 | 68.16 | 59.99 | 67.17 | 71.84 |
-
-Greedy decoding (pass@1) results on a variety of math reasoning benchmarks. AceMath-7B-Instruct significantly outperforms the previous best-in-class Qwen2.5-Math-7B-Instruct (67.2 vs. 62.9) and comes close to the performance of the 10× larger Qwen2.5-Math-72B-Instruct (67.2 vs. 68.2). Notably, our AceMath-72B-Instruct outperforms the state-of-the-art Qwen2.5-Math-72B-Instruct (71.8 vs. 68.2), GPT-4o (67.4) and Claude 3.5 Sonnet (65.6) by a clear margin.
+<p align="center">
+<img src="https://research.nvidia.com/labs/adlr/images/acemath/acemath.png" alt="AceMath Benchmark Results" width="800">
+</p>
+
+We compare AceMath to leading proprietary and open-access math models in the table above. Our AceMath-7B-Instruct largely outperforms the previous best-in-class Qwen2.5-Math-7B-Instruct (average pass@1: 67.2 vs. 62.9) on a variety of math reasoning benchmarks, while coming close to the performance of the 10× larger Qwen2.5-Math-72B-Instruct (67.2 vs. 68.2). Notably, our AceMath-72B-Instruct outperforms the state-of-the-art Qwen2.5-Math-72B-Instruct (71.8 vs. 68.2), GPT-4o (67.4), and Claude 3.5 Sonnet (65.6) by a clear margin. We also report the rm@8 accuracy (best of 8) achieved by our reward model, AceMath-72B-RM, which sets a new record on these reasoning benchmarks; this excludes OpenAI's o1 model, which relies on scaled inference-time computation.
 
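To make the rm@8 (best-of-8) metric concrete: sample eight candidate solutions per question, score each with the reward model, and keep the highest-scoring one; rm@8 accuracy is then the fraction of questions whose selected answer is correct. A minimal selection sketch follows; `generate_candidate` and `score_candidate` are hypothetical stand-ins, since the exact scoring interface is defined by the AceMath-RM model cards rather than shown in this diff.

```python
# Sketch of rm@8 (best-of-n) selection: generate n candidate solutions,
# score each with a reward model, and keep the top-scoring answer.
# `generate_candidate` and `score_candidate` are hypothetical stand-ins.
from typing import Callable, List

def best_of_n(
    question: str,
    generate_candidate: Callable[[str], str],       # sampled solution for a question
    score_candidate: Callable[[str, str], float],   # reward for (question, solution)
    n: int = 8,
) -> str:
    candidates: List[str] = [generate_candidate(question) for _ in range(n)]
    scores = [score_candidate(question, c) for c in candidates]
    return candidates[max(range(n), key=scores.__getitem__)]
```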
  ## How to use
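The body of this section falls outside the hunks shown in this diff. For reference, a minimal inference sketch following the standard Hugging Face `transformers` chat-template pattern; the prompt and generation settings are illustrative, not taken from the README.

```python
# Minimal greedy-decoding sketch for AceMath-7B-Instruct using the standard
# transformers chat-template API. Prompt and settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nvidia/AceMath-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Compute 1 + 2 + ... + 100."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding, matching the pass@1 setting reported in the benchmarks.
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```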
@@ -91,5 +90,4 @@ If you find our work helpful, we’d appreciate it if you could cite us.
 
 
 ## License
-All models in the AceMath family are for non-commercial use only, subject to [Terms of Use](https://openai.com/policies/row-terms-of-use/) of the data generated by OpenAI. We put the AceMath models under the license of [Creative Commons Attribution: Non-Commercial 4.0 International](https://spdx.org/licenses/CC-BY-NC-4.0).
-
+All models in the AceMath family are for non-commercial use only, subject to the [Terms of Use](https://openai.com/policies/row-terms-of-use/) of the data generated by OpenAI. We release the AceMath models under the [Creative Commons Attribution: Non-Commercial 4.0 International](https://spdx.org/licenses/CC-BY-NC-4.0) license.