ychenNLP committed on
Commit
37fbeb4
·
verified ·
1 Parent(s): 29afed7

Upload README.md

Files changed (1)
  1. README.md +20 -11
README.md CHANGED
@@ -15,28 +15,37 @@ tags:
15
  We introduce AceMath, a family of frontier models designed for mathematical reasoning. The models in the AceMath family, including AceMath-1.5B/7B/72B-Instruct and AceMath-7B/72B-RM, are <b>Improved using Qwen</b>.
16
  The AceMath-1.5B/7B/72B-Instruct models excel at solving English mathematical problems using Chain-of-Thought (CoT) reasoning, while the AceMath-7B/72B-RM models, as outcome reward models, specialize in evaluating and scoring mathematical solutions.
17
 
18
- The AceMath-1.5B/7B/72B-Instruct models are developed from the Qwen2.5-Math-1.5B/7B/72B-Base models, leveraging a multi-stage supervised fine-tuning (SFT) process: first with general-purpose SFT data, followed by math-specific SFT data. We are releasing all training data to support further research in this field.
19
 
20
  For more information about AceMath, check our [website](https://research.nvidia.com/labs/adlr/acemath/) and [paper](https://arxiv.org/abs/2412.15084).
21
 
22
  ## All Resources
23
- [AceMath-1.5B-Instruct](https://huggingface.co/nvidia/AceMath-1.5B-Instruct) &ensp; [AceMath-7B-Instruct](https://huggingface.co/nvidia/AceMath-7B-Instruct) &ensp; [AceMath-72B-Instruct](https://huggingface.co/nvidia/AceMath-72B-Instruct)
24
 
25
- [AceMath-7B-RM](https://huggingface.co/nvidia/AceMath-7B-RM) &ensp; [AceMath-72B-RM](https://huggingface.co/nvidia/AceMath-72B-RM)
 
26
 
27
- [AceMath-Instruct Training Data](https://huggingface.co/datasets/nvidia/AceMath-Instruct-Training-Data) &ensp; [AceMath-RM Training Data](https://huggingface.co/datasets/nvidia/AceMath-RM-Training-Data)
 
28
 
29
- [AceMath-RewardBench](https://huggingface.co/datasets/nvidia/AceMath-RewardBench) &ensp; [AceMath Evaluation Script](https://huggingface.co/datasets/nvidia/AceMath-RewardBench/tree/main/scripts)
 
30
 
31
- ## Benchmark Results
 
32
 
33
- <p align="center">
34
- <img src="https://research.nvidia.com/labs/adlr/images/acemath/acemath.png" alt="AceMath Benchmark Results" width="800">
35
- </p>
36
 
37
 
38
- Greedy decoding (pass@1) results on a variety of math reasoning benchmarks. AceMath-7B-Instruct significantly outperforms the previous best-in-class Qwen2.5-Math-7B-Instruct (67.2 vs. 62.9) and comes close to the performance of the 10× larger Qwen2.5-Math-72B-Instruct (67.2 vs. 68.2). Notably, our AceMath-72B-Instruct outperforms the state-of-the-art Qwen2.5-Math-72B-Instruct (71.8 vs. 68.2), GPT-4o (67.4), and Claude 3.5 Sonnet (65.6).
39
 
40
 
41
  ## How to use
42
  ```python
@@ -82,7 +91,7 @@ input_ids = tokenizer.encode(
82
  ).to(model.device)
83
 
84
  outputs = model(input_ids=input_ids)
85
- print(outputs[0][0])
86
  ```
87
 
88
 
 
15
  We introduce AceMath, a family of frontier models designed for mathematical reasoning. The models in the AceMath family, including AceMath-1.5B/7B/72B-Instruct and AceMath-7B/72B-RM, are <b>Improved using Qwen</b>.
16
  The AceMath-1.5B/7B/72B-Instruct models excel at solving English mathematical problems using Chain-of-Thought (CoT) reasoning, while the AceMath-7B/72B-RM models, as outcome reward models, specialize in evaluating and scoring mathematical solutions.
17
 
18
+ The AceMath-7B/72B-RM models are developed from the AceMath-7B/72B-Instruct models and trained on AceMath-RM-Training-Data using Bradley-Terry loss. The architecture employs standard sequence classification with a linear layer on top of the language model, using the final token to output a scalar score.
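
To make the training objective mentioned above concrete, here is a minimal sketch of a pairwise Bradley-Terry loss on two scalar reward scores. It assumes the common pairwise form, `-log(sigmoid(score_chosen - score_rejected))`; the function name and arguments are hypothetical, and the exact formulation used to train AceMath-RM (for example, how positives and negatives are grouped) is described in the paper, not here.

```python
import math


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise Bradley-Terry loss: -log(sigmoid(score_chosen - score_rejected)).

    Both inputs are scalar reward scores, e.g. the output of a linear head
    applied to the final token's hidden state (hypothetical usage).
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() is never called on a large positive number.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the correct solution is scored further above the wrong one.
print(bradley_terry_loss(2.0, 0.0) < bradley_terry_loss(0.0, 2.0))
```

The loss depends only on the score difference, so the reward scale is arbitrary; only the ranking of candidate solutions matters at inference time.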
19
 
20
  For more information about AceMath, check our [website](https://research.nvidia.com/labs/adlr/acemath/) and [paper](https://arxiv.org/abs/2412.15084).
21
 
22
  ## All Resources
 
23
 
24
+ ### AceMath Instruction Models
25
+ - [AceMath-1.5B-Instruct](https://huggingface.co/nvidia/AceMath-1.5B-Instruct), [AceMath-7B-Instruct](https://huggingface.co/nvidia/AceMath-7B-Instruct), [AceMath-72B-Instruct](https://huggingface.co/nvidia/AceMath-72B-Instruct)
26
 
27
+ ### AceMath Reward Models
28
+ - [AceMath-7B-RM](https://huggingface.co/nvidia/AceMath-7B-RM), [AceMath-72B-RM](https://huggingface.co/nvidia/AceMath-72B-RM)
29
 
30
+ ### Evaluation & Training Data
31
+ - [AceMath-RewardBench](https://huggingface.co/datasets/nvidia/AceMath-RewardBench), [AceMath-Instruct Training Data](https://huggingface.co/datasets/nvidia/AceMath-Instruct-Training-Data), [AceMath-RM Training Data](https://huggingface.co/datasets/nvidia/AceMath-RM-Training-Data)
32
 
33
+ ### Base Models
34
+ - [AceInstruct-1.5B](https://huggingface.co/nvidia/AceInstruct-1.5B), [AceInstruct-7B](https://huggingface.co/nvidia/AceInstruct-7B), [AceInstruct-72B](https://huggingface.co/nvidia/AceInstruct-72B)
35
 
36
 
37
 
38
+ ## Reward Model Benchmark Results
39
 
40
+ | Model | GSM8K | MATH500 | Minerva Math | GaoKao 2023 En | Olympiad Bench | College Math | MMLU STEM | Avg. |
41
+ |---------------------------|-------|---------|--------------|----------------|-----------------|--------------|-----------|--------|
42
+ | majority@8 | 96.22 | 83.11 | 41.20 | 68.21 | 42.69 | 45.01 | 78.21 | 64.95 |
43
+ | Skywork-o1-Open-PRM-Qwen-2.5-7B | 96.92 | 86.64 | 41.00 | 72.34 | 46.50 | 46.30 | 74.01 | 66.24 |
44
+ | Qwen2.5-Math-RM-72B | 96.61 | 86.63 | 43.60 | 73.62 | 47.21 | 47.29 | 84.24 | 68.46 |
45
+ | AceMath-7B-RM (Ours) | 96.66 | 85.47 | 41.96 | 73.82 | 46.81 | 46.37 | 80.78 | 67.41 |
46
+ | AceMath-72B-RM (Ours) | 97.23 | 86.72 | 45.06 | 74.69 | 49.23 | 46.79 | 87.01 | 69.53 |
47
+
48
+ *Reward model evaluation on AceMath-RewardBench. The table reports the average results (rm@8) of reward models on math benchmarks, obtained by randomly sampling 8 responses from 64 candidates, repeated with 100 random seeds. Response candidates are generated from a pool of 8 LLMs.*
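
The rm@8 protocol described in the caption can be sketched for a single problem as follows. This is an illustration under stated assumptions, not the released evaluation script: the function name `rm_at_k`, the `(reward_score, is_correct)` tuple layout, and the use of seeds `0..99` are all hypothetical choices.

```python
import random
from statistics import mean


def rm_at_k(candidates, k=8, num_seeds=100):
    """Average best-of-k correctness for one problem.

    candidates: list of (reward_score, is_correct) pairs, one per response
    (assumed layout; e.g. 64 responses per problem).
    """
    hits = []
    for seed in range(num_seeds):
        rng = random.Random(seed)          # one independent draw per seed
        subset = rng.sample(candidates, k)  # k responses without replacement
        best = max(subset, key=lambda c: c[0])  # highest reward score wins
        hits.append(1.0 if best[1] else 0.0)
    return mean(hits)


# Toy pool: 64 responses, correct ones happen to receive higher scores.
pool = [(float(i), i >= 48) for i in range(64)]
print(rm_at_k(pool))
```

Averaging this value over all problems in a benchmark would give one cell of the table; averaging over many seeds reduces the variance introduced by which 8 responses happen to be drawn.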
49
 
50
  ## How to use
51
  ```python
 
91
  ).to(model.device)
92
 
93
  outputs = model(input_ids=input_ids)
94
+ print(outputs[0][0])
95
  ```
96
 
97