ychenNLP commited on
Commit
2e18fe8
·
verified ·
1 Parent(s): 37fbeb4

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +7 -0
README.md CHANGED
@@ -33,6 +33,13 @@ For more information about AceMath, check our [website](https://research.nvidia.
33
  ### Base Models
34
  - [AceInstruct-1.5B](https://huggingface.co/nvidia/AceInstruct-1.5B), [AceInstruct-7B](https://huggingface.co/nvidia/AceInstruct-7B), [AceInstruct-72B](https://huggingface.co/nvidia/AceInstruct-72B)
35
 
 
 
 
 
 
 
 
36
 
37
 
38
  ## Reward Model Benchmark Results
 
33
  ### Base Models
34
  - [AceInstruct-1.5B](https://huggingface.co/nvidia/AceInstruct-1.5B), [AceInstruct-7B](https://huggingface.co/nvidia/AceInstruct-7B), [AceInstruct-72B](https://huggingface.co/nvidia/AceInstruct-72B)
35
 
36
+ ## Benchmark Results (AceMath-Instruct + AceMath-72B-RM)
37
+
38
+ <p align="center">
39
+ <img src="https://research.nvidia.com/labs/adlr/images/acemath/acemath.png" alt="AceMath Benchmark Results" width="800">
40
+ </p>
41
+
42
+ We compare AceMath to leading proprietary and open-access math models in above Table. Our AceMath-7B-Instruct, largely outperforms the previous best-in-class Qwen2.5-Math-7B-Instruct (Average pass@1: 67.2 vs. 62.9) on a variety of math reasoning benchmarks, while coming close to the performance of 10× larger Qwen2.5-Math-72B-Instruct (67.2 vs. 68.2). Notably, our AceMath-72B-Instruct outperforms the state-of-the-art Qwen2.5-Math-72B-Instruct (71.8 vs. 68.2), GPT-4o (67.4) and Claude 3.5 Sonnet (65.6) by a margin. We also report the rm@8 accuracy (best of 8) achieved by our reward model, AceMath-72B-RM, which sets a new record on these reasoning benchmarks. This excludes OpenAI’s o1 model, which relies on scaled inference computation.
43
 
44
 
45
  ## Reward Model Benchmark Results