add results table
README.md
CHANGED
@@ -101,8 +101,6 @@ NuminaMath is a series of language models that are trained with two stages of su
|
|
101 |
* **Stage 1:** fine-tune the base model on a large, diverse dataset of natural language math problems and solutions, where each solution is templated with Chain of Thought (CoT) to facilitate reasoning.
|
102 |
* **Stage 2:** fine-tune the model from Stage 1 on a synthetic dataset of tool-integrated reasoning, where each math problem is decomposed into a sequence of rationales, Python programs, and their outputs.
|
103 |
|
104 |
-
|
105 |
-
|
106 |
## Model description
|
107 |
|
108 |
- **Model type:** A 72B parameter math LLM fine-tuned on a dataset with 860k+ math problem-solution pairs.
|
@@ -110,6 +108,21 @@ NuminaMath is a series of language models that are trained with two stages of su
|
|
110 |
- **License:** Tongyi Qianwen
|
111 |
- **Finetuned from model:** [Qwen/Qwen2-72B](https://huggingface.co/Qwen/Qwen2-72B)
|
112 |
|
113 |
### Model Sources
|
114 |
|
115 |
<!-- Provide the basic links for the model. -->
|
|
|
101 |
* **Stage 1:** fine-tune the base model on a large, diverse dataset of natural language math problems and solutions, where each solution is templated with Chain of Thought (CoT) to facilitate reasoning.
|
102 |
* **Stage 2:** fine-tune the model from Stage 1 on a synthetic dataset of tool-integrated reasoning, where each math problem is decomposed into a sequence of rationales, Python programs, and their outputs.
|
103 |
|
|
|
|
|
104 |
## Model description
|
105 |
|
106 |
- **Model type:** A 72B parameter math LLM fine-tuned on a dataset with 860k+ math problem-solution pairs.
|
|
|
108 |
- **License:** Tongyi Qianwen
|
109 |
- **Finetuned from model:** [Qwen/Qwen2-72B](https://huggingface.co/Qwen/Qwen2-72B)
|
110 |
|
+## Model performance
+
+| | | NuminaMath-72B-CoT | NuminaMath-72B-TIR | Qwen2-72B-Instruct | Llama3-70B-Instruct | Claude-3.5-Sonnet | GPT-4o-0513 |
+| --- | --- | :---: | :---: | :---: | :---: | :---: | :---: |
+| **GSM8k** | 0-shot | 91.4% | 91.5% | 91.1% | 93.0% | **96.4%** | 95.8% |
+| Grade school math |
+| **MATH** | 0-shot | 68.0% | 75.8% | 59.7% | 50.4% | 71.1% | **76.6%** |
+| Math problem-solving |
+| **AMC 2023** | 0-shot | 21/40 | **24/40** | 19/40 | 13/40 | 17/40 | 20/40 |
+| Competition-level math | maj@64 | 24/40 | **34/40** | 21/40 | 13/40 | - | - |
+| **AIME 2024** | 0-shot | 1/30 | **5/30** | 3/30 | 0/30 | 2/30 | 2/30 |
+| Competition-level math | maj@64 | 3/30 | **12/30** | 4/30 | 2/30 | - | - |
+
+*Table: Comparison of various open weight and proprietary language models on different math benchmarks. All scores except those for NuminaMath-72B-TIR are reported without tool-integrated reasoning.*
+
|
126 |
### Model Sources
|
127 |
|
128 |
<!-- Provide the basic links for the model. -->
|
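The `maj@64` rows in the added table are not defined in this diff. By the usual convention, `maj@k` means sampling k solutions per problem and scoring the most frequent final answer. The sketch below illustrates that convention under stated assumptions: the function names (`majority_vote`, `maj_at_k_score`), the exact-string answer matching, and the toy data are all hypothetical and are not taken from the NuminaMath evaluation code.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among sampled solutions.

    Ties go to the answer that was seen first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # Counter.most_common lists equal counts in first-encountered order.
    return counts.most_common(1)[0][0]


def maj_at_k_score(samples_per_problem, references):
    """Count problems whose majority-voted answer equals the reference."""
    return sum(
        majority_vote(samples) == ref
        for samples, ref in zip(samples_per_problem, references)
    )


# Toy data (hypothetical): 3 problems with 4 sampled answers each.
samples = [
    ["12", "12", "7", "12"],
    ["5", "3", "3", "3"],
    ["1", "2", "2", "4"],
]
refs = ["12", "3", "4"]

score = maj_at_k_score(samples, refs)
print(f"{score}/{len(refs)}")
```

With 64 samples per problem instead of 4, the same count over 40 AMC or 30 AIME problems would yield scores in the `x/40` and `x/30` form used in the table. A real evaluation would likely normalize answers (for example, treating `1/2` and `0.5` as equal) before voting, which this sketch does not do.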