---
base_model:
- Qwen/Qwen2.5-3B-Instruct
datasets:
- PRIME-RL/Eurus-2-RL-Data
language:
- en
pipeline_tag: text-generation
---
|
|
|
# q1-3B-PRIME |
|
|
|
**q1-3B-PRIME** is a small reasoning model trained with reinforcement learning.
|
|
|
Trained from SmallThinker-3B-Preview (Qwen2.5-3B-Instruct fully finetuned on QwQ reasoning traces) as the base model, yielding roughly a 22.5% improvement on the test set after 120 training steps. (Note: substantial performance is likely still left on the table, since PRIME typically does not saturate until 300+ steps.)
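A minimal inference sketch using the `transformers` library. Assumptions: the model is published on the Hugging Face Hub (`your-username/q1-3B-PRIME` below is a placeholder repo id, not the real one), and the tokenizer ships the standard Qwen2.5 chat template inherited from the base model.

```python
# Minimal inference sketch for q1-3B-PRIME.
# "your-username/q1-3B-PRIME" is a PLACEHOLDER repo id -- replace it with
# the actual Hub id. The chat template is assumed to come from the tokenizer,
# as with the Qwen2.5-3B-Instruct base.

MODEL_ID = "your-username/q1-3B-PRIME"  # placeholder repo id

def generate(question: str, max_new_tokens: int = 1024) -> str:
    """Lazily load the model and sample one answer for a single user turn."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the echoed prompt.
    return tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

# Example usage (downloads ~3B parameters of weights on first call):
# print(generate("What is the sum of the first 100 positive integers?"))
```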
|
|
|
# Benchmark Performance |
|
|
|
## Math |
|
| Model | AIME24 | AMC23 | MATH-500 |
|---|---|---|---|
| Qwen2.5-3B-Instruct | 6.67 | 45.0 | - |
| **q1-3B-PRIME** | **26.67** | **67.5** | 64.8 |
| SmallThinker-3B-Preview | 16.67 | 57.5 | - |
| GPT-4o | 9.3 | 45.8 | **76.4** |
|
|
|
## Coding |
|
| Model | HumanEval | LeetCode |
|---|---|---|
| Qwen2.5-3B-Instruct | 74.4 | - |
| **q1-3B-PRIME** | 71.95 | **20.55** |
| GPT-4o | 90.2 | - |