---
base_model:
- Qwen/Qwen2.5-3B-Instruct
- PowerInfer/SmallThinker-3B-Preview
datasets:
- PRIME-RL/Eurus-2-RL-Data
language:
- en
pipeline_tag: text-generation
---

# q1-3B-PRIME
q1-3B-PRIME is a small reasoning model trained with reinforcement learning via PRIME (Process Reinforcement through Implicit Rewards). It uses SmallThinker-3B-Preview (Qwen2.5-3B-Instruct fully fine-tuned on QwQ reasoning traces) as the base model and improves on it by roughly 22.5% on the test set after only 120 training steps. (Note: substantial headroom likely remains, since PRIME does not saturate until past 300 steps.)
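A minimal inference sketch with the Hugging Face `transformers` library. The repo id in the usage comment is a placeholder (the published checkpoint path is not stated here), and the chat template is assumed to follow standard Qwen2.5 conventions:

```python
def build_messages(question: str) -> list[dict]:
    """Wrap a question in the chat-format message list Qwen2.5-style models expect."""
    return [{"role": "user", "content": question}]

def generate(repo_id: str, question: str, max_new_tokens: int = 2048) -> str:
    """Load the checkpoint and sample a reasoning trace for `question`."""
    # transformers is imported lazily so the prompt helper above stays
    # importable even without the library installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id, torch_dtype="auto", device_map="auto"
    )
    prompt = tokenizer.apply_chat_template(
        build_messages(question), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

# Example (placeholder repo id -- substitute the actual checkpoint path):
# print(generate("<your-namespace>/q1-3B-PRIME", "Solve: 2x + 3 = 11"))
```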
## Benchmark Performance

### Math

| Model | AIME24 | AMC23 | MATH-500 |
|---|---|---|---|
| Qwen2.5-3B-Instruct | 6.67 | 45.0 | - |
| SmallThinker-3B-Preview | 16.67 | 57.5 | - |
| q1-3B-PRIME | 26.67 | 67.5 | 64.8 |
| Eurus-2-7B-PRIME | 26.67 | 57.8 | 79.2 |
| GPT-4o | 9.3 | 45.8 | 76.4 |
### Coding

| Model | HumanEval | LeetCode |
|---|---|---|
| Qwen2.5-3B-Instruct | 74.4 | - |
| q1-3B-PRIME | 71.95 | 20.55 |
| GPT-4o | 90.2 | - |