q1-3B-PRIME

q1-3B-PRIME is a small reasoning model trained with reinforcement learning.

It was trained using SmallThinker-3B-Preview as the base model (Qwen2.5-3B-Instruct fully fine-tuned on QwQ reasoning traces), yielding roughly a 22.5% improvement on the test set in 120 training steps. (Note: there is likely performance left on the table, since PRIME only saturates after 300+ steps.)
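
A minimal inference sketch using the Hugging Face transformers API, assuming the checkpoint loads like its Qwen2.5-3B-Instruct base (standard causal-LM weights plus the base model's chat template). The prompt and generation settings below are illustrative, not the evaluation setup used for the benchmarks.

```python
# Minimal inference sketch (illustrative; not the benchmark evaluation setup).
# Assumes the checkpoint loads like its Qwen2.5-3B-Instruct base: standard
# causal-LM weights plus the base model's chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rawsh/q1-3B-PRIME"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").to(device)

# Example math prompt; the model typically emits a reasoning trace before the answer.
messages = [{"role": "user", "content": "What is the sum of the first 50 positive odd integers?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(device)

outputs = model.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```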

Benchmark Performance

Math

| Model | AIME24 | AMC23 | MATH-500 |
|---|---|---|---|
| Qwen2.5-3B-Instruct | 6.67 | 45 | - |
| SmallThinker-3B-Preview | 16.667 | 57.5 | - |
| q1-3B-PRIME | 26.667 | 67.5 | 64.8 |
| Eurus-7B-PRIME | 26.667 | 57.8 | 79.2 |
| GPT-4o | 9.3 | 45.8 | 76.4 |

Coding

| Model | HumanEval | Leetcode |
|---|---|---|
| Qwen2.5-3B-Instruct | 74.4 | - |
| q1-3B-PRIME | 71.95 | 20.55 |
| GPT-4o | 90.2 | - |