Model description

This model is a fine-tuned version of Qwen/Qwen2.5-32B-Instruct on the Bespoke-Stratos-17k dataset. The dataset was derived by distilling DeepSeek-R1 using the data pipeline of Berkeley NovaSky's Sky-T1, with some modifications; more information is available in the Bespoke-Stratos-17k dataset card. The model outperforms Qwen2.5-32B-Instruct on reasoning benchmarks:

| Metric | Bespoke-Stratos-32B | Sky-T1-32B | o1-preview | DeepSeek-R1 | DeepSeek-R1-Distill-Qwen-32B (Ours // Reported) |
|---|---|---|---|---|---|
| AIME2024 | 63.3 | 43.3 | 40.0 | 79.8 | 66.7 // 72.6 |
| MATH500 | 93.0 | 82.4 | 81.4 | 97.3 | 89.8 // 94.3 |
| GPQA-Diamond | 58.1 | 56.8 | 75.2 | 71.5 | 61.1 // 62.1 |
| LCB v2 Easy | 96.7 | 86.3 | 92.9 | - | 91.2 // - |
| LCB v2 Medium | 75.2 | 56.8 | 54.9 | - | 75.7 // - |
| LCB v2 Hard | 26.2 | 17.9 | 16.3 | - | 38.2 // - |
| LCB v2 All | 71.1 | 57.9 | 59.1 | - | 72.2 // - |
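
The Bespoke-Stratos-17k data referenced above can be inspected directly with the datasets library. A minimal sketch, assuming the dataset is published under the bespokelabs/Bespoke-Stratos-17k hub ID:

```python
from datasets import load_dataset

# Hub ID assumed from the dataset name above.
ds = load_dataset("bespokelabs/Bespoke-Stratos-17k", split="train")

print(ds)     # row count and column names
print(ds[0])  # one distilled DeepSeek-R1 reasoning example
```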

Intended uses & limitations

Apache 2.0 License
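
The model is intended to be used like any other Qwen2.5-Instruct-style chat model. A minimal generation sketch with transformers, assuming the chat template inherited from the instruct base model (the prompt and generation budget are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bespokelabs/Bespoke-Stratos-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "What is 27 * 43? Think step by step."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning traces can be long, so leave a generous generation budget.
output_ids = model.generate(input_ids, max_new_tokens=2048)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

In BF16 the weights of a ~32.8B-parameter model alone take roughly 66 GB, so device_map="auto" is used here to shard the model across the available accelerators.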

Training procedure

We trained the model for 27 hours on 8x H100 GPUs.

Training hyperparameters

The following hyperparameters were used during training (see the sketch after this list):

  • learning_rate: 1e-05
  • train_batch_size: 1
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • gradient_accumulation_steps: 12
  • total_train_batch_size: 96
  • total_eval_batch_size: 64
  • optimizer: adamw_torch with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 3.0
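
For reference, a sketch of how these settings map onto a transformers TrainingArguments object (output_dir and the bf16 flag are assumptions; the actual training script is not included in this card):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bespoke-stratos-32b",  # placeholder path
    learning_rate=1e-5,
    per_device_train_batch_size=1,     # 1 per GPU x 8 GPUs x 12 accumulation steps = 96 effective
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=12,
    num_train_epochs=3.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="adamw_torch",               # betas=(0.9, 0.999) and eps=1e-8 are the AdamW defaults
    seed=42,
    bf16=True,                         # assumed, matching the released BF16 weights
)
```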

Training results

Framework versions

  • Transformers 4.46.1
  • Pytorch 2.5.1+cu124
  • Datasets 3.1.0
  • Tokenizers 0.20.3