This is a distillation experiment with SmolLM2-1.7B as the teacher and SmolLM2-360M as the student.
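The card doesn't spell out the training recipe, so below is a minimal sketch of what one logit-distillation step between these two models could look like. The temperature, loss weighting, and data handling are assumptions, not the actual setup.

```python
# Hypothetical sketch of a single logit-distillation step; the actual recipe,
# temperature, and loss weights used for d-SmolLM2-360M are not documented here.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-1.7B").eval()
student = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")

def distill_step(batch, temperature=2.0, alpha=0.5):
    # Teacher runs without gradients; both models share the SmolLM2 tokenizer,
    # so their vocabularies (and logit dimensions) line up.
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    out = student(**batch, labels=batch["input_ids"])
    # Soft targets: KL divergence between temperature-scaled distributions,
    # rescaled by T^2 as in standard knowledge distillation.
    kd = F.kl_div(
        F.log_softmax(out.logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    # Blend the distillation loss with the ordinary next-token cross-entropy.
    return alpha * kd + (1.0 - alpha) * out.loss

batch = tokenizer("Knowledge distillation example.", return_tensors="pt")
loss = distill_step(batch)
loss.backward()
```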

## Eval results using SmolLM evaluation scripts (LightEval)

The distilled model gains slightly over the base model on a few tasks, by small margins.

| Task | Version | Metric | aloobun/d-SmolLM2-360M | HuggingFaceTB/SmolLM2-360M |
|---|---|---|---|---|
| all | | acc_norm | 0.4653 | 0.4642 |
| all | | qem | 0.0961 | 0.1004 |
| custom:arc:_average:0 | | acc_norm | 0.5303 | 0.5305 |
| custom:arc:challenge:0 | 0 | acc_norm | 0.3771 | 0.3797 |
| custom:arc:easy:0 | 0 | acc_norm | 0.6835 | 0.6814 |
| custom:commonsense_qa:0 | 0 | acc_norm | 0.3784 | 0.3759 |
| custom:gsm8k:5 | 0 | qem | 0.0326 | 0.0334 |
| custom:hellaswag:0 | 0 | acc_norm | 0.5418 | 0.5456 |
| custom:mmlu_pro:0 | 0 | acc_norm | 0.1127 | 0.1130 |
| custom:openbook_qa:0 | 0 | acc_norm | 0.3760 | 0.3720 |
| custom:piqa:0 | 0 | acc_norm | 0.7214 | 0.7220 |
| custom:trivia_qa:0 | 0 | qem | 0.1596 | 0.1675 |
| custom:winogrande:0 | 0 | acc_norm | 0.5312 | 0.5241 |
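Here acc_norm is length-normalized accuracy, and qem is LightEval's quasi-exact-match metric, reported for the generative tasks (GSM8K and TriviaQA).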

## Eval results using lm-eval evaluation scripts

The distilled model slightly improves on the base model on the following tasks:

| Task | HuggingFaceTB/SmolLM2-360M | aloobun/d-SmolLM2-360M |
|---|---|---|
| leaderboard_bbh_causal_judgement | 0.4545 | 0.4652 |
| leaderboard_bbh_geometric_shapes | 0.1680 | 0.2040 |
| leaderboard_bbh_movie_recommendation | 0.2120 | 0.2440 |
| leaderboard_bbh_penguins_in_a_table | 0.2055 | 0.2123 |
| leaderboard_bbh_reasoning_about_colored_objects | 0.1160 | 0.1320 |
| leaderboard_bbh_ruin_names | 0.2360 | 0.2480 |
| leaderboard_bbh_salient_translation_error_detection | 0.1480 | 0.2120 |
| leaderboard_bbh_snarks | 0.5169 | 0.5281 |
| leaderboard_bbh_temporal_sequences | 0.2720 | 0.2800 |
| leaderboard_musr_murder_mysteries | 0.5040 | 0.5160 |
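These figures come from the lm-evaluation-harness leaderboard tasks. A run along the following lines should produce comparable numbers; the harness version, dtype, and batch size here are assumptions, and metric keys may differ slightly across harness releases.

```python
# Sketch of reproducing the comparison with lm-evaluation-harness
# (pip install lm-eval); task names follow the harness's leaderboard group.
import lm_eval

results = {}
for model_id in ("HuggingFaceTB/SmolLM2-360M", "aloobun/d-SmolLM2-360M"):
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id},dtype=bfloat16",
        tasks=["leaderboard_bbh", "leaderboard_musr"],
        batch_size=8,
    )
    results[model_id] = out["results"]

# Compare acc_norm on one subtask across the two models.
for model_id, res in results.items():
    print(model_id, res["leaderboard_bbh_causal_judgement"]["acc_norm,none"])
```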

Well, it didn’t work as well as I hoped; I’ll try again.

## Eval results for aloobun/d-SmolLM2-360M (WIP)

### GPQA

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_gpqa | N/A | | | | | |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm ↑ | 0.2071 | ± 0.0289 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm ↑ | 0.2308 | ± 0.0180 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm ↑ | 0.2679 | ± 0.0209 |
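GPQA is four-option multiple choice, so an acc_norm near 0.25 is essentially the random-guess baseline; these scores are roughly at chance.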

### MUSR

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_musr | N/A | | | | | |
| - leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm ↑ | 0.5160 | ± 0.0317 |
| - leaderboard_musr_object_placements | 1 | none | 0 | acc_norm ↑ | 0.2383 | ± 0.0267 |
| - leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm ↑ | 0.4400 | ± 0.0315 |

### BBH

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_bbh | N/A | | | | | |
| - leaderboard_bbh_boolean_expressions | 1 | none | 3 | acc_norm ↑ | 0.5480 | ± 0.0315 |
| - leaderboard_bbh_causal_judgement | 1 | none | 3 | acc_norm ↑ | 0.4652 | ± 0.0366 |
| - leaderboard_bbh_date_understanding | 1 | none | 3 | acc_norm ↑ | 0.1560 | ± 0.0230 |
| - leaderboard_bbh_disambiguation_qa | 1 | none | 3 | acc_norm ↑ | 0.3120 | ± 0.0294 |
| - leaderboard_bbh_formal_fallacies | 1 | none | 3 | acc_norm ↑ | 0.5240 | ± 0.0316 |
| - leaderboard_bbh_geometric_shapes | 1 | none | 3 | acc_norm ↑ | 0.2040 | ± 0.0255 |
| - leaderboard_bbh_hyperbaton | 1 | none | 3 | acc_norm ↑ | 0.5000 | ± 0.0317 |
| - leaderboard_bbh_logical_deduction_five_objects | 1 | none | 3 | acc_norm ↑ | 0.2240 | ± 0.0264 |
| - leaderboard_bbh_logical_deduction_seven_objects | 1 | none | 3 | acc_norm ↑ | 0.1440 | ± 0.0222 |
| - leaderboard_bbh_logical_deduction_three_objects | 1 | none | 3 | acc_norm ↑ | 0.3320 | ± 0.0298 |
| - leaderboard_bbh_movie_recommendation | 1 | none | 3 | acc_norm ↑ | 0.2440 | ± 0.0272 |
| - leaderboard_bbh_navigate | 1 | none | 3 | acc_norm ↑ | 0.5800 | ± 0.0313 |
| - leaderboard_bbh_object_counting | 1 | none | 3 | acc_norm ↑ | 0.2080 | ± 0.0257 |
| - leaderboard_bbh_penguins_in_a_table | 1 | none | 3 | acc_norm ↑ | 0.2123 | ± 0.0340 |
| - leaderboard_bbh_reasoning_about_colored_objects | 1 | none | 3 | acc_norm ↑ | 0.1320 | ± 0.0215 |
| - leaderboard_bbh_ruin_names | 1 | none | 3 | acc_norm ↑ | 0.2480 | ± 0.0274 |
| - leaderboard_bbh_salient_translation_error_detection | 1 | none | 3 | acc_norm ↑ | 0.2120 | ± 0.0259 |
| - leaderboard_bbh_snarks | 1 | none | 3 | acc_norm ↑ | 0.5281 | ± 0.0375 |
| - leaderboard_bbh_sports_understanding | 1 | none | 3 | acc_norm ↑ | 0.4600 | ± 0.0316 |
| - leaderboard_bbh_temporal_sequences | 1 | none | 3 | acc_norm ↑ | 0.2800 | ± 0.0285 |
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1 | none | 3 | acc_norm ↑ | 0.1720 | ± 0.0239 |
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1 | none | 3 | acc_norm ↑ | 0.1440 | ± 0.0222 |
| - leaderboard_bbh_tracking_shuffled_objects_three_objects | 1 | none | 3 | acc_norm ↑ | 0.3000 | ± 0.0290 |
| - leaderboard_bbh_web_of_lies | 1 | none | 3 | acc_norm ↑ | 0.5480 | ± 0.0315 |

### MMLU_PRO

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_mmlu_pro | 0.1 | none | 5 | acc ↑ | 0.1173 | ± 0.0029 |

### IFEVAL

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc ↑ | 0.2866 | ± N/A |
| | | none | 0 | inst_level_strict_acc ↑ | 0.2770 | ± N/A |
| | | none | 0 | prompt_level_loose_acc ↑ | 0.1497 | ± 0.0154 |
| | | none | 0 | prompt_level_strict_acc ↑ | 0.1423 | ± 0.0150 |

### MATH HARD

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_math_hard | N/A | | | | | |
| - leaderboard_math_algebra_hard | 2 | none | 4 | exact_match ↑ | 0.0033 | ± 0.0033 |
| - leaderboard_math_counting_and_prob_hard | 2 | none | 4 | exact_match ↑ | 0.0081 | ± 0.0081 |
| - leaderboard_math_geometry_hard | 2 | none | 4 | exact_match ↑ | 0.0000 | ± 0.0000 |
| - leaderboard_math_intermediate_algebra_hard | 2 | none | 4 | exact_match ↑ | 0.0000 | ± 0.0000 |
| - leaderboard_math_num_theory_hard | 2 | none | 4 | exact_match ↑ | 0.0065 | ± 0.0065 |
| - leaderboard_math_prealgebra_hard | 2 | none | 4 | exact_match ↑ | 0.0104 | ± 0.0073 |
| - leaderboard_math_precalculus_hard | 2 | none | 4 | exact_match ↑ | 0.0000 | ± 0.0000 |

## Open LLM Leaderboard Evaluation Results

Detailed results can be found here.

| Metric | Value |
|---|---|
| Avg. | 6.01 |
| IFEval (0-shot) | 20.97 |
| BBH (3-shot) | 4.76 |
| MATH Lvl 5 (4-shot) | 0.23 |
| GPQA (0-shot) | 0.45 |
| MuSR (0-shot) | 7.76 |
| MMLU-PRO (5-shot) | 1.88 |
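For reference, the average is the mean of the six normalized scores above: (20.97 + 4.76 + 0.23 + 0.45 + 7.76 + 1.88) / 6 ≈ 6.01.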