Kudos - really strong small model! MMLU-Pro benchmarks

#4
by Philp - opened

I'm running the 4 bit mlx quantized model on a standard m4 macbook with LMStudio.
Some of the areas the performance is on par or better than 70B models.
Particularly strong in math, biology, business, chemistry, economics, engineering - may not be all.

MMLU-PRO Leaderboard:
https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro

2025-02-24 09:44:21.855992
{
"comment": "",
"server": {
"url": "http://127.0.0.1:1234/v1",
"model": "openthinker-7b-mlx@4bit",
"timeout": 600.0
},
"inference": {
"temperature": 0.0,
"top_p": 1.0,
"max_tokens": 2048,
"system_prompt": "The following are multiple choice questions (with answers) about {subject}. Think step by step and then finish your answer with "the answer is (X)" where X is the correct letter choice.",
"style": "multi_chat"
},
"test": {
"subset": 0.05,
"parallel": 1
},
"log": {
"verbosity": 0,
"log_prompt": true
}
}
Finished testing biology in .
Total, 24/35, 68.57%
Random Guess Attempts, 1/35, 2.86%
Correct Random Guesses, 0/1, 0.00%
Adjusted Score Without Random Guesses, 24/34, 70.59%
Finished testing business in 29 minutes 32 seconds.
Total, 26/39, 66.67%
Random Guess Attempts, 0/39, 0.00%
Correct Random Guesses, division by zero error
Adjusted Score Without Random Guesses, 26/39, 66.67%
Finished testing chemistry in 1 hours 12 minutes 26 seconds.
Total, 33/56, 58.93%
Random Guess Attempts, 1/56, 1.79%
Correct Random Guesses, 0/1, 0.00%
Adjusted Score Without Random Guesses, 33/55, 60.00%
Finished testing computer science in .
Total, 10/20, 50.00%
Random Guess Attempts, 0/20, 0.00%
Correct Random Guesses, division by zero error
Adjusted Score Without Random Guesses, 10/20, 50.00%
Finished testing economics in 19 minutes 23 seconds.
Total, 27/42, 64.29%
Random Guess Attempts, 1/42, 2.38%
Correct Random Guesses, 1/1, 100.00%
Adjusted Score Without Random Guesses, 26/41, 63.41%
Finished testing engineering in 43 minutes 56 seconds.
Total, 21/48, 43.75%
Random Guess Attempts, 0/48, 0.00%
Correct Random Guesses, division by zero error
Adjusted Score Without Random Guesses, 21/48, 43.75%
Finished testing health in 17 minutes 18 seconds.
Total, 20/40, 50.00%
Random Guess Attempts, 0/40, 0.00%
Correct Random Guesses, division by zero error
Adjusted Score Without Random Guesses, 20/40, 50.00%
Finished testing history in 11 minutes 13 seconds.
Total, 7/19, 36.84%
Random Guess Attempts, 0/19, 0.00%
Correct Random Guesses, division by zero error
Adjusted Score Without Random Guesses, 7/19, 36.84%
Finished testing law in 36 minutes 52 seconds.
Total, 15/55, 27.27%
Random Guess Attempts, 0/55, 0.00%
Correct Random Guesses, division by zero error
Adjusted Score Without Random Guesses, 15/55, 27.27%
Finished testing math in 46 minutes 10 seconds.
Total, 51/67, 76.12%
Random Guess Attempts, 0/67, 0.00%
Correct Random Guesses, division by zero error
Adjusted Score Without Random Guesses, 51/67, 76.12%
Finished testing philosophy in 12 minutes 5 seconds.
Total, 10/24, 41.67%
Random Guess Attempts, 0/24, 0.00%
Correct Random Guesses, division by zero error
Adjusted Score Without Random Guesses, 10/24, 41.67%
Finished testing physics in 46 minutes 30 seconds.
Total, 37/64, 57.81%
Random Guess Attempts, 0/64, 0.00%
Correct Random Guesses, division by zero error
Adjusted Score Without Random Guesses, 37/64, 57.81%
Finished testing psychology in 14 minutes 49 seconds.
Total, 18/39, 46.15%
Random Guess Attempts, 0/39, 0.00%
Correct Random Guesses, division by zero error
Adjusted Score Without Random Guesses, 18/39, 46.15%
Finished testing other in 23 minutes 39 seconds.
Total, 22/46, 47.83%
Random Guess Attempts, 0/46, 0.00%
Correct Random Guesses, division by zero error
Adjusted Score Without Random Guesses, 22/46, 47.83%
Finished the benchmark in 6 hours 14 minutes 1 seconds.
Total, 321/594, 54.04%
Random Guess Attempts, 3/594, 0.51%
Correct Random Guesses, 1/3, 33.33%
Adjusted Score Without Random Guesses, 320/591, 54.15%
Token Usage:
Prompt tokens: min 917, average 1382, max 2479, total 744828, tk/s 33.19
Completion tokens: min 30, average 900, max 2047, total 485308, tk/s 21.63
Markdown Table:

overall biology business chemistry computer science economics engineering health history law math philosophy physics psychology other
54.04 68.57 66.67 58.93 50.00 64.29 43.75 50.00 36.84 27.27 76.12 41.67 57.81 46.15 47.83
Open Thoughts org

This is amazing. Thank you so much for doing these evals for us! At 21 TPS, how long did the total eval take?

Sign up or log in to comment