How to understand the difference between a local report and the scores reported on the Open LLM Leaderboard?

#964
by xinchen9 - opened

I ran mistralai/Mistral-7B-Instruct-v0.2 locally according to the instructions:

git clone [email protected]:huggingface/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout adding_all_changess
pip install -e .[math,ifeval,sentencepiece]
lm-eval --model_args="pretrained=mistralai/Mistral-7B-Instruct-v0.2,revision=,dtype=float" --tasks=leaderboard --batch_size=auto
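(As an aside, if you prefer to drive the harness from Python and keep the raw numbers around for post-processing, something like the sketch below should work. It assumes the `lm_eval.simple_evaluate` entry point on this branch behaves like the one on the main branch, and the output filename is made up for the example.)

```python
# Hedged sketch: the same leaderboard run via the harness's Python API instead of the CLI.
import json
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mistral-7B-Instruct-v0.2,dtype=float",
    tasks=["leaderboard"],
    batch_size="auto",
)

# results["results"] holds the same per-task metrics printed in the table below;
# dumping it to disk gives a results file to post-process later.
with open("mistral_7b_instruct_v02_results.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)
```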

And I got the following results:
hf (pretrained=./Mistral-7B-Instruct-v0.2,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (1)

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard | N/A | none | 0 | acc ↑ | 0.2713 | ± 0.0041 |
| | | none | 0 | acc_norm ↑ | 0.4322 | ± 0.0054 |
| | | none | 0 | exact_match ↑ | 0.0204 | ± 0.0039 |
| | | none | 0 | inst_level_loose_acc ↑ | 0.5588 | ± N/A |
| | | none | 0 | inst_level_strict_acc ↑ | 0.5072 | ± N/A |
| | | none | 0 | prompt_level_loose_acc ↑ | 0.4288 | ± 0.0213 |
| | | none | 0 | prompt_level_strict_acc ↑ | 0.3826 | ± 0.0209 |
| - leaderboard_bbh | N/A | none | 3 | acc_norm ↑ | 0.4581 | ± 0.0062 |
| - leaderboard_bbh_boolean_expressions | 0 | none | 3 | acc_norm ↑ | 0.7840 | ± 0.0261 |
| - leaderboard_bbh_causal_judgement | 0 | none | 3 | acc_norm ↑ | 0.6150 | ± 0.0357 |
| - leaderboard_bbh_date_understanding | 0 | none | 3 | acc_norm ↑ | 0.3680 | ± 0.0306 |
| - leaderboard_bbh_disambiguation_qa | 0 | none | 3 | acc_norm ↑ | 0.6120 | ± 0.0309 |
| - leaderboard_bbh_formal_fallacies | 0 | none | 3 | acc_norm ↑ | 0.4760 | ± 0.0316 |
| - leaderboard_bbh_geometric_shapes | 0 | none | 3 | acc_norm ↑ | 0.3560 | ± 0.0303 |
| - leaderboard_bbh_hyperbaton | 0 | none | 3 | acc_norm ↑ | 0.6520 | ± 0.0302 |
| - leaderboard_bbh_logical_deduction_five_objects | 0 | none | 3 | acc_norm ↑ | 0.3440 | ± 0.0301 |
| - leaderboard_bbh_logical_deduction_seven_objects | 0 | none | 3 | acc_norm ↑ | 0.3080 | ± 0.0293 |
| - leaderboard_bbh_logical_deduction_three_objects | 0 | none | 3 | acc_norm ↑ | 0.4840 | ± 0.0317 |
| - leaderboard_bbh_movie_recommendation | 0 | none | 3 | acc_norm ↑ | 0.5240 | ± 0.0316 |
| - leaderboard_bbh_navigate | 0 | none | 3 | acc_norm ↑ | 0.5520 | ± 0.0315 |
| - leaderboard_bbh_object_counting | 0 | none | 3 | acc_norm ↑ | 0.3600 | ± 0.0304 |
| - leaderboard_bbh_penguins_in_a_table | 0 | none | 3 | acc_norm ↑ | 0.4315 | ± 0.0411 |
| - leaderboard_bbh_reasoning_about_colored_objects | 0 | none | 3 | acc_norm ↑ | 0.4120 | ± 0.0312 |
| - leaderboard_bbh_ruin_names | 0 | none | 3 | acc_norm ↑ | 0.4640 | ± 0.0316 |
| - leaderboard_bbh_salient_translation_error_detection | 0 | none | 3 | acc_norm ↑ | 0.4000 | ± 0.0310 |
| - leaderboard_bbh_snarks | 0 | none | 3 | acc_norm ↑ | 0.5618 | ± 0.0373 |
| - leaderboard_bbh_sports_understanding | 0 | none | 3 | acc_norm ↑ | 0.7960 | ± 0.0255 |
| - leaderboard_bbh_temporal_sequences | 0 | none | 3 | acc_norm ↑ | 0.2920 | ± 0.0288 |
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 0 | none | 3 | acc_norm ↑ | 0.2560 | ± 0.0277 |
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 0 | none | 3 | acc_norm ↑ | 0.1480 | ± 0.0225 |
| - leaderboard_bbh_tracking_shuffled_objects_three_objects | 0 | none | 3 | acc_norm ↑ | 0.3400 | ± 0.0300 |
| - leaderboard_bbh_web_of_lies | 0 | none | 3 | acc_norm ↑ | 0.5160 | ± 0.0317 |
| - leaderboard_gpqa | N/A | none | 0 | acc_norm ↑ | 0.2819 | ± 0.0130 |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm ↑ | 0.2374 | ± 0.0303 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm ↑ | 0.2949 | ± 0.0195 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm ↑ | 0.2857 | ± 0.0214 |
| - leaderboard_ifeval | 2 | none | 0 | inst_level_loose_acc ↑ | 0.5588 | ± N/A |
| | | none | 0 | inst_level_strict_acc ↑ | 0.5072 | ± N/A |
| | | none | 0 | prompt_level_loose_acc ↑ | 0.4288 | ± 0.0213 |
| | | none | 0 | prompt_level_strict_acc ↑ | 0.3826 | ± 0.0209 |
| - leaderboard_math_algebra_hard | 1 | none | 4 | exact_match ↑ | 0.0261 | ± 0.0091 |
| - leaderboard_math_counting_and_prob_hard | 1 | none | 4 | exact_match ↑ | 0.0244 | ± 0.0140 |
| - leaderboard_math_geometry_hard | 1 | none | 4 | exact_match ↑ | 0.0076 | ± 0.0076 |
| - leaderboard_math_hard | N/A | none | 4 | exact_match ↑ | 0.0204 | ± 0.0039 |
| - leaderboard_math_intermediate_algebra_hard | 1 | none | 4 | exact_match ↑ | 0.0071 | ± 0.0050 |
| - leaderboard_math_num_theory_hard | 1 | none | 4 | exact_match ↑ | 0.0065 | ± 0.0065 |
| - leaderboard_math_prealgebra_hard | 1 | none | 4 | exact_match ↑ | 0.0570 | ± 0.0167 |
| - leaderboard_math_precalculus_hard | 1 | none | 4 | exact_match ↑ | 0.0074 | ± 0.0074 |
| - leaderboard_mmlu_pro | 0.1 | none | 5 | acc ↑ | 0.2713 | ± 0.0041 |
| - leaderboard_musr | N/A | none | 0 | acc_norm ↑ | 0.4722 | ± 0.0179 |
| - leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm ↑ | 0.5440 | ± 0.0316 |
| - leaderboard_musr_object_placements | 1 | none | 0 | acc_norm ↑ | 0.3477 | ± 0.0298 |
| - leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm ↑ | 0.5280 | ± 0.0316 |

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard | N/A | none | 0 | acc ↑ | 0.2713 | ± 0.0041 |
| | | none | 0 | acc_norm ↑ | 0.4322 | ± 0.0054 |
| | | none | 0 | exact_match ↑ | 0.0204 | ± 0.0039 |
| | | none | 0 | inst_level_loose_acc ↑ | 0.5588 | ± N/A |
| | | none | 0 | inst_level_strict_acc ↑ | 0.5072 | ± N/A |
| | | none | 0 | prompt_level_loose_acc ↑ | 0.4288 | ± 0.0213 |
| | | none | 0 | prompt_level_strict_acc ↑ | 0.3826 | ± 0.0209 |
| - leaderboard_bbh | N/A | none | 3 | acc_norm ↑ | 0.4581 | ± 0.0062 |
| - leaderboard_gpqa | N/A | none | 0 | acc_norm ↑ | 0.2819 | ± 0.0130 |
| - leaderboard_math_hard | N/A | none | 4 | exact_match ↑ | 0.0204 | ± 0.0039 |
| - leaderboard_musr | N/A | none | 0 | acc_norm ↑ | 0.4722 | ± 0.0179 |

But the scores reported on the leaderboard are:

mistralai/Mistral-7B-Instruct-v0.2

18.44, 54.96, 22.91, 2.64, 3.47, 7.61, 19.08

How should I understand the difference between the two, and how do I convert the local results into the format of the Open LLM Leaderboard?

Open LLM Leaderboard org

Hi @xinchen9 ,

Let me help you. After the evaluation you will get a results file, like the one we have for mistralai/Mistral-7B-Instruct-v0.2 (link to the results json).

We process all result files to get normalised values for all benchmarks; you can find out more in our documentation. There you will also find the Colab notebook, so feel free to copy it and normalise your results.
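To make the idea concrete, here is a minimal sketch of the kind of rescaling the documentation describes: a raw score is mapped so that the random baseline becomes 0 and a perfect score becomes 100, and the rescaled group scores are then averaged. The function name, the 0.25 baseline, and the choice of the local GPQA number are assumptions for illustration only; the exact per-benchmark baselines and subtask weighting are in the documentation and the Colab notebook.

```python
# Illustrative sketch of leaderboard-style normalisation (not the official code):
# map a raw accuracy so the random baseline -> 0 and a perfect score -> 100.

def normalize_within_range(raw_score: float, lower_bound: float, upper_bound: float = 1.0) -> float:
    """Rescale raw_score from [lower_bound, upper_bound] onto [0, 100], clipping at the baseline."""
    if raw_score <= lower_bound:
        return 0.0
    return 100.0 * (raw_score - lower_bound) / (upper_bound - lower_bound)

# Example with the local GPQA acc_norm from the run above, assuming a 0.25
# random baseline for four-way multiple choice (an assumption for this sketch):
local_gpqa_acc_norm = 0.2819
print(round(normalize_within_range(local_gpqa_acc_norm, 0.25), 2))  # ~4.25
```

This is also why the 0-1 accuracies in the local table above and the leaderboard's 0-100 numbers are not directly comparable before normalisation.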

I hope this is clear. Feel free to ask any questions!

Open LLM Leaderboard org

Closing this discussion due to inactivity

alozowski changed discussion status to closed
