A bit of reconciliation

#546
by vince62s - opened

Hi,
@clefourrier I am tagging you for two major things IMO.

  1. It would be great if there could be some kind of reconciliation between what is reported in the Leaderboard and in papers for "mainstream" models like Mistral 7B, Phi-2, ... For instance, the Mistral 7B paper reports 60.1 vs HF's 64.16 for MMLU
  2. It would be great if we could make sure to include the mainstream foundation models. I can't find the original Llama2 models in the Leaderboard

Cheers and thanks again for the great work.

Open LLM Leaderboard org

Hi @vince62s ,
Thanks for the issue!
For 1, our evaluation results are completely reproducible (see our About tab), using the specific setup of the Harness, contrary to a lot of results in technical reports, which do not explain exactly how the models were evaluated. Differences in scores mostly reflect differences in prompting/evaluation setups. Since our goal is to provide one single, reproducible way to evaluate models, this is not something we want to change (a sketch of such a reproduction run is included after this reply).
For 2, llama-2 is actually in the leaderboard; it's just listed under "deleted" because of an access token problem (it's gated). I'm fixing it, sorry about that.
Does that answer your points?
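
For reference, reproducing a leaderboard-style 5-shot MMLU run looks roughly like the sketch below. This is a minimal illustration assuming the lm-evaluation-harness v0.4+ Python API (installed with `pip install lm-eval`); the leaderboard actually pins a specific Harness commit and configuration, documented in the About tab, so exact task names and result keys may differ.

```python
# Minimal sketch: a 5-shot MMLU evaluation with lm-evaluation-harness.
# Assumptions: lm-eval v0.4+ is installed and the model fits on local hardware;
# the leaderboard's pinned commit/config (see the About tab) is authoritative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=mistralai/Mistral-7B-v0.1,dtype=bfloat16",
    tasks=["mmlu"],   # leaderboard MMLU numbers are 5-shot
    num_fewshot=5,
    batch_size=8,
)

# Per-task and aggregate scores live under results["results"];
# the exact key layout varies between harness versions.
print(results["results"]["mmlu"])
```

Running the same checkpoint through two different prompting or few-shot setups is often enough to see gaps of a few points on MMLU, which is the kind of difference discussed in this thread.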

Thanks for the quick response. Since you folks are in close contact with those teams, wouldn't it make sense to reach out to them and understand the methodology behind their roughly 4-point lower MMLU score? It's a big gap.
I'll check llama2 when it's up.

BTW, I don't see mistral-instruct-v0.2 or mixtral (the legacy ones).

FYI, there are duplicate entries for 01-ai/Yi.

[screenshot showing the duplicate 01-ai/Yi entries]

Open LLM Leaderboard org

Hi @vince62s ,
Have you taken a look at the FAQ? It's in the About tab, and I think it could answer a number of your questions (regarding duplicates, for example).

The mixtral models are flagged at the moment because of incorrect metadata - if you want to be sure you are displaying all the models, don't forget to select all the available checkboxes :)

clefourrier changed discussion status to closed
