Open-source the evaluation code
Could the code for evaluating LLMs be released? Is it completely based on EleutherAI/lm-evaluation-harness? It's not clear how the in-context examples are selected.
Thanks!
I replicated the results of LLaMA and Vicuna on this leaderboard perfectly using the EleutherAI/lm-evaluation-harness. The metrics are acc_norm for ARC-Challenge, MMLU, and Hellaswag, and mc2 for truthfulQA_mc.
Yep, I could also run the evaluations using EleutherAI's repository, but I can't find which metrics are used. Is it documented somewhere I am not aware of?
Yes, go to Files (next to App) of open_llm_leaderboard, then open the utils.py file. It lists the benchmarks and the metrics.
Also, EleutherAI/lm-evaluation-harness doesn't provide good support for evaluating huge models (>20B). It would be great if open_llm_leaderboard could share their pipeline.
Hi, could you please share the exact command you ran? I found "hendrycks" for MMLU, but there are a ton of different sub-tasks of hendrycks (like hendrycksTest-abstract_algebra). Is there a way to run them all?
Thanks!
Hi @64bits ,
MMLU has 57 different tasks. They are formatted as hendrycksTest-{sub} in the lm-evaluation-harness, where sub is a topic like abstract_algebra. You need to evaluate on all the tasks and compute the average of acc_norm across tasks. You can write a bash script that creates an array of topics and loops over them to run them sequentially, which will be very slow; I ran the evaluation in parallel across tasks on a Slurm-based compute cluster.
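For illustration, here is a minimal bash sketch of that sequential approach. The subject list is truncated, the model ID is just a placeholder taken from later in this thread, and the main.py flags mirror the commands shared elsewhere in this discussion, so adapt everything to your own setup:

```bash
#!/usr/bin/env bash
# Hypothetical sketch: run a handful of the 57 MMLU subtasks sequentially.
# Extend SUBJECTS to all 57 hendrycksTest-* topics for a full evaluation.
MODEL_PATH=yahma/llama-7b-hf                     # placeholder model ID or local path
SUBJECTS=(abstract_algebra anatomy astronomy)    # ...plus the remaining subjects

for sub in "${SUBJECTS[@]}"; do
  python main.py \
    --model hf-causal \
    --model_args pretrained=$MODEL_PATH \
    --tasks hendrycksTest-$sub \
    --num_fewshot 5 \
    --device cuda:0 \
    --output_path ./results/hendrycksTest-$sub-5shot.json
done
```

Averaging acc_norm over the resulting per-subject JSON files then gives the leaderboard-style MMLU number.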
Could you tell me the name of the llama-7b Hugging Face model? I'm struggling because the result of yahma/llama-7b-hf does not match the leaderboard result. @itanh0b It would be very helpful if you could kindly share the model's name or the command you used.
Hello @v-xchen-v ,
I converted the LLaMA model to the Hugging Face format myself, so I do not know how yahma/llama-7b-hf would do. Are you getting worse or better results? The commit which reproduces the Open LLM Leaderboard is 441e6ac.
Here is the command you can use to evaluate your models. MODEL_PATH is the folder containing the weights and the config.json file, or a Hugging Face model ID that will be downloaded automatically. MODEL is just a name for the experiment you're running. SHOTS is the number of few-shot examples used per benchmark. subject is the task you want to evaluate on. The command also allows running on multiple GPUs in case the model you're evaluating is >30B. Please make sure you're using commit 441e6ac to reproduce the numbers on the leaderboard.
python main.py --device cuda --no_cache --model hf-causal-experimental --model_args pretrained=$MODEL_PATH,trust_remote_code=True,use_accelerate=True --tasks $subject --num_fewshot $SHOTS --output_path ./$MODEL-results/$MODEL-$subject-$SHOTS-shots.json
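For concreteness, here is a hedged example of how those variables might be filled in, followed by one way to average the per-task acc_norm values. The concrete values are placeholders, the shot counts echo the ones quoted elsewhere in this thread, and the jq expression assumes the harness writes a top-level results object keyed by task name, so verify it against one of your own output files:

```bash
# Placeholder values -- substitute your own model and task.
MODEL_PATH=yahma/llama-7b-hf      # local weights folder or Hugging Face model ID
MODEL=llama-7b                    # experiment name used in the output path
SHOTS=5                           # 25 for ARC, 10 for HellaSwag, 0 for TruthfulQA, 5 for MMLU
subject=hendrycksTest-abstract_algebra

python main.py --device cuda --no_cache --model hf-causal-experimental \
  --model_args pretrained=$MODEL_PATH,trust_remote_code=True,use_accelerate=True \
  --tasks $subject --num_fewshot $SHOTS \
  --output_path ./$MODEL-results/$MODEL-$subject-$SHOTS-shots.json

# After all 57 MMLU subtasks have been run, average acc_norm across the result files.
# Assumes the harness's {"results": {"<task>": {"acc_norm": ...}}} output layout.
jq -r '.results[].acc_norm' ./$MODEL-results/$MODEL-hendrycksTest-*-$SHOTS-shots.json \
  | awk '{ s += $1; n++ } END { if (n) printf "MMLU average acc_norm: %.4f\n", s / n }'
```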
@itanh0b Do you know how the MMLU dataset is evaluated? I am using the following command to evaluate the 57 MMLU hendrycksTest-* tasks and take the average. I got a score of 25.69% for GPT2, whereas the leaderboard reports 27.5%. For arc_challenge (25-shot), hellaswag (10-shot), and truthfulqa_mc (0-shot), I am able to reproduce the leaderboard results for GPT2.
python main.py --model hf-causal --model_args pretrained=gpt2 --tasks hendrycksTest-* --device cuda:0 --num_fewshot 5
Hello @viataur ,
I'm using commit 441e6ac of lm-evaluation-harness to reproduce the numbers in the leaderboard. Later commits lead to different results on MMLU.
@itanh0b Thank you so much for the information, I will try it out. Do you happen to know the reason for the different results on MMLU? Is it because the dataset changed, or because the code changed to a different calculation?
@itanh0b Thank you so much! I can confirm that commit 441e6ac reproduces the GPT2 results on the Leaderboard.
It is actually this PR, https://github.com/EleutherAI/lm-evaluation-harness/pull/497 , that affects the results for GPT2. The difference comes from a change in the prompt format: the original prompt uses the choice text (e.g. "mesoderm formation and occurs after neurulation.") as the answer for evaluation, while after PR 497 the prompt uses the choice letter (e.g. "A"). If you run git checkout 441e6ac lm_eval/tasks/hendrycks_test.py on the latest code, it also reproduces the MMLU results of GPT2 on the Leaderboard.
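For anyone trying to reproduce this from scratch, a minimal sketch of pinning the harness to that commit might look like the following; the editable install is an assumption on my part, so check the repository's README for the recommended setup:

```bash
# Pin lm-evaluation-harness to the commit used by the leaderboard.
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout 441e6ac
pip install -e .

# Alternatively, stay on the latest code but restore only the old MMLU prompt format:
# git checkout 441e6ac lm_eval/tasks/hendrycks_test.py
```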
I am a beginner at this; could you please tell us what steps you followed for the LLaMA evaluation using EleutherAI/lm-evaluation-harness?
Hi @arpithaabhishekgudekote ,
All the steps to reproduce this are in the About tab of the leaderboard :)