Open-source the evaluation code
Could the code for evaluating LLMs be released? Is it completely based on EleutherAI/lm-evaluation-harness? It's not clear how the in-context examples are selected.
Thanks!
I replicated the results of LLaMA and Vicuna on this leaderboard perfectly using the EleutherAI/lm-evaluation-harness. The metrics are acc_norm for ARC-Challenge, MMLU, and Hellaswag, and mc2 for truthfulQA_mc.
Yep, I could also run the evaluations using EleutherAI's repository, but I can't find which metrics are used. Is it documented somewhere I am not aware of?
Yes, go to Files (next to App) of open_llm_leaderboard, then open the utils.py file. It lists the benchmarks and the metrics.
Also, EleutherAI/lm-evaluation-harness doesn't provide good support for evaluating huge models (>20B). It would be great if open_llm_leaderboard could share their pipeline.
Hi, could you please share the exact command you ran? I found "hendrycks" for MMLU, but there are a ton of different sub-tasks of hendrycks (like hendrycksTest-abstract_algebra). Is there a way to run them all?
Thanks!
Hi @64bits ,
MMLU has 57 different tasks. They are formatted as hendrycksTest-{sub} in the lm-evaluation-harness, where sub is a topic like abstract_algebra. You need to evaluate on all the tasks and compute the average of acc_norm across tasks. You can write a bash script that creates an array of topics and loops over them to run them sequentially, which will be very slow; I ran the evaluation in parallel across tasks on a Slurm-based compute cluster.
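For illustration, here is a minimal bash sketch of that sequential approach. The subject list is truncated, the model ID is just a placeholder taken from later in this thread, and the main.py flags mirror the commands shared elsewhere in this discussion, so adapt everything to your own setup:

```bash
#!/usr/bin/env bash
# Hypothetical sketch: run a handful of the 57 MMLU subtasks sequentially.
# Extend SUBJECTS to all 57 hendrycksTest-* topics for a full evaluation.
MODEL_PATH=yahma/llama-7b-hf                     # placeholder model ID or local path
SUBJECTS=(abstract_algebra anatomy astronomy)    # ...plus the remaining subjects

for sub in "${SUBJECTS[@]}"; do
  python main.py \
    --model hf-causal \
    --model_args pretrained=$MODEL_PATH \
    --tasks hendrycksTest-$sub \
    --num_fewshot 5 \
    --device cuda:0 \
    --output_path ./results/hendrycksTest-$sub-5shot.json
done
```

Averaging acc_norm over the resulting per-subject JSON files then gives the leaderboard-style MMLU number.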
Could you tell me the name of the llama-7b Hugging Face model? I'm struggling because the result of yahma/llama-7b-hf does not match the leaderboard result. @itanh0b It would be very helpful if you could kindly share the model's name or the command you used.
Hello @v-xchen-v ,
I converted the LLaMA model to the Hugging Face format myself, so I do not know how yahma/llama-7b-hf would do. Are you getting worse or better results? The commit which reproduces the Open LLM Leaderboard is 441e6ac.
Here is the command you can use to evaluate your models. MODEL_PATH is the folder containing the weights and the config.json file, or a Hugging Face model ID that will be downloaded automatically. MODEL is just a name for the experiment you're running. SHOTS is the number of few-shot examples used per benchmark. subject is the task you want to evaluate on. The command also allows running on multiple GPUs in case the model you're evaluating is >30B. Please make sure you're using commit 441e6ac to reproduce the numbers on the leaderboard.
python main.py --device cuda --no_cache --model hf-causal-experimental --model_args pretrained=$MODEL_PATH,trust_remote_code=True,use_accelerate=True --tasks $subject --num_fewshot $SHOTS --output_path ./$MODEL-results/$MODEL-$subject-$SHOTS-shots.json
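For concreteness, here is a hedged example of how those variables might be filled in, followed by one way to average the per-task acc_norm values. The concrete values are placeholders, the shot counts echo the ones quoted elsewhere in this thread, and the jq expression assumes the harness writes a top-level results object keyed by task name, so verify it against one of your own output files:

```bash
# Placeholder values -- substitute your own model and task.
MODEL_PATH=yahma/llama-7b-hf      # local weights folder or Hugging Face model ID
MODEL=llama-7b                    # experiment name used in the output path
SHOTS=5                           # 25 for ARC, 10 for HellaSwag, 0 for TruthfulQA, 5 for MMLU
subject=hendrycksTest-abstract_algebra

python main.py --device cuda --no_cache --model hf-causal-experimental \
  --model_args pretrained=$MODEL_PATH,trust_remote_code=True,use_accelerate=True \
  --tasks $subject --num_fewshot $SHOTS \
  --output_path ./$MODEL-results/$MODEL-$subject-$SHOTS-shots.json

# After all 57 MMLU subtasks have been run, average acc_norm across the result files.
# Assumes the harness's {"results": {"<task>": {"acc_norm": ...}}} output layout.
jq -r '.results[].acc_norm' ./$MODEL-results/$MODEL-hendrycksTest-*-$SHOTS-shots.json \
  | awk '{ s += $1; n++ } END { if (n) printf "MMLU average acc_norm: %.4f\n", s / n }'
```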
@itanh0b Do you know how the MMLU dataset is evaluated? I am using the following command to evaluate the 57 MMLU hendrycksTest-* tasks and take the average. I got a score of 25.69% for GPT2, whereas the leaderboard reports 27.5%. For arc_challenge (25-shot), hellaswag (10-shot), and truthfulqa_mc (0-shot), I am able to reproduce the leaderboard results for GPT2.
python main.py --model hf-causal --model_args pretrained=gpt2 --tasks hendrycksTest-* --device cuda:0 --num_fewshot 5
Hello @viataur ,
I'm using commit 441e6ac of lm-evaluation-harness to reproduce the numbers in the leaderboard. Later commits lead to different results on MMLU.
@itanh0b Thank you so much for the information, I will try it out. Do you happen to know the reason for the different results on MMLU? Is it because the dataset changed, or because the code changed to a different calculation?
@itanh0b Thank you so much! I can confirm that commit 441e6ac reproduces the GPT2 results on the Leaderboard.
It is actually this PR, https://github.com/EleutherAI/lm-evaluation-harness/pull/497 , that affects the results for GPT2. The difference comes from a change in the prompt format: the original prompt uses the choice text (e.g. "mesoderm formation and occurs after neurulation.") as the answer for evaluation, while after PR 497 the prompt uses the choice letter (e.g. "A"). If you run git checkout 441e6ac lm_eval/tasks/hendrycks_test.py on the latest code, it also reproduces the MMLU results of GPT2 on the Leaderboard.
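For anyone trying to reproduce this from scratch, a minimal sketch of pinning the harness to that commit might look like the following; the editable install is an assumption on my part, so check the repository's README for the recommended setup:

```bash
# Pin lm-evaluation-harness to the commit used by the leaderboard.
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout 441e6ac
pip install -e .

# Alternatively, stay on the latest code but restore only the old MMLU prompt format:
# git checkout 441e6ac lm_eval/tasks/hendrycks_test.py
```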
I am a beginner at this; could you please tell us what steps you followed for the LLaMA evaluation using EleutherAI/lm-evaluation-harness?
Hi @arpithaabhishekgudekote ,
All the steps to reproduce this are in the About tab of the leaderboard :)