GSM8K Evaluation Result: 84.5 vs. 76.95
In the Llama 3.1 technical report, Llama-3.1-8B was reported to score 84.5 on the GSM8K benchmark. However, when I evaluated it with lm-evaluation-harness:

```
accelerate launch -m lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct --tasks gsm8k --batch_size auto
```

I got the following result:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.7695|± |0.0116|
| | |strict-match | 5|exact_match|↑ |0.7521|± |0.0119|
There seems to be a significant discrepancy. Am I missing something in the evaluation settings?
@tanliboy Did you ensure that it is set to `num_fewshot=8`?
@Orenguteng thanks for pointing it out! The above result was actually 5-shot.
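For the rerun, the only change should be an explicit shot count; a sketch of the adjusted command, assuming everything else stays the same as my first invocation:

```
accelerate launch -m lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct --tasks gsm8k --num_fewshot 8 --batch_size auto
```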
I corrected it with a new run and got the result below:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 8|exact_match|↑ |0.7779|± |0.0114|
| | |strict-match | 8|exact_match|↑ |0.7672|± |0.0116|
It is better than 5-shot, but there is still a wide gap.
@wukaixingxp Any thoughts on the difference?
Please check my README about reproducing the Hugging Face leaderboard evaluation. Basically, you need to check out the right branch under their fork and use `--apply_chat_template --fewshot_as_multiturn` for the instruct model. I used the command:

```
accelerate launch -m lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16 --apply_chat_template --fewshot_as_multiturn --log_samples --output_path eval_results --tasks gsm8k --batch_size 4
```

and got this result, which is closer to our reported number of 84.5:
hf (pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 4
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8234|± |0.0105|
| | |strict-match | 5|exact_match|↑ |0.7968|± |0.0111|
I think the difference (0.79 vs. 0.85) can come from the different prompting style and n-shot (5 vs. 8). I just found there is a gsm8k-cot-llama.yaml created by a community user that follows our style. While this is not an official Meta implementation, I got a closer result with it. My command was:

```
accelerate launch -m lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16 --apply_chat_template --fewshot_as_multiturn --log_samples --output_path eval_results --tasks gsm8k_cot_llama --batch_size 4
```
hf (pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 4
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|---------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot_llama| 3|flexible-extract| 8|exact_match|↑ |0.8544|± |0.0097|
| | |strict-match | 8|exact_match|↑ |0.8514|± |0.0098|
Let me know if you have any more questions!
Thank you, @wukaixingxp !
> I think the difference (0.79 vs 0.85) can come from different prompting style and n-shot (5 vs 8).
Would you mind elaborating on the different prompting styles? Which prompting style should we use for Llama, and how do its requirements differ from those of other models?
I tested with your commands above and ran into an error:

```
ValueError: If fewshot_as_multiturn is set, num_fewshot must be greater than 0.
```

But after setting it to 8 shots, I saw a result similar to the one in your report. It seems `--apply_chat_template --fewshot_as_multiturn` is critical here.
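For reference, the full invocation that worked for me looked roughly like this (a sketch reconstructed from the commands above; the explicit `--num_fewshot 8` is what avoids the ValueError):

```
accelerate launch -m lm_eval --model hf \
  --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16 \
  --tasks gsm8k --num_fewshot 8 \
  --apply_chat_template --fewshot_as_multiturn \
  --batch_size 4
```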
Another significant gap I saw was in the ifeval evaluation result:

```
accelerate launch -m lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct --tasks ifeval --batch_size 32
```
|Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------|------:|------|-----:|-----------------------|---|-----:|---|------|
|ifeval| 2|none | 0|inst_level_loose_acc |↑ |0.6223|± | N/A|
| | |none | 0|inst_level_strict_acc |↑ |0.5935|± | N/A|
| | |none | 0|prompt_level_loose_acc |↑ |0.4843|± |0.0215|
| | |none | 0|prompt_level_strict_acc|↑ |0.4455|± |0.0214|
The score is very low compared to the reported result (80.4), whereas Gemma-2-9b-it achieves 76 with the same settings.
Please follow the open_llm_leaderboard reproducibility section to install the correct version; this will solve your `ValueError: If fewshot_as_multiturn is set, num_fewshot must be greater than 0.` error. `--apply_chat_template --fewshot_as_multiturn` is required, as the instruct model needs the chat_template to work. I think the main difference between `gsm8k` and `gsm8k-cot-llama` is the `doc_to_text` config, which defines the prompt style, as shown here: gsm8k-cot-llama vs. gsm8k. Please compare those two YAML files to understand the differences in detail.
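If you want to see the prompt difference directly, one way is to diff the two task configs; a sketch, with the paths assumed from the harness repo layout:

```
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
diff lm-evaluation-harness/lm_eval/tasks/gsm8k/gsm8k.yaml \
     lm-evaluation-harness/lm_eval/tasks/gsm8k/gsm8k-cot-llama.yaml
```

As for your ifeval gap: ifeval is a 0-shot task, so my guess is that only the chat template flag applies there; a sketch, untested on my side:

```
accelerate launch -m lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16 --apply_chat_template --tasks ifeval --batch_size 4
```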
Thank you, @wukaixingxp !
@wukaixingxp The 84.5 result (Table 2) in the technical report is for Llama 3 8B, not for Llama 3 8B Instruct. I think there is still a discrepancy with the numbers in the reply above: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/discussions/81#66ccde9250c670b0ff5d49d6
I believe the GSM8K 84.5 is for Llama 3.1 8B Instruct and 80.6 is for Llama 3 8B Instruct; please check our model card.
I've made quite a lot of fine-tuning attempts with this model, but one issue that keeps troubling me is the significant drop in IFEval scores each time I fine-tune.
So far, I haven't found a dataset or method that lets me retain the IFEval score while fine-tuning.
Do you have any suggestions or insights on how to address this?
@tanliboy It's very, very hard to tune on top of an instruct model and retain its intelligence; it's all about the tuning parameters, dataset, and methods used, but it's doable. You can see my research and check https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard for comparison:
Keep a lookout; I will post more about my results in the future, as I'm currently researching this area.
@Orenguteng it is great to know that you retained the IFEval score while improving GPQA.
Any suggestions or insights on this dimension? Have you incorporated the FLAN collections (https://huggingface.co/datasets/Open-Orca/FLAN)?
Also, do you know how I can reproduce these scores on the leaderboard dashboard?
I tried:

```
git clone git@github.com:huggingface/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout adding_all_changess
accelerate launch -m lm_eval --model_args pretrained=<model>,dtype=bfloat16 --log_samples --output_path eval_results --tasks leaderboard --batch_size 4 --apply_chat_template --fewshot_as_multiturn
```

following the instructions on the page, but I got the results below.
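One step not shown above is the editable install after the checkout. As far as I remember, the leaderboard docs list it roughly like this (the exact extras are my recollection, so treat them as an assumption):

```
pip install -e ".[math,ifeval,sentencepiece]"
```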
Base:
| Groups |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------------------------|-------|------|-----:|-----------------------|---|-----:|---|------|
|leaderboard |N/A |none | 0|acc |↑ |0.3782|± |0.0044|
| | |none | 0|acc_norm |↑ |0.4617|± |0.0054|
| | |none | 0|exact_match |↑ |0.1707|± |0.0098|
| | |none | 0|inst_level_loose_acc |↑ |0.8441|± |N/A |
| | |none | 0|inst_level_strict_acc |↑ |0.8106|± |N/A |
| | |none | 0|prompt_level_loose_acc |↑ |0.7782|± |0.0179|
| | |none | 0|prompt_level_strict_acc|↑ |0.7320|± |0.0191|
| - leaderboard_bbh |N/A |none | 3|acc_norm |↑ |0.5070|± |0.0063|
| - leaderboard_gpqa |N/A |none | 0|acc_norm |↑ |0.2894|± |0.0131|
| - leaderboard_math_hard|N/A |none | 4|exact_match |↑ |0.1707|± |0.0098|
| - leaderboard_musr |N/A |none | 0|acc_norm |↑ |0.3876|± |0.0171|
My Fine-tuning:
| Groups |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------------------------|-------|------|-----:|-----------------------|---|-----:|---|------|
|leaderboard |N/A |none | 0|acc |↑ |0.3526|± |0.0044|
| | |none | 0|acc_norm |↑ |0.4573|± |0.0054|
| | |none | 0|exact_match |↑ |0.1110|± |0.0084|
| | |none | 0|inst_level_loose_acc |↑ |0.7326|± |N/A |
| | |none | 0|inst_level_strict_acc |↑ |0.7050|± |N/A |
| | |none | 0|prompt_level_loose_acc |↑ |0.6322|± |0.0208|
| | |none | 0|prompt_level_strict_acc|↑ |0.6026|± |0.0211|
| - leaderboard_bbh |N/A |none | 3|acc_norm |↑ |0.4956|± |0.0063|
| - leaderboard_gpqa |N/A |none | 0|acc_norm |↑ |0.2919|± |0.0132|
| - leaderboard_math_hard|N/A |none | 4|exact_match |↑ |0.1110|± |0.0084|
| - leaderboard_musr |N/A |none | 0|acc_norm |↑ |0.4259|± |0.0176|
The scores are quite different from those reported on the leaderboard page.
@tanliboy Just showcasing that it is possible. I'm unfortunately not able to provide details about my training, but I can give you two insights:
1: No additional knowledge was trained into it in my case; it was a custom-made dataset built only for alignment research purposes. Biases and alignment in LLMs are proven to "dumb down" the model, and this is what I'm showcasing, which will matter even more for future releases. There was therefore no contamination of the dataset with eval data whatsoever.
2: It's all about parameter tuning as well as the quality of your dataset, etc. Aim for high quality, not quantity. One bad entry can leave traces and contaminate the whole outcome if you are "unlucky" and the model catches on to it.
Also, what I've noticed is that most tunes are affected most in the math evals, unless they contaminate their training with eval data.
Thanks, @Orenguteng !
Could I ask whether your fine-tuning is SFT-only, or does it include preference alignment (like DPO or RLHF/RLAIF)?
@tanliboy I'm using a custom-built framework with different methods combined, all customized. I can't provide more details, unfortunately, but it's not impossible; my initial tune (3.1 V1) had worse results for math, V2 improved, and V3 will be better.