[IFEVAL Dataset] Inquiry on Performance Metrics Decrease in LLaMA 3.1 Strict Levels Between July 18 and 22 Versions
Dear LLaMA Development Team,
I am reaching out with some observations from our performance evaluation using the IFEVAL dataset on the LLaMA 3.1 model, specifically comparing the versions from July 18th and July 22nd.
Performance Metrics Comparison Table
Our testing shows that all four metrics, and therefore their average, decreased between the two versions. The table below gives the detailed comparison:
Version | Strict-Prompt-Level | Strict-Instruction-Level | Loose-Prompt-Level | Loose-Instruction-Level | Avg-Level |
---|---|---|---|---|---|
18-Jul | 0.76155268 | 0.824940048 | 0.813308688 | 0.862110312 | 0.815477932 |
22-Jul | 0.733826248 | 0.810551559 | 0.772643253 | 0.842925659 | 0.78998668 |
Change | Decrease | Decrease | Decrease | Decrease | Decrease |
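For anyone who wants to double-check the direction and size of each change, here is a minimal sketch that recomputes them from the scores in the table. The variable names and the `average` helper are ours and purely illustrative; the numbers are copied from the two rows above.

```python
# Scores copied from the comparison table; names are illustrative only.
METRICS = [
    "strict_prompt",
    "strict_instruction",
    "loose_prompt",
    "loose_instruction",
]
jul18 = [0.76155268, 0.824940048, 0.813308688, 0.862110312]
jul22 = [0.733826248, 0.810551559, 0.772643253, 0.842925659]


def average(scores):
    """Plain arithmetic mean of the four level scores."""
    return sum(scores) / len(scores)


# Per-metric change: negative means the 22-Jul score is lower.
deltas = {name: new - old for name, old, new in zip(METRICS, jul18, jul22)}
avg_delta = average(jul22) - average(jul18)

for name, delta in deltas.items():
    direction = "Decrease" if delta < 0 else "Increase"
    print(f"{name}: {delta:+.4f} ({direction})")
print(f"average: {avg_delta:+.4f}")
```

The mean of the four level scores reproduces the Avg-Level column for both dates, so that column appears to be a simple unweighted average.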
Performance Metrics Analysis
The decrease is spread across the following instruction types. The table lists each affected type with the frequency we recorded for it:
Metric Category | Specific Metric | Frequency |
---|---|---|
Punctuation | no_comma | 11 |
Length Constraints | number_words | 8 |
Keywords | frequency | 8 |
Detectable Format | number_highlighted_sections | 7 |
Language | response_language | 7 |
Length Constraints | number_paragraphs | 7 |
Start/End | quotation | 6 |
Change Case | english_lowercase | 5 |
Combination | two_responses | 5 |
Keywords | existence | 5 |
Keywords | forbidden_words | 5 |
Detectable Content | number_placeholders | 5 |
Detectable Format | number_bullet_lists | 4 |
Change Case | english_capital | 4 |
Detectable Content | postscript | 4 |
Length Constraints | nth_paragraph_first_word | 4 |
Change Case | capital_word_frequency | 4 |
Length Constraints | number_sentences | 4 |
Keywords | letter_frequency | 3 |
Detectable Format | json_format | 3 |
Combination | repeat_prompt | 2 |
Detectable Format | title | 2 |
Detectable Format | multiple_sections | 2 |
Start/End | end_checker | 2 |
Detectable Format | constrained_response | 1 |
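To see which categories account for most of the entries, the rows above can be grouped by category. This is only a sketch of how we would tally them; the `ROWS` list is transcribed from the table and the variable names are illustrative.

```python
from collections import Counter

# (category, specific metric, frequency), transcribed from the table above.
ROWS = [
    ("Punctuation", "no_comma", 11),
    ("Length Constraints", "number_words", 8),
    ("Keywords", "frequency", 8),
    ("Detectable Format", "number_highlighted_sections", 7),
    ("Language", "response_language", 7),
    ("Length Constraints", "number_paragraphs", 7),
    ("Start/End", "quotation", 6),
    ("Change Case", "english_lowercase", 5),
    ("Combination", "two_responses", 5),
    ("Keywords", "existence", 5),
    ("Keywords", "forbidden_words", 5),
    ("Detectable Content", "number_placeholders", 5),
    ("Detectable Format", "number_bullet_lists", 4),
    ("Change Case", "english_capital", 4),
    ("Detectable Content", "postscript", 4),
    ("Length Constraints", "nth_paragraph_first_word", 4),
    ("Change Case", "capital_word_frequency", 4),
    ("Length Constraints", "number_sentences", 4),
    ("Keywords", "letter_frequency", 3),
    ("Detectable Format", "json_format", 3),
    ("Combination", "repeat_prompt", 2),
    ("Detectable Format", "title", 2),
    ("Detectable Format", "multiple_sections", 2),
    ("Start/End", "end_checker", 2),
    ("Detectable Format", "constrained_response", 1),
]

# Sum the frequencies per category.
by_category = Counter()
for category, _metric, count in ROWS:
    by_category[category] += count

total = sum(by_category.values())

# Print categories from most to least frequent, with their share of the total.
for category, count in by_category.most_common():
    print(f"{category}: {count} ({count / total:.1%})")
```

Grouped this way, Length Constraints, Keywords, and Detectable Format together make up a little over half of the listed entries, although no_comma remains the single most frequent metric.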
We are particularly interested in understanding the reasons behind the performance decrease. Could you provide insights into what might have led to this change? It would be helpful to know if this was an intentional adjustment or an unintended consequence of the version updates.
Your guidance on this matter will be instrumental for our ongoing integration of, and reliance on, the LLaMA 3.1 model in our applications. We appreciate any information or clarification you can provide.