meta-llama/Llama-3.1-8B-Instruct · [IFEVAL Dataset] Inquiry on Performance Metrics Decrease in LLaMA 3.1 Strict Levels Between July 18 and 22 Versions

Dear LLaMA Development Team,

I am reaching out with some observations from our performance evaluation using the IFEVAL dataset on the LLaMA 3.1 model, specifically comparing the versions from July 18th and July 22nd.

Performance Metrics Comparison Table

Our testing shows lower scores for the July 22nd version than for the July 18th version. The table below gives the full comparison:

Version	Strict-Prompt-Level	Strict-Instruction-Level	Loose-Prompt-Level	Loose-Instruction-Level	Avg-Level
18-Jul	0.76155268	0.824940048	0.813308688	0.862110312	0.815477932
22-Jul	0.733826248	0.810551559	0.772643253	0.842925659	0.78998668
Change	Decrease	Decrease	Decrease	Decrease	Decrease
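
For reference, here is a minimal sketch of how the direction and size of each change can be checked from the two rows of scores. The column names and the `summarize` helper are our own, chosen for illustration; they are not part of any IFEval tooling. Avg-Level is assumed to be the simple mean of the four scores, which matches the values in the table.

```python
# Scores copied from the comparison table (IFEVAL, two model versions).
COLUMNS = ["strict_prompt", "strict_instruction", "loose_prompt", "loose_instruction"]
JUL_18 = [0.76155268, 0.824940048, 0.813308688, 0.862110312]
JUL_22 = [0.733826248, 0.810551559, 0.772643253, 0.842925659]


def summarize(old, new):
    """Return per-metric deltas (new - old) plus the delta of the simple mean."""
    deltas = {name: n - o for name, o, n in zip(COLUMNS, old, new)}
    # Assumption: Avg-Level is the unweighted mean of the four scores.
    deltas["avg"] = sum(new) / len(new) - sum(old) / len(old)
    return deltas


if __name__ == "__main__":
    for name, delta in summarize(JUL_18, JUL_22).items():
        direction = "Decrease" if delta < 0 else "Increase"
        print(f"{name:20s} {delta:+.4f}  {direction}")
```

Every delta is negative for these inputs. The largest drop is in Loose-Prompt-Level, at roughly 0.04.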

Performance Metrics Analysis

The decrease is spread across the following instruction types. The Frequency column gives the count we recorded for each:

Metric Category	Specific Metric	Frequency
Punctuation	no_comma	11
Length Constraints	number_words	8
Keywords	frequency	8
Detectable Format	number_highlighted_sections	7
Language	response_language	7
Length Constraints	number_paragraphs	7
Start/End	quotation	6
Change Case	english_lowercase	5
Combination	two_responses	5
Keywords	existence	5
Keywords	forbidden_words	5
Detectable Content	number_placeholders	5
Detectable Format	number_bullet_lists	4
Change Case	english_capital	4
Detectable Content	postscript	4
Length Constraints	nth_paragraph_first_word	4
Change Case	capital_word_frequency	4
Length Constraints	number_sentences	4
Keywords	letter_frequency	3
Detectable Format	json_format	3
Combination	repeat_prompt	2
Detectable Format	title	2
Detectable Format	multiple_sections	2
Start/End	end_checker	2
Detectable Format	constrained_response	1
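
To see which categories contribute most, the frequencies above can be totalled per category. The sketch below does only that. `ROWS` is transcribed from the table, and `totals_by_category` is a hypothetical helper name, not part of any evaluation library.

```python
from collections import Counter

# (category, specific metric, frequency) rows transcribed from the table above.
ROWS = [
    ("Punctuation", "no_comma", 11),
    ("Length Constraints", "number_words", 8),
    ("Keywords", "frequency", 8),
    ("Detectable Format", "number_highlighted_sections", 7),
    ("Language", "response_language", 7),
    ("Length Constraints", "number_paragraphs", 7),
    ("Start/End", "quotation", 6),
    ("Change Case", "english_lowercase", 5),
    ("Combination", "two_responses", 5),
    ("Keywords", "existence", 5),
    ("Keywords", "forbidden_words", 5),
    ("Detectable Content", "number_placeholders", 5),
    ("Detectable Format", "number_bullet_lists", 4),
    ("Change Case", "english_capital", 4),
    ("Detectable Content", "postscript", 4),
    ("Length Constraints", "nth_paragraph_first_word", 4),
    ("Change Case", "capital_word_frequency", 4),
    ("Length Constraints", "number_sentences", 4),
    ("Keywords", "letter_frequency", 3),
    ("Detectable Format", "json_format", 3),
    ("Combination", "repeat_prompt", 2),
    ("Detectable Format", "title", 2),
    ("Detectable Format", "multiple_sections", 2),
    ("Start/End", "end_checker", 2),
    ("Detectable Format", "constrained_response", 1),
]


def totals_by_category(rows):
    """Sum the frequency column for each metric category."""
    totals = Counter()
    for category, _metric, count in rows:
        totals[category] += count
    return totals


if __name__ == "__main__":
    for category, total in totals_by_category(ROWS).most_common():
        print(f"{category:20s} {total}")
```

With these rows, Length Constraints (23), Keywords (21), and Detectable Format (19) account for 63 of the 118 recorded cases.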

We are particularly interested in understanding the reasons behind the performance decrease. Could you provide insights into what might have led to this change? It would be helpful to know if this was an intentional adjustment or an unintended consequence of the version updates.

Your guidance on this matter will help us decide how to continue integrating and relying on the LLaMA 3.1 model in our applications. We appreciate any information or clarification you can provide.