grimjim 
posted an update 7 days ago
I was reading through an abstract and found myself wondering how much LLM performance is being left on the table due to insufficient curation of training datasets: "Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning" by Kaur, Park, Goyal, and Arora.
https://arxiv.org/abs/2408.14774
In particular, the observation that "Introducing low quality answers ("shirkers") in 20% of Instruct-SkillMix examples causes performance to plummet..." had me wondering how many ostensibly good datasets out there are in fact populated with a significant number of "shirkers".

I suspect this is the case as well. If you grep some of the datasets that the big popular finetunes use for "I will not" or "Your post has been removed", you'll find quite a few hits.
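
A minimal sketch of that kind of scan, using the Hugging Face `datasets` library. The dataset name and the `output` column are hypothetical placeholders; substitute the dataset and response field you actually want to audit, and extend the marker list with whatever refusal phrases you care about.

```python
# Scan an instruction-tuning dataset for likely "shirker" responses
# (refusals, moderation notices, etc.).
from datasets import load_dataset

# The two strings mentioned above; extend with other refusal markers as needed.
SHIRKER_MARKERS = [
    "I will not",
    "Your post has been removed",
]

def looks_like_shirker(text: str) -> bool:
    """True if the response contains a known refusal/moderation marker."""
    return any(marker in text for marker in SHIRKER_MARKERS)

# Hypothetical dataset name; replace with the dataset you want to audit.
ds = load_dataset("some-org/some-instruct-dataset", split="train")

# Column name "output" is an assumption; adjust to the dataset's schema.
flagged = sum(1 for row in ds if looks_like_shirker(row["output"]))
print(f"{flagged} of {len(ds)} examples ({flagged / len(ds):.1%}) look like shirkers")
```

Simple substring matching like this will miss paraphrased refusals and can flag legitimate uses of the phrases, but it's a cheap first pass before anything more careful.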
