A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity Paper • 2305.13169 • Published May 22, 2023 • 3
Best Practices and Lessons Learned on Synthetic Data for Language Models Paper • 2404.07503 • Published Apr 11 • 29
Scaling Synthetic Data Creation with 1,000,000,000 Personas Paper • 2406.20094 • Published Jun 28 • 94
DDK: Distilling Domain Knowledge for Efficient Large Language Models Paper • 2407.16154 • Published Jul 23 • 20
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models Paper • 2408.02085 • Published Aug 4 • 17
Better Alignment with Instruction Back-and-Forth Translation Paper • 2408.04614 • Published Aug 8 • 14
The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community Paper • 2408.08291 • Published Aug 15 • 9