Effective pruning of web-scale datasets based on complexity of concept clusters Paper • 2401.04578 • Published Jan 9
LESS: Selecting Influential Data for Targeted Instruction Tuning Paper • 2402.04333 • Published Feb 6
LongAlign: A Recipe for Long Context Alignment of Large Language Models Paper • 2401.18058 • Published Jan 31
LongHeads: Multi-Head Attention is Secretly a Long Context Processor Paper • 2402.10685 • Published Feb 16
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows Paper • 2402.10379 • Published Feb 16
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models Paper • 2402.13064 • Published Feb 20
Benchmarking Large Language Models on Controllable Generation under Diversified Instructions Paper • 2401.00690 • Published Jan 1
Chain-of-Instructions: Compositional Instruction Tuning on Large Language Models Paper • 2402.11532 • Published Feb 18
Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond Paper • 2402.17327 • Published Feb 27
Parallel Structures in Pre-training Data Yield In-Context Learning Paper • 2402.12530 • Published Feb 19
Less is More: Data Value Estimation for Visual Instruction Tuning Paper • 2403.09559 • Published Mar 14
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance Paper • 2404.04125 • Published Apr 4
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance Paper • 2403.16952 • Published Mar 25
Data-Juicer: A One-Stop Data Processing System for Large Language Models Paper • 2309.02033 • Published Sep 5, 2023
How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition Paper • 2310.05492 • Published Oct 9, 2023
Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese Paper • 2403.13638 • Published Mar 20
Revisiting Token Dropping Strategy in Efficient BERT Pretraining Paper • 2305.15273 • Published May 24, 2023
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper • 2406.17557 • Published Jun 25