data - a leonardlin Collection

leonardlin 's Collections

speed

sota

evals

tuning

rag

context

safety

image

vision

code

prompt injection

TOREAD

data

voice

data

updated Aug 17, 2024

A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity

Paper • 2305.13169 • Published May 22, 2023 • 3
A Survey on Data Selection for Language Models

Paper • 2402.16827 • Published Feb 26, 2024 • 4
HuggingFaceFW/fineweb-edu

Viewer • Updated Jan 31 • 3.3B • 499k • 648
allenai/MADLAD-400

Updated Sep 9, 2024 • 437k • 138
uonlp/CulturaX

Viewer • Updated Dec 16, 2024 • 7.18B • 15.4k • 497
Best Practices and Lessons Learned on Synthetic Data for Language Models

Paper • 2404.07503 • Published Apr 11, 2024 • 30
Scaling Synthetic Data Creation with 1,000,000,000 Personas

Paper • 2406.20094 • Published Jun 28, 2024 • 97
DDK: Distilling Domain Knowledge for Efficient Large Language Models

Paper • 2407.16154 • Published Jul 23, 2024 • 22
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models

Paper • 2408.02085 • Published Aug 4, 2024 • 19
Better Alignment with Instruction Back-and-Forth Translation

Paper • 2408.04614 • Published Aug 8, 2024 • 16
The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community

Paper • 2408.08291 • Published Aug 15, 2024 • 11