61 121 170

Qian Liu

SivilTaram

http://siviltaram.github.io/

AI & ML interests

Cooking cool things

Recent Activity

updated a model about 1 hour ago

SivilTaram/tongyao_models

updated a model about 1 hour ago

SivilTaram/tongyao_models

updated a model about 1 hour ago

SivilTaram/tongyao_models

View all activity

Organizations

Posts 4

Post

2663

Still following your human intuition to mix corpora from different sources for pre-training 🧠? Everyone says that data mixture has a big impact on model performance, but how - and why🕵️? Did you know that web corpora are actually highly impactful for downstream tasks 🏆?

Check out our preprint "RegMix: Data Mixture as Regression for Language Model Pre-training" 📄

🔬 In this paper, we've proposed an automatic data mixture method RegMix that achieves a 6.3% improvement over human selection on the widely used HellaSwag benchmark - and it only needs a 2% extra training FLOPs! 📈

📄 Paper: RegMix: Data Mixture as Regression for Language Model Pre-training (2407.01492)
💻 Code: https://github.com/sail-sg/regmix
📊 Collection: sail/regmix-data-mixture-as-regression-6682b6caab37b9442877f0ce
🎮 Demo: https://huggingface.co/spaces/sail/RegMix

View all Posts