Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation Jun 20, 2024 • 12
Synthetic dataset generation techniques: generating custom sentence similarity data May 23, 2024 • 16
Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia? May 7, 2024 • 8
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20, 2024 • 74
Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 29
Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub Aug 2, 2023 • 1
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training Paper • 2501.18511 • Published 4 days ago • 15 • 4
The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models Paper • 2501.09653 • Published 19 days ago • 12 • 2