CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
Abstract
The driving factors behind the development of large language models (LLMs) with impressive learning capabilities are their colossal model sizes and extensive training datasets. Along with the progress in natural language processing, LLMs have been frequently made accessible to the public to foster deeper investigation and applications. However, the training datasets for these LLMs, especially the recent state-of-the-art models, are often not fully disclosed. Creating training data for high-performing LLMs involves extensive cleaning and deduplication to ensure the necessary level of quality. This lack of transparency around training data has hampered research on attributing and addressing hallucination and bias issues in LLMs, hindering replication efforts and further advancements in the community. These challenges become even more pronounced in multilingual learning scenarios, where the available multilingual text datasets are often inadequately collected and cleaned. Consequently, there is a lack of open-source and readily usable datasets to effectively train LLMs in multiple languages. To overcome this issue, we present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for LLM development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous multi-stage pipeline to achieve the best quality for model training, including language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication. CulturaX is fully released to the public on HuggingFace to facilitate research and advancements in multilingual LLMs: https://huggingface.co/datasets/uonlp/CulturaX.
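To make the pipeline stages concrete, here is a minimal sketch of how they could fit together. It assumes fastText's lid.176.bin model for language identification and the datasketch library for MinHash near-deduplication; the URL blocklist, metric thresholds, and helper names are hypothetical placeholders for illustration, not the paper's actual components or tuned values.

```python
# Minimal sketch of a CulturaX-style cleaning pipeline (illustrative only).
# Assumes fastText's lid.176.bin language-ID model and the datasketch
# library; the blocklist and thresholds below are hypothetical placeholders.
from urllib.parse import urlparse

import fasttext
from datasketch import MinHash, MinHashLSH

LID_MODEL = fasttext.load_model("lid.176.bin")  # fastText language identifier
URL_BLOCKLIST = {"example-spam.com"}            # hypothetical toxic-domain list
MIN_LANG_CONF = 0.65                            # hypothetical confidence cutoff
NUM_PERM = 128                                  # MinHash permutations

def identify_language(text: str):
    # fastText's predict() rejects newlines, so flatten the document first.
    labels, probs = LID_MODEL.predict(text.replace("\n", " "))
    return labels[0].replace("__label__", ""), float(probs[0])

def url_allowed(url: str) -> bool:
    # URL-based filtering: drop documents from blocklisted domains.
    return urlparse(url).netloc not in URL_BLOCKLIST

def passes_metrics(text: str) -> bool:
    # Stand-in for metric-based cleaning (the paper uses per-language
    # thresholds over length, symbol ratios, perplexity, etc.).
    words = text.split()
    return len(words) >= 50 and sum(w.isalpha() for w in words) / len(words) > 0.8

def refine(text: str) -> str:
    # Document refinement: strip short, noisy lines such as footer debris.
    return "\n".join(l for l in text.splitlines() if len(l.split()) > 3)

def minhash(text: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    for token in set(text.split()):
        m.update(token.encode("utf-8"))
    return m

def clean_corpus(docs):
    """docs: iterable of {'text': ..., 'url': ...}; yields cleaned documents."""
    lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)  # near-duplicate index
    for i, doc in enumerate(docs):
        lang, conf = identify_language(doc["text"])
        if conf < MIN_LANG_CONF or not url_allowed(doc["url"]):
            continue
        text = refine(doc["text"])
        if not text or not passes_metrics(text):
            continue
        m = minhash(text)
        if lsh.query(m):  # near-duplicate of an earlier document
            continue
        lsh.insert(str(i), m)
        yield {"text": text, "url": doc["url"], "language": lang}
```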
Community
I read the arXiv paper on CulturaX so you don't have to. Here are my highlights:
- New open dataset called CulturaX contains text data for 167 languages - far more than previous datasets.
- With 6.3 trillion tokens, it's the largest openly released multilingual dataset to date.
- Freely available for anyone to use for research and AI development (see the loading sketch after this list).
- Created by combining and extensively cleaning two other large datasets - mC4 and OSCAR.
- Could allow developing AI systems that work much better across many more languages.
- Helps democratize access to data to build fairer, less biased AI models.
- Allows training of new multilingual AI applications, like universal translators and assistants.
- But still requires thoughtfulness to avoid issues like bias amplification.
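For anyone who wants to poke at the data directly, here's a quick loading sketch using the standard Hugging Face `datasets` API. The "en" config is just an example; any of the 167 language codes should work the same way, and streaming avoids downloading a full split up front.

```python
# Stream a few CulturaX documents with the Hugging Face `datasets` library.
from itertools import islice

from datasets import load_dataset

# Streaming reads records on the fly instead of downloading the whole
# language split; swap "en" for any of the 167 language codes.
# (Access may require logging in and accepting the dataset terms on the Hub.)
dataset = load_dataset("uonlp/CulturaX", "en", split="train", streaming=True)

for doc in islice(dataset, 3):
    print(doc["text"][:200])  # each record carries text plus source metadata
```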
Overall, CulturaX looks set to be part of a broader global trend (I think) toward advancing multilingual AI and spreading its benefits more equally; so far those benefits have been concentrated in English-language applications.
Full summary here if you'd like to read more. Original paper is here.
Three of the dataset files have been flagged as harmful.
- ru/ru_part_00211.parquet: Virus: Win.Trojan.URLspoof-1
- zh/zh_part_00280.parquet: Virus: Win.Trojan.N-69
- ru/ru_part_00639.parquet: Virus: Vbs.Worm.CoolNote-2
What have the authors done to address this?