---
title: README
emoji: 👀
colorFrom: purple
colorTo: pink
sdk: static
pinned: false
---

# 🤗 HuggingFace 🍷 FineWeb datasets

This organization hosts the 🍷 FineWeb datasets, a collection of text datasets sourced from the web ([CommonCrawl](https://commoncrawl.org/)) and released under a permissive license ([ODC-By](https://opendatacommons.org/licenses/by/1-0/)).

Creating 🍷 FineWeb involved careful processing and filtering of large amounts of web data, with the aim of lowering the barrier to entry for anyone intending to pretrain high-performance large language models.

All code and artefacts needed for reproduction are public and built on top of open source libraries, such as the 🤗 libraries [`datatrove`](https://github.com/huggingface/datatrove/), [`nanotron`](https://github.com/huggingface/nanotron/) and [`lighteval`](https://github.com/huggingface/lighteval/).

_Currently releasing v1_