title: README
emoji: 👀
colorFrom: purple
colorTo: pink
sdk: static
pinned: false

🤗 HuggingFace 🍷 FineWeb datasets

This organization hosts the 🍷 FineWeb datasets, a collection of text datasets sourced from the web (CommonCrawl), released under a permissive license (ODC-By).

Creating 🍷 FineWeb involved careful processing and filtering of large amounts of web data, with the aim of lowering the barrier to entry for anyone who wants to pretrain high-performance large language models.

All code and artefacts needed for reproduction are public and built on top of open source libraries, like the 🤗 libraries datatrove, nanotron or lighteval.

Currently releasing v1
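
As a rough illustration of how the hosted datasets might be inspected, the sketch below streams a few records with the 🤗 `datasets` library instead of downloading everything. The repository id `HuggingFaceFW/fineweb`, the config name `sample-10BT`, and the `text` field are assumptions not stated in this README, so check the dataset card before relying on them. The helper names `load_fineweb_stream` and `preview` are hypothetical.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def load_fineweb_stream(repo_id: str = "HuggingFaceFW/fineweb",
                        config: str = "sample-10BT"):
    """Open a streaming view of a FineWeb dataset (needs network access).

    The default repo id and config name are assumptions; see the dataset card.
    """
    # Imported lazily so preview() below works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, name=config, split="train", streaming=True)


def preview(records: Iterable[Dict[str, Any]], n: int = 3,
            text_field: str = "text", width: int = 80) -> List[str]:
    """Return the first `width` characters of the first `n` records."""
    return [str(rec.get(text_field, ""))[:width] for rec in islice(records, n)]


# Example call, commented out because it fetches data from the Hub:
# for snippet in preview(load_fineweb_stream()):
#     print(snippet)
```

Streaming keeps the example light: only the records actually consumed by `preview` are fetched, which matters for web-scale corpora.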