HuggingFaceFW


AI & ML interests

None defined yet.

Recent Activity

anton-l updated a dataset 5 days ago
HuggingFaceFW/fineweb-edu
guipenedo updated a Space 7 days ago
HuggingFaceFW/blogpost-fineweb-v1
guipenedo updated a dataset 7 days ago
HuggingFaceFW/fineweb

HuggingFaceFW's activity

anton-l
posted an update 6 days ago
Introducing 📐 FineMath: the best public math pre-training dataset with 50B+ tokens!
HuggingFaceTB/finemath

Math remains challenging for LLMs, and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.

We built the dataset by:
🛠️ carefully extracting math data from Common Crawl;
🔎 iteratively filtering and recalling high-quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.

We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observed notable gains over both the baseline model and models trained on other public math datasets.

We hope this helps advance the performance of LLMs on math and reasoning! 🚀
We’re also releasing all the ablation models as well as the evaluation code.

HuggingFaceTB/finemath-6763fb8f71b6439b653482c2
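
If you want to poke at the data, here is a minimal sketch of streaming it with the 🤗 datasets library; the finemath-4plus config name and the text column are assumptions, so check the dataset card for the exact configs and schema.

```python
from datasets import load_dataset

# Stream FineMath instead of downloading the whole 50B+ token dataset.
# NOTE: the config name is an assumption; see the HuggingFaceTB/finemath
# dataset card for the configs that are actually published.
ds = load_dataset(
    "HuggingFaceTB/finemath",
    name="finemath-4plus",
    split="train",
    streaming=True,
)

for i, sample in enumerate(ds):
    print(sample["text"][:200])  # "text" column assumed, as in FineWeb-style datasets
    if i >= 2:
        break
```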
thomwolf
posted an update 16 days ago
We are proud to announce HuggingFaceFW/fineweb-2: a sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️ languages.

We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

In the meantime, come ask us questions in our chat space: HuggingFaceFW/discussion

H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi
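
For anyone who wants to try it, here is a minimal sketch of streaming a single language subset with the datasets library; the fra_Latn config name follows a language_script naming pattern but is an assumption here, as is the text field, so check the dataset card for the exact names.

```python
from datasets import load_dataset

# Stream one language subset of FineWeb-2 rather than downloading all 8TB.
# NOTE: "fra_Latn" (French in Latin script) is an assumed config name; the
# HuggingFaceFW/fineweb-2 dataset card lists the configs that actually exist.
fw2 = load_dataset(
    "HuggingFaceFW/fineweb-2",
    name="fra_Latn",
    split="train",
    streaming=True,
)

for doc in fw2.take(3):
    print(doc["text"][:200])  # "text" field assumed, as in the original FineWeb
```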
garrethlee
posted an update 19 days ago
The latest o1 model from OpenAI still can't correctly answer whether 9.11 > 9.9 🤔

A possible explanation? Tokenization - and our latest work investigates how it affects a model's ability to do math!

In this blog post, we discuss:
🔢 The different ways numbers are tokenized in modern LLMs
🧪 Our detailed approach to comparing these various methods
🥪 How we got a free boost in arithmetic performance by adding a few lines of code to the base Llama 3 tokenizer (see the sketch below)
👑 and a definitive, best tokenization method for math in LLMs!

Check out our work here: huggingface/number-tokenization-blog
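
To make the 🥪 point concrete, here is a small, purely illustrative sketch of right-to-left (R2L) digit grouping; the helpers below are hypothetical and are not the actual change made to the Llama 3 tokenizer in the blog post.

```python
import re

def group_l2r(digits: str, size: int = 3) -> list[str]:
    """Left-to-right grouping: '1234567' -> ['123', '456', '7']."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]

def group_r2l(digits: str, size: int = 3) -> list[str]:
    """Right-to-left grouping: '1234567' -> ['1', '234', '567']."""
    head = len(digits) % size or size
    return [digits[:head]] + group_l2r(digits[head:], size)

def preprocess_numbers_r2l(text: str) -> str:
    """One cheap trick: insert spaces in long numbers so a 3-digit tokenizer sees R2L groups."""
    return re.sub(r"\d{4,}", lambda m: " ".join(group_r2l(m.group())), text)

print(group_l2r("1234567"))                   # ['123', '456', '7']
print(group_r2l("1234567"))                   # ['1', '234', '567']
print(preprocess_numbers_r2l("12345 + 678"))  # '12 345 + 678'
```

The only difference between the two schemes is which end gets the short leftover chunk; the R2L-vs-L2R comparison in these posts measures how much that difference matters for arithmetic.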
thomwolf
posted an update 19 days ago
thomwolf
posted an update 21 days ago
garrethlee
posted an update 26 days ago
Does tokenizing numbers into single digits outperform three-digit or BPE tokenization for arithmetic tasks? We explore various tokenization methods in our upcoming blog (releasing next week 👀)!

🔹 Bringing objectivity to comparisons

Existing comparisons of number tokenization methods often ignore the difference in models’ compute budgets: larger tokenizer vocabularies naturally lead to more parameters, which makes performance comparisons less objective because the bigger models simply get to do more "learning".

We addressed this by keeping architectures consistent but adjusting the number of hidden layers to produce roughly equal parameter counts.
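
As a rough illustration of that matching step, here is a back-of-envelope sketch (the formula and all numbers are hypothetical, not our actual configurations): embedding parameters grow with vocabulary size, so a smaller-vocab tokenizer gets extra hidden layers until the totals roughly agree.

```python
def transformer_params(vocab_size: int, d_model: int, n_layers: int,
                       ffn_mult: int = 4, tied_embeddings: bool = True) -> int:
    """Rough decoder-only parameter count (biases and norms ignored)."""
    embed = vocab_size * d_model * (1 if tied_embeddings else 2)
    per_layer = 4 * d_model**2 + 2 * ffn_mult * d_model**2  # attention + MLP
    return embed + n_layers * per_layer

def layers_for_budget(target: int, vocab_size: int, d_model: int) -> int:
    """Pick the layer count whose total comes closest to a target parameter budget."""
    return min(range(1, 100),
               key=lambda n: abs(transformer_params(vocab_size, d_model, n) - target))

# Hypothetical example: match a 32k-vocab baseline with a much smaller digit-level vocab.
budget = transformer_params(vocab_size=32_000, d_model=2048, n_layers=24)
print(layers_for_budget(budget, vocab_size=4_000, d_model=2048))  # -> 25, one extra layer
```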

🔹 Key results

We trained models on the same data mix and evaluated their performance on various arithmetic tasks (digits, operations, floats vs. ints):

- When splitting evals based on operators, single-digit tokenization consistently outperformed other methods.
- Right-to-left tokenization (which I covered in a previous post) matched or exceeded left-to-right approaches in all tasks.

All in all, single-digit tokenization comes out best, and, echoing our previous post’s finding, R2L works better than L2R tokenization, although that gap is smaller than the one between single-digit and the rest!

The wait is almost over 🤗, the full report is coming next week - stay tuned!
loubnabnl
posted an update about 1 month ago
Making SmolLM2 reproducible: open-sourcing our training & evaluation toolkit 🛠️ https://github.com/huggingface/smollm/

- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents

Apache 2.0 licensed. V2 pre-training data mix coming soon!

Which other tools should we add next?
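
For reference, a minimal sketch of trying a SmolLM2 checkpoint with transformers; the checkpoint name below is an assumption, so check the SmolLM2 collection on the Hub for the models that are actually published.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# NOTE: the checkpoint name is an assumption; pick one from the SmolLM2 collection.
checkpoint = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

messages = [{"role": "user",
             "content": "Summarize in one sentence: SmolLM2 now ships with open training and evaluation code."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```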
thomwolf
posted an update about 1 month ago
SaylorTwift
posted an update about 1 month ago
thomwolf
posted an update about 1 month ago
thomwolf
posted an update 2 months ago
Parents in the 1990s: Teach the kids to code
Parents now: Teach the kids to fix the code when it starts walking around 🤖✨