Hugging Face TB Research


AI & ML interests

Exploring smol models and high-quality datasets, both web data and synthetic data generated by LLMs (TB stands for Textbook, inspired by the "Textbooks Are All You Need" paper)

Recent Activity

anton-l new activity 1 day ago

HuggingFaceTB/finemath:[bot] Conversion to Parquet

anton-l updated a dataset 1 day ago

HuggingFaceTB/math_tasks

loubnabnl new activity 1 day ago

HuggingFaceTB/finemath:Why did you use CC rather than FineWeb to create FineMath?


HuggingFaceTB's activity

merve

posted an update about 14 hours ago

Post

541

QwQ can see 🔥
The Qwen team released QVQ, a large vision LM with reasoning 😱

It outperforms proprietary VLMs on several benchmarks and comes with open weights and a demo!
Check them out ⬇️
Demo: Qwen/QVQ-72B-preview
Model: Qwen/QVQ-72B-Preview
Read more: https://qwenlm.github.io/blog/qvq-72b-preview/
Congratulations @JustinLin610 and team!

anton-l

in HuggingFaceTB/finemath 1 day ago

[bot] Conversion to Parquet

#1 opened 5 days ago by

parquet-converter

anton-l

updated a dataset 1 day ago

HuggingFaceTB/math_tasks

Viewer • Updated 1 day ago • 21.3k • 52 • 1

loubnabnl

in HuggingFaceTB/finemath 1 day ago

Why did you use CC rather than FineWeb to create FineMath?

#3 opened 2 days ago by

CryptAL

anton-l

in HuggingFaceTB/finemath 2 days ago

[Bug] cannot get prompts

#2 opened 2 days ago by

BigDong

anton-l

updated a dataset 2 days ago

HuggingFaceTB/finemath

Viewer • Updated 2 days ago • 48.3M • 6.86k • 144

Xenova

in HuggingFaceTB/SmolLM-1.7B 3 days ago

onnx model has additional unknown input

#7 opened 4 days ago by

SantoshHF

davanstrien

posted an update 5 days ago

Post

1503

Introducing FineWeb-C 🌐🎓, a community-built dataset for improving language models in ALL languages.

Inspired by FineWeb-Edu, the community is labelling the educational quality of texts in many languages.

318 annotators, 32K+ annotations, 12 languages - and growing! 🌍

data-is-better-together/fineweb-c

loubnabnl

updated a Space 5 days ago

Running

👁

README

anton-l

updated a Space 5 days ago

Running

👁

README

anton-l

posted an update 6 days ago

Post

1951

Introducing 📐𝐅𝐢𝐧𝐞𝐌𝐚𝐭𝐡: the best public math pre-training dataset with 50B+ tokens!
HuggingFaceTB/finemath

Math remains challenging for LLMs, and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.

We build the dataset by:
🛠️ carefully extracting math data from Common Crawl;
🔎 iteratively filtering and recalling high-quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.
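
The classifier-based filtering step above can be sketched roughly as follows. This is only a toy illustration, not the FineMath pipeline itself: `toy_math_score`, `MATH_HINTS`, `filter_pages`, and the threshold of 3 are all hypothetical names and values, and a simple keyword count stands in for the classifier trained on synthetic annotations.

```python
from typing import Callable, List

# Hypothetical keyword hints; the real pipeline uses a trained classifier instead.
MATH_HINTS = ("theorem", "proof", "equation", "solve", "=")


def toy_math_score(text: str) -> int:
    """Hypothetical stand-in for a learned quality score: counts hint matches, capped at 5."""
    lowered = text.lower()
    hits = sum(1 for hint in MATH_HINTS if hint in lowered)
    return min(hits, 5)


def filter_pages(
    pages: List[str],
    score_fn: Callable[[str], int] = toy_math_score,
    threshold: int = 3,  # assumed cutoff, chosen only for this example
) -> List[str]:
    """Keep only the pages whose score reaches the threshold."""
    return [page for page in pages if score_fn(page) >= threshold]


pages = [
    "Solve the equation x + 2 = 5 and give a proof of uniqueness.",
    "Top ten travel destinations for the summer.",
    "The theorem follows once we solve for y.",
]

kept = filter_pages(pages, threshold=3)
for page in kept:
    print(toy_math_score(page), page)
```

Swapping `score_fn` for a real model's prediction function would keep the filtering logic unchanged; only the scoring would differ.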

We ran a series of ablations measuring the performance of Llama-3.2-3B-Base after continued pre-training on FineMath, and observed notable gains over the baseline model and over continued pre-training on other public math datasets.

We hope this helps advance the performance of LLMs on math and reasoning! 🚀
We’re also releasing all the ablation models as well as the evaluation code.

HuggingFaceTB/finemath-6763fb8f71b6439b653482c2