HuggingPartyParis's activity

lysandre posted an update 4 days ago
SmolVLM-2 and SigLIP-2 are now part of transformers in dedicated releases!

They're added on top of the v4.49.0 release, and can be installed from the following tags: v4.49.0-SmolVLM-2 and v4.49.0-SigLIP-2.

This marks a new beginning for the release process of transformers. For the past five years, we've been doing monthly releases featuring many models (v4.49.0, the latest release, features 9 new architectures).

Starting with SmolVLM-2 & SigLIP-2, we'll now additionally release tags supporting new models on a stable branch. These models can therefore be used right away by installing from the corresponding tag, and the tags will keep receiving fixes for those models.

Going forward, you can continue to expect software releases that follow semantic versioning: v4.50.0 will have ~10 new architectures compared to v4.49.0, as well as a myriad of new features, improvements, and bug fixes. Alongside these software releases, we'll publish tags offering brand-new models as fast as possible, making them immediately accessible to everyone.
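
For context, here's a minimal sketch of what installing from one of these tags and loading SmolVLM-2 could look like. The pip-from-git syntax is standard; the checkpoint name below is only an assumption for illustration.

```python
# Install transformers from the model-specific tag (standard pip + git syntax):
#   pip install "git+https://github.com/huggingface/transformers@v4.49.0-SmolVLM-2"

from transformers import AutoProcessor, AutoModelForImageTextToText

# Hypothetical checkpoint name, used only for illustration; check the Hub for the actual SmolVLM-2 repos.
model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)
# From here, the processor/model pair behaves like any other transformers vision-language model.
```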
loubnabnl posted an update 3 months ago
Making SmolLM2 reproducible: open-sourcing our training & evaluation toolkit 🛠️ https://github.com/huggingface/smollm/

- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents

Apache 2.0 licensed. V2 pre-training data mix coming soon!

Which other tools should we add next?
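
As a quick pointer, here's a minimal sketch of pulling the smoltalk SFT dataset mentioned above with the datasets library. The "all" config name and the "messages" column are assumptions; check the dataset card for the exact schema.

```python
from datasets import load_dataset

# Load the SmolTalk SFT dataset released alongside SmolLM2.
# The "all" config is an assumption; see the dataset card for the available configs.
smoltalk = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")

# Assumed schema: chat-style turns under a "messages" column.
print(smoltalk[0]["messages"])
```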
loubnabnl posted an update 9 months ago
🍷 The FineWeb technical report is out, and so is 📚 FineWeb-Edu, a 1.3-trillion-token dataset that outperforms all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.

Technical report: HuggingFaceFW/blogpost-fineweb-v1
Dataset: HuggingFaceFW/fineweb-edu

We used Llama 3 generations to train an educational quality classifier, then filtered the 15 trillion tokens of FineWeb to keep only the samples with high educational value (an approach also used for the Llama 3 and Phi-3 training datasets). We're releasing both FineWeb-Edu and the classifier, along with a larger, less heavily filtered version containing 5.4 trillion tokens.
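
As a rough sketch of how the released dataset can be consumed without downloading all 1.3 trillion tokens, streaming works well here. The "sample-10BT" config name and the "url"/"score" fields are assumptions; check the dataset card for the actual layout.

```python
from datasets import load_dataset

# Stream FineWeb-Edu rather than materializing the full corpus on disk.
# "sample-10BT" is an assumed config name for a small sample subset.
fw_edu = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train", streaming=True)

# Peek at a few documents; "url" and "score" are assumed field names.
for doc in fw_edu.take(3):
    print(doc["url"], doc["score"])
```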

You can find more details about the dataset and the experiments we ran in the FineWeb technical report. It's a 45-minute read, but it contains all the secret sauce for building high-quality web datasets.

Enjoy!