AI & ML interests

None defined yet.

Recent Activity

bigscience-data's activity

albertvillanova 
posted an update 5 days ago
yjernite 
posted an update about 1 month ago
view post
Post
2100
🇪🇺 Policy Thoughts in the EU AI Act Implementation 🇪🇺

There is a lot to like in the first draft of the EU GPAI Code of Practice, especially as regards transparency requirements. The Systemic Risks part, on the other hand, is concerning for both smaller developers and for external stakeholders.

I wrote more on this topic ahead of the next draft. TLDR: more attention to immediate large-scale risks and to collaborative solutions supported by evidence can help everyone - as long as developers disclose sufficient information about their design choices and deployment contexts.

Full blog here, based on our submitted response with @frimelle and @brunatrevelin :

https://huggingface.co/blog/yjernite/eu-draft-cop-risks#on-the-proposed-taxonomy-of-systemic-risks
  • 2 replies
·
lhoestq 
posted an update about 1 month ago
view post
Post
1697
Made a HF Dataset editor a la gg sheets here: lhoestq/dataset-spreadsheets

With Dataset Spreadsheets:
✏️ Edit datasets in the UI
🔗 Share link with collaborators
🐍 Use locally in DuckDB or Python

Available for the 100,000+ parquet datasets on HF :)
thomwolf 
posted an update about 1 month ago
view post
Post
4717
We are proud to announce HuggingFaceFW/fineweb-2: A sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️languages.

We applied the same data-driven approach that led to SOTA English performance in🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

In the mean time come ask us question on our chat place: HuggingFaceFW/discussion

H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi
  • 2 replies
·
christopher 
posted an update about 1 month ago
view post
Post
1599
The folks at Foursquare released a dataset of 104.5 million places of interest ( foursquare/fsq-os-places) and here's all of them on a plot
·
christopher 
posted an update about 1 month ago
thomwolf 
posted an update about 1 month ago
thomwolf 
posted an update about 1 month ago
loubnabnl 
posted an update about 2 months ago
view post
Post
1831
Making SmolLM2 reproducible: open-sourcing our training & evaluation toolkit 🛠️ https://github.com/huggingface/smollm/

- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents

Apache 2.0 licensed. V2 pre-training data mix coming soon!

Which other tools should we add next?
thomwolf 
posted an update about 2 months ago
albertvillanova 
posted an update about 2 months ago
view post
Post
1515
🚨 How green is your model? 🌱 Introducing a new feature in the Comparator tool: Environmental Impact for responsible #LLM research!
👉 open-llm-leaderboard/comparator
Now, you can not only compare models by performance, but also by their environmental footprint!

🌍 The Comparator calculates CO₂ emissions during evaluation and shows key model characteristics: evaluation score, number of parameters, architecture, precision, type... 🛠️
Make informed decisions about your model's impact on the planet and join the movement towards greener AI!
thomwolf 
posted an update about 2 months ago
albertvillanova 
posted an update 2 months ago
view post
Post
1561
🚀 New feature of the Comparator of the 🤗 Open LLM Leaderboard: now compare models with their base versions & derivatives (finetunes, adapters, etc.). Perfect for tracking how adjustments affect performance & seeing innovations in action. Dive deeper into the leaderboard!

🛠️ Here's how to use it:
1. Select your model from the leaderboard.
2. Load its model tree.
3. Choose any base & derived models (adapters, finetunes, merges, quantizations) for comparison.
4. Press Load.
See side-by-side performance metrics instantly!

Ready to dive in? 🏆 Try the 🤗 Open LLM Leaderboard Comparator now! See how models stack up against their base versions and derivatives to understand fine-tuning and other adjustments. Easier model analysis for better insights! Check it out here: open-llm-leaderboard/comparator 🌐
asoria 
posted an update 2 months ago
view post
Post
1834
🚀 Exploring Topic Modeling with BERTopic 🤖

When you come across an interesting dataset, you often wonder:
Which topics frequently appear in these documents? 🤔
What is this data really about? 📊

Topic modeling helps answer these questions by identifying recurring themes within a collection of documents. This process enables quick and efficient exploratory data analysis.

I’ve been working on an app that leverages BERTopic, a flexible framework designed for topic modeling. Its modularity makes BERTopic powerful, allowing you to switch components with your preferred algorithms. It also supports handling large datasets efficiently by merging models using the BERTopic.merge_models approach. 🔗

🔍 How do we make this work?
Here’s the stack we’re using:

📂 Data Source ➡️ Hugging Face datasets with DuckDB for retrieval
🧠 Text Embeddings ➡️ Sentence Transformers (all-MiniLM-L6-v2)
⚡ Dimensionality Reduction ➡️ RAPIDS cuML UMAP for GPU-accelerated performance
🔍 Clustering ➡️ RAPIDS cuML HDBSCAN for fast clustering
✂️ Tokenization ➡️ CountVectorizer
🔧 Representation Tuning ➡️ KeyBERTInspired + Hugging Face Inference Client with Meta-Llama-3-8B-Instruct
🌍 Visualization ➡️ Datamapplot library
Check out the space and see how you can quickly generate topics from your dataset: datasets-topics/topics-generator

Powered by @MaartenGr - BERTopic
albertvillanova 
posted an update 3 months ago
view post
Post
3140
🚀 Exciting update! You can now compare multiple models side-by-side with the Hugging Face Open LLM Comparator! 📊

open-llm-leaderboard/comparator

Dive into multi-model evaluations, pinpoint the best model for your needs, and explore insights across top open LLMs all in one place. Ready to level up your model comparison game?
thomwolf 
posted an update 3 months ago
view post
Post
4154
Parents in the 1990: Teach the kids to code
Parents now: Teach the kids to fix the code when it starts walking around 🤖✨
  • 2 replies
·
albertvillanova 
posted an update 3 months ago
view post
Post
1232
🚨 Instruct-tuning impacts models differently across families! Qwen2.5-72B-Instruct excels on IFEval but struggles with MATH-Hard, while Llama-3.1-70B-Instruct avoids MATH performance loss! Why? Can they follow the format in examples? 📊 Compare models: open-llm-leaderboard/comparator