Thomas Wolf PRO

thomwolf

AI & ML interests

NLP and open-source :-)

Organizations

Hugging Face, Natural Language Processing with Transformers, BigScience Workshop, Flax Community, datablations, Training Transformers Together, BigScience Data, Evaluation datasets, HuggingFaceBR4, Godot Engine Demos, OpenAssistant, Evaluation on the Hub, HuggingFaceM4, Simulation Environments Tests and Builds, (De)fusing, HuggingFaceGECLM, CodeParrot, BigCode, Hugging Face H4, CV as NLP, Explorer of Simulate alpha, BigCode Data, Hugging Face Extreme-Scale, Hugging Face H4 Community, GAIA, Hugging Face TB Research, Hugging Face Smol Cluster, Open LLM Leaderboard, TTS Eval (OLD), the circle of truth - war scene, Nanotron Research, LeRobot, Journalists on Hugging Face, NewTechKids, MLX Community, Hugging Face Assignments, HuggingFaceFW, TTS AGI, Social Post Explorers, dora-rs, HuggingFaceEval, HuggingFaceFW-Dev, Hugging Face Discord Community, DataComp, Data Agents, Hugging Face FineVideo, HuggingFace Science Team, Art, smol-explorers, Nerdy Face, Hugging Face Science, LeMaterial, open/ acc

thomwolf's activity

reacted to julien-c's post with 😎🤝👍🤗❤️🔥 14 days ago
After some heated discussion 🔥, we clarify our intent re. storage limits on the Hub

TL;DR:
- public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)

docs: https://huggingface.co/docs/hub/storage-limits

We optimize our infrastructure continuously to scale our storage for the coming years of growth in machine learning, to the benefit of the community 🔥

cc: @reach-vb @pierric @victor and the HF team
·
posted an update 16 days ago
We are proud to announce HuggingFaceFW/fineweb-2: A sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️ languages.

We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

In the meantime, come ask us questions in our discussion space: HuggingFaceFW/discussion

H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi
  • 2 replies
·
reacted to garrethlee's post with ❤️🔥 19 days ago
The latest o1 model from OpenAI is still unable to answer 9.11 > 9.9 correctly 🤔

A possible explanation? Tokenization - and our latest work investigates how it affects a model's ability to do math!

In this blog post, we discuss:
🔢 The different ways numbers are tokenized in modern LLMs
🧪 Our detailed approach to comparing these various methods
🥪 How we got a free boost in arithmetic performance by adding a few lines of code to the base Llama 3 tokenizer
👑 and a definitive, best tokenization method for math in LLMs!

Check out our work here: huggingface/number-tokenization-blog
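
To make "different ways numbers are tokenized" concrete, here is a minimal sketch of three common splitting strategies applied to a string of digits. The function names are hypothetical and the code is purely illustrative: it does not reproduce any real tokenizer, nor the change to the Llama 3 tokenizer mentioned in the post.

```python
def split_digits(number: str) -> list[str]:
    """One token per character, e.g. '9.11' -> ['9', '.', '1', '1']."""
    return list(number)


def chunk_left_to_right(digits: str, size: int = 3) -> list[str]:
    """Group digits into fixed-size chunks starting from the left."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def chunk_right_to_left(digits: str, size: int = 3) -> list[str]:
    """Group digits into fixed-size chunks aligned to the right,
    the way thousands separators group digits."""
    head = len(digits) % size  # length of the short leading chunk, if any
    chunks = [digits[:head]] if head else []
    chunks += [digits[i:i + size] for i in range(head, len(digits), size)]
    return chunks


if __name__ == "__main__":
    print(split_digits("9.11"))
    print(chunk_left_to_right("1234567"))
    print(chunk_right_to_left("1234567"))
```

The two chunking functions differ only in alignment: the same seven-digit number is split as 123|456|7 from the left and as 1|234|567 from the right, so the model sees different tokens for the same value.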
  • 2 replies
·
posted an update 19 days ago
posted an update 21 days ago
reacted to m-ric's post with 🚀🔥 22 days ago
🤖 𝗔𝗱𝗼𝗯𝗲'𝘀 𝗰𝗼𝗱𝗲-𝗴𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗻𝗴 𝗮𝗴𝗲𝗻𝘁 𝗿𝗲𝗮𝗰𝗵𝗲𝘀 𝘁𝗵𝗲 𝘁𝗼𝗽 𝗼𝗳 𝗚𝗔𝗜𝗔 𝗹𝗲𝗮𝗱𝗲𝗿𝗯𝗼𝗮𝗿𝗱 - and their paper cites my work!

💡 Reminder: In short, agentic systems are a vehicle in which you put your LLM to give it access to the outside world.

➡️ The team of researchers at Adobe started from the idea that current agentic systems lack the ability to define their own tools. So they decided to make an agent that writes actions as code, thus allowing it to write Python functions that can be re-used later as tools!

Here's what the LLM generations can look like with the proper prompt:

Thought: I need to access the excel file using a different method.
Action:
def access_excel_file(file_path):
	... # rest of the code (the agent does write it, but I don't have room in this post)
	return rows


Then your system executes this and appends the observation to the agent's memory.
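
As a rough illustration of that execute-then-remember step, here is a minimal sketch. The name run_code_action and the shape of the memory entries are hypothetical, not taken from the paper or from transformers.agents, and calling exec on model-written code is unsafe outside a sandbox.

```python
import contextlib
import io


def run_code_action(code: str, tools: dict, memory: list) -> None:
    """Execute one code action and append the resulting observation to memory.

    `tools` is used as the execution namespace, so any function the action
    defines stays available to later actions (hypothetical design choice).
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, tools)  # unsafe for untrusted code: sandbox this in practice
        observation = buffer.getvalue().strip() or "(no output)"
    except Exception as error:
        # Errors are recorded as observations so the agent can react to them.
        observation = f"Error: {error!r}"
    memory.append({"action": code, "observation": observation})


if __name__ == "__main__":
    tools, memory = {}, []
    run_code_action("def double(x):\n    return 2 * x\nprint(double(21))", tools, memory)
    run_code_action("print(double(4))", tools, memory)  # reuses the earlier function
    for step in memory:
        print(step["observation"])
```

Keeping the namespace between calls is what lets a function written in one step act as a tool in the next.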

Why is this code formulation better than classical tool use formulation as JSON? The paper explains:

"Most existing work uses text or JSON as the representation of actions, which significantly lacks the two criteria mentioned earlier: generality and composability. In contrast, DynaSaur can utilize available actions or create new ones if necessary, using code as a unified representation. In principle, acting with code enables agents to solve any Turing-complete problem."

The idea of using code is not new: in fact, we do it in transformers.agents (thus the citation that I got). Their implementation adds further refinements, like using RAG to retrieve relevant functions before generating an action, which increases performance further.

And they observe that code agents perform much better, reaching the top of the GAIA leaderboard! 🥇

Go take a look, it's really clear and informative!

Paper added to my agents collection 👉 m-ric/agents-65ba776fbd9e29f771c07d4e
reacted to merve's post with 🔥👍 27 days ago
The authors of ColPali trained a retrieval model based on SmolVLM 🤠 vidore/colsmolvlm-alpha
TL;DR:

- ColSmolVLM performs better than ColPali and DSE-Qwen2 on all English tasks

- ColSmolVLM is more memory efficient than ColQwen2 💗
reacted to davanstrien's post with ❤️ 28 days ago
First dataset for the new Hugging Face Bluesky community organisation: bluesky-community/one-million-bluesky-posts 🦋

📊 1M public posts from Bluesky's firehose API
🔍 Includes text, metadata, and language predictions
🔬 Perfect to experiment with using ML for Bluesky 🤗

Excited to see people build more open tools for a more open social media platform!
reacted to ZennyKenny's post with 👍 30 days ago
I've joined the Bluesky community. Interested to see what decentralized social media looks like in action: https://bsky.app/profile/kghamilton.bsky.social

Looking forward to following other AI builders, tech enthusiasts, goth doomscrollers, and ironic meme creators.
reacted to as-cle-bert's post with 🔥 30 days ago
Hi HuggingFacers!🤗
I'm thrilled to introduce my latest project: 𝗦𝗲𝗻𝗧𝗿𝗘𝘃 (𝗦𝗲𝗻tence 𝗧𝗿ansformers 𝗘𝘃aluator), a Python package that offers simple, customizable evaluation of text retrieval accuracy and time performance for Sentence Transformers-compatible text embedders on PDF data!📊

Learn more in my LinkedIn post: https://www.linkedin.com/posts/astra-clelia-bertelli-583904297_python-embedders-semanticsearch-activity-7266754133557190656-j1e3

And on the GitHub repo: https://github.com/AstraBert/SenTrEv

Have fun!🍕
posted an update about 1 month ago
replied to nyuuzyou's post about 1 month ago