
Libre Euro Lingua-Alliance · community · AI & ML interests: NLP · LEL-A's activity
My latest project is the outcome of 2+ years of working with TPUs from the amazing TPU Research Cloud (TRC) program, training encoder-only LMs with the TensorFlow Model Garden library.
👉 Link: https://github.com/stefan-it/model-garden-lms
An overview of some features:
- Cheatsheet for setting up a TPU VM Pod (with all necessary dependencies) to pretrain LMs with TF Model Garden
- Conversion scripts that convert TF Model Garden weights to Hugging Face Transformers-compatible models
- Supported architectures include BERT, BERT with Token Dropping and TEAMS
I also released BERT-based models pretrained on the great Hugging Face FineWeb and FineWeb-Edu datasets (10BT subset). With more to come!
👉 Model Hub Link: https://huggingface.co/model-garden-lms
If you find these resources useful, please give them a like!
Made from Bavarian Oberland with ❤️ and 🥨.
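At its core, a weight-conversion script of this kind is a renaming exercise: walk the TF checkpoint variables and map each name onto the corresponding Transformers parameter. A minimal sketch of the idea — the variable names and rules below are illustrative, not the actual TF Model Garden or Transformers layouts:

```python
import re

# Illustrative rename rules (regex -> replacement); real checkpoints need
# many more entries and also transpose some kernel matrices.
RULES = [
    (r"^encoder/layer_(\d+)/self_attention/query/kernel$",
     r"bert.encoder.layer.\1.attention.self.query.weight"),
    (r"^embeddings/word_embeddings$",
     r"bert.embeddings.word_embeddings.weight"),
]

def convert_name(tf_name: str) -> str:
    """Map a TF-style variable name to a Transformers-style parameter name."""
    for pattern, repl in RULES:
        if re.match(pattern, tf_name):
            return re.sub(pattern, repl, tf_name)
    raise KeyError(f"no conversion rule for {tf_name!r}")

print(convert_name("encoder/layer_3/self_attention/query/kernel"))
```

The real scripts additionally copy each tensor into the target model and verify shapes, but the name mapping is where the two libraries' conventions meet.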

dvilasuero authored a paper · 3 months ago

dvilasuero posted an update · 3 months ago
🌐 Announcing Global-MMLU: an improved, open MMLU dataset with evaluation coverage across 42 languages, built with Argilla and the Hugging Face community.
Global-MMLU is the result of months of work with the goal of advancing Multilingual LLM evaluation. It's been an amazing open science effort with collaborators from Cohere For AI, Mila - Quebec Artificial Intelligence Institute, EPFL, Massachusetts Institute of Technology, AI Singapore, National University of Singapore, KAIST, Instituto Superior Técnico, Carnegie Mellon University, CONICET, and University of Buenos Aires.
🏷️ 200+ contributors used Argilla to flag MMLU questions where regional, dialect, or cultural knowledge was required to answer correctly. 85% of the questions required Western-centric knowledge!
Thanks to this annotation process, the open dataset contains two subsets:
1. 🗽 Culturally Agnostic: no specific regional, cultural knowledge is required.
2. ⚖️ Culturally Sensitive: requires dialect, cultural knowledge or geographic knowledge to answer correctly.
Moreover, we provide high-quality translations for 25 of the 42 languages, thanks again to the community and professional annotators leveraging Argilla on the Hub.
I hope this will ensure a better understanding of the limitations and challenges for making open AI useful for many languages.
Dataset: CohereForAI/Global-MMLU
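The annotation process behind the two subsets boils down to aggregating contributor votes per question; a toy sketch of majority-vote aggregation (the labels and field names are illustrative, not the dataset's actual schema):

```python
from collections import Counter

# Toy annotations: several contributors vote on whether each question
# requires cultural/regional knowledge. Labels are illustrative.
votes = [
    ("q1", "culturally_sensitive"),
    ("q1", "culturally_sensitive"),
    ("q1", "culturally_agnostic"),
    ("q2", "culturally_agnostic"),
    ("q2", "culturally_agnostic"),
]

def majority_label(votes, question_id):
    """Return the most common label a question received across annotators."""
    counts = Counter(label for qid, label in votes if qid == question_id)
    label, _ = counts.most_common(1)[0]
    return label

print(majority_label(votes, "q1"))
```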

dvilasuero posted an update · 3 months ago
@Jesse-marqo and the Marqo team are killing it on the Hub: top embedding models and datasets!
Here's how to start using their new evaluation dataset for curation and labelling:
1. Deploy Argilla on Spaces: https://huggingface.co/new-space?template=argilla%2Fargilla-template-space
2. Load Marqo/amazon-products-eval with the UI wizard.
3. Start curating!

dvilasuero posted an update · 4 months ago
Build datasets for AI on the Hugging Face Hub—10x easier than ever!
Today, I'm excited to share our biggest feature since we joined Hugging Face.
Here’s how it works:
1. Pick a dataset—upload your own or choose from 240K open datasets.
2. Paste the Hub dataset ID into Argilla and set up your labeling interface.
3. Share the URL with your team or the whole community!
And the best part? It’s:
- No code – no Python needed
- Integrated – all within the Hub
- Scalable – from solo labeling to 100s of contributors
I am incredibly proud of the team for shipping this after weeks of work and many quick iterations.
Let's make this sentence obsolete: "Everyone wants to do the model work, not the data work."
Read, share, and like the HF blog post:
https://huggingface.co/blog/argilla-ui-hub

dvilasuero posted an update · 4 months ago
Big news! You can now build strong ML models without days of human labelling
You simply:
- Define your dataset, including annotation guidelines, labels and fields
- Optionally label some records manually.
- Use an LLM to auto label your data with a human (you? your team?) in the loop!
Get started with this blog post:
https://huggingface.co/blog/sdiazlor/custom-text-classifier-ai-human-feedback
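The loop described above can be sketched in a few lines, with a stub standing in for the LLM call. The blog post uses Argilla and a real model; everything below — the stub classifier, the threshold, the queue — is illustrative only:

```python
def stub_llm_classifier(text):
    """Stand-in for an LLM labelling call: returns (label, confidence)."""
    if "refund" in text.lower():
        return "billing", 0.95
    return "other", 0.55

def auto_label(records, threshold=0.8):
    """Auto-accept confident predictions; queue the rest for human review."""
    accepted, review_queue = [], []
    for text in records:
        label, conf = stub_llm_classifier(text)
        if conf >= threshold:
            accepted.append((text, label))
        else:
            review_queue.append((text, label))  # a human confirms or corrects
    return accepted, review_queue

accepted, queue = auto_label(["I want a refund", "hello there"])
print(len(accepted), len(queue))
```

The human-in-the-loop part is the review queue: low-confidence records get a person's attention, and the corrections can feed back into the guidelines or a trained classifier.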

dvilasuero posted an update · 5 months ago
Explore FinePersonas visually with Argilla and black-forest-labs/FLUX.1-schnell
Excited to share this space where the community can explore a tiny subset of FinePersonas: argilla/finepersonas
Dataset built with distilabel and free serverless endpoints
This is just a first step towards more interesting experiments with FinePersonas. For example, can we use it to assess biases in text2image models?
If you have ideas I'd love to hear them in the comments!

dvilasuero posted an update · 9 months ago
Today is a huge day in Argilla’s history. We couldn’t be more excited to share this with the community: we’re joining Hugging Face!
We’re embracing a larger mission, becoming part of a brilliant and kind team and a shared vision about the future of AI.
Over the past year, we’ve been collaborating with Hugging Face on countless projects: becoming a launch partner for Docker Spaces, empowering the community to clean Alpaca translations into Spanish and other languages, launching argilla/notus-7b-v1 building on Zephyr’s learnings, the Data is Better Together initiative with hundreds of community contributors, and releasing argilla/OpenHermesPreferences, one of the largest open preference-tuning datasets.
After more than 2,000 Slack messages and over 60 people collaborating for over a year, it already felt like we were part of the same team, pushing in the same direction. After a week of the smoothest transition you can imagine, we’re now the same team.
To those of you who’ve been following us, this won’t be a huge surprise, but it will be a big deal in the coming months. This acquisition means we’ll double down on empowering the community to build and collaborate on high quality datasets, we’ll bring full support for multimodal datasets, and we’ll be in a better place to collaborate with the Open Source AI community. For enterprises, this means that the Enterprise Hub will unlock highly requested features like single sign-on and integration with Inference Endpoints.
As a founder, I am proud of the Argilla team. We're now part of something bigger and a larger team but with the same values, culture, and goals. Grateful to have shared this journey with my beloved co-founders Paco and Amélie.
Finally, huge thanks to the Chief Llama Officer @osanseviero for sparking this and being such a great partner during the acquisition process.
Would love to answer any questions you have so feel free to add them below!
mLLM - first release:
orca_dpo_pairs by Intel (translated into 7 languages)
ARABIC CHINESE FRENCH GERMAN RUSSIAN SPANISH TURKISH
Upcoming:
- more datasets
- cleaning steps
- a blogpost
- stay updated at https://hf.co/multilingual
multilingual/orca_dpo_pairs

philschmid posted an update · 11 months ago
New state-of-the-art open LLM! 🚀 Databricks just released DBRX, a 132B MoE trained on 12T tokens, claiming to surpass OpenAI GPT-3.5 and to be competitive with Google Gemini 1.0 Pro. 🤯
TL;DR
🧮 132B MoE with 16 experts, 4 active during generation
🪟 32K-token context window
📈 Outperforms open LLMs on common benchmarks, including MMLU
🚀 Up to 2x faster inference than Llama 2 70B
💻 Trained on 12T tokens
🔡 Uses the GPT-4 tokenizer
📜 Custom license, commercially usable
Collection: databricks/dbrx-6601c0852a0cdd3c59f71962
Demo: https://huggingface.co/spaces/databricks/dbrx-instruct
Kudos to the Team at Databricks and MosaicML for this strong release in the open community! 🤗
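The "16 experts, 4 active" detail means a learned router scores all experts per token, keeps only the top 4, and mixes their outputs. A toy illustration of top-k routing in plain Python — not DBRX's actual implementation, just the mechanism:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_output(router_logits, expert_fns, token, k=4):
    """Route a token through the top-k experts, mixing by renormalized weight."""
    probs = softmax(router_logits)
    top_k = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top_k)  # renormalize the selected weights
    return sum(probs[i] / total * expert_fns[i](token) for i in top_k)

# 16 toy "experts" that just scale their scalar input.
experts = [lambda x, s=s: s * x for s in range(16)]
logits = [0.0] * 16
logits[2] = logits[5] = logits[7] = logits[11] = 3.0  # router favors 4 experts
print(moe_output(logits, experts, token=1.0))
```

Because only 4 of 16 experts run per token, the active parameter count (and hence inference cost) is a fraction of the full 132B, which is how the model claims fast inference.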

dvilasuero posted an update · 12 months ago
🔥 Community and Data Quality Matter More for Alignment
A recipe to replicate SPIN (Self-Play Fine-Tuning) with 30x less data:
🗣️ 50K samples vs 1.8K prompts curated by the 350+ amazing DIBT contributors.
⚗️ Distillation of Mistral Large instead of OpenAI
🙌 Open data & code with ⚗️distilabel
SPIN Paper:
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models (2401.01335)
SPIN DIBT Collection with datasets and models:
argilla/dibt-prompt-collective-spin-65ef59062518776024395fc3
Repo:
https://github.com/argilla-io/distilabel-spin-dibt
Joint work with the amazing DIBT community 👇
@aashish1904 , @flozi00 , @sayhan , @munish0838 , @0-hero , @dvilasuero , @eren23 , @davanstrien , @ahnz , @BlackKakapo , @kitano-o , @mmhamdy , @sdiazlor , @Stopwolf , @gabrielmbmb , @tculler91 , @plaguss , @ignacioct , @Hugi-R , @davidberenstein1957 , @Korla , @alvarobartt , @Hugs4Llamas , @Sumandora , @nataliaElv , @jfcalvo , @Averill , @steventrouble , @vasilis , @aeros93 , @kayyshf , @thomasgauthier , @jeromebas , @Ameeeee , @ayoubelmhamdi , @TuringsSolutions , @efels , @Haleyok , @abrazador , @emessy , @Nindaleth , @burtenshaw , @vicgalle , @CortexPE , @casey-martin , @Leire-aguirre-eguiluz , @mrfakename , @Portias600kNeurons , @nathaliepett , @Filippo
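The core of SPIN-style data construction is simple: for each curated prompt, pair a reference response (chosen) against the current model's own generation (rejected), DPO-tune, and repeat with the improved model. A toy sketch with a stub generator — illustrative only, not the distilabel-spin-dibt code:

```python
def build_spin_pairs(prompts, reference_responses, generate):
    """One SPIN iteration's preference pairs: reference beats self-generation."""
    pairs = []
    for prompt, reference in zip(prompts, reference_responses):
        self_play = generate(prompt)  # current model plays against itself
        pairs.append({"prompt": prompt, "chosen": reference, "rejected": self_play})
    return pairs

# Stub standing in for the current model's generation.
current_model = lambda p: f"draft answer to: {p}"

pairs = build_spin_pairs(
    ["What is DPO?"],
    ["DPO optimizes preferences directly."],
    current_model,
)
print(pairs[0]["rejected"])
```

In the recipe above, the reference responses come from Mistral Large distillation rather than OpenAI, and the 1.8K prompts come from DIBT curation instead of 50K raw samples.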

dvilasuero posted an update · 12 months ago
🚀🧙🏼♂️Introducing OpenHermesPreferences: the largest open AI feedback dataset for RLHF & DPO
> Using LLMs to improve other LLMs, at scale!
Built in collaboration with the H4 Hugging Face team, it's a 1M preferences dataset on top of the amazing @teknium 's dataset.
Dataset:
argilla/OpenHermesPreferences
The dataset is another example of open collaboration:
> The H4 team created responses with Mixtral using llm-swarm
> Argilla created responses with NousResearch Hermes-2-Yi-34B using distilabel
> The H4 ranked these responses + original response with PairRM from AllenAI, University of Southern California, Zhejiang University ( @yuchenlin @DongfuTingle and colleagues)
We hope this dataset will help the community's research efforts towards understanding the role of AI feedback for LLM alignment.
We're particularly excited about the ability of filtering specific subsets to improve LLM skills like math or reasoning.
Here's how easy it is to filter by subset:
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/OpenHermesPreferences", split="train")
# Get the categories of the source dataset
# ['airoboros2.2', 'CamelAI', 'caseus_custom', ...]
sources = ds.unique("source")
# Filter for a subset
ds_filtered = ds.filter(lambda x: x["source"] in ["metamath", "EvolInstruct_70k"], num_proc=6)
As usual, all the scripts to reproduce this work are available and open to the community!
argilla/OpenHermesPreferences
So fun collab between @vwxyzjn, @plaguss, @kashif, @philschmid & @lewtun!
Open Source AI FTW!

dvilasuero posted an update · about 1 year ago
🤗 Data is better together!
Data is essential for training good AI systems. We believe that the amazing community built around open machine learning can also work on developing amazing datasets together.
To explore how this can be done, Argilla and Hugging Face are thrilled to announce a collaborative project where we’re asking Hugging Face community members to build a dataset consisting of LLM prompts collectively.
What are we doing?
Using an instance of Argilla — a powerful open-source data collaboration tool — hosted on the Hugging Face Hub, we are collecting ratings of prompts based on their quality.
How Can You Contribute?
It’s super simple to start contributing:
1. Sign up if you don’t have a Hugging Face account
2. Go to this Argilla Space and sign in: https://huggingface.co/spaces/DIBT/prompt-collective
3. Read the guidelines and start rating prompts!
You can also join the #data-is-better-together channel in the Hugging Face Discord.
Finally, to track the community's progress we'll be updating this Gradio dashboard:
https://huggingface.co/spaces/DIBT/prompt-collective-dashboard

dvilasuero posted an update · about 1 year ago
🚀 The Open Source AI community needs more open datasets for improving Open LLMs. Excited to share our new open dataset for boosting chat models:
🎉 Welcome Distilabel Capybara DPO, a multi-turn, high-quality preference dataset.
argilla/distilabel-capybara-dpo-7k-binarized
Why?
The best closed chat models are built on top of multi-turn dialogue preference data, and the OSS community lacks these datasets. This dataset is the first in a series aimed at closing that gap.
Is this dataset useful?
To test this dataset, we've built our virtual launching partner:
🎉 Welcome CapybaraHermes, a preference-tuned OpenHermes with improved second-turn capabilities on MT-Bench
argilla/CapybaraHermes-2.5-Mistral-7B
As usual, models are the least important to us. We like to focus on the data. Our mission is to build and share high-quality datasets, sharing our methods in the open so the community can improve upon them.
That's why, we took some time to describe the full methodology on the dataset card, check it out and give us feedback! Data and methods are never perfect!
Finally, this is just a preview version, and we'd love to collaborate with you to add more benchmarking results, what hyperparams work for DPO'ing models, what mix of datasets, etc.
Expect some more datasets in the coming weeks. Let's build the best data for AI, together.
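What makes a multi-turn preference record different from single-turn DPO data is that chosen and rejected share the same conversation history and diverge only at the final turn. A toy record shape — illustrative, not the dataset's exact schema:

```python
# Shared multi-turn history; chosen/rejected diverge on the last turn only.
record = {
    "history": [
        {"role": "user", "content": "Explain DPO briefly."},
        {"role": "assistant", "content": "DPO tunes on preference pairs."},
        {"role": "user", "content": "How is it different from RLHF?"},
    ],
    "chosen": {"role": "assistant", "content": "It skips the reward model."},
    "rejected": {"role": "assistant", "content": "They are the same."},
}

def to_dpo_pair(record):
    """Flatten into the (chosen, rejected) conversations a DPO trainer expects."""
    chosen = record["history"] + [record["chosen"]]
    rejected = record["history"] + [record["rejected"]]
    return chosen, rejected

chosen, rejected = to_dpo_pair(record)
print(len(chosen), len(rejected))
```

Training on pairs like these is what targets second-turn quality: the model learns preferences conditioned on real prior dialogue, not just on a lone prompt.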

philschmid posted an update · about 1 year ago
What's the best way to fine-tune open LLMs in 2024? Look no further! 👀 I am excited to share “How to Fine-Tune LLMs in 2024 with Hugging Face” using the latest research techniques, including Flash Attention, Q-LoRA, OpenAI dataset formats (messages), ChatML, Packing, all built with Hugging Face TRL. 🚀
It is created for consumer-size GPUs (24GB) covering the full end-to-end lifecycle with:
💡Define and understand use cases for fine-tuning
🧑🏻💻 Setup of the development environment
🧮 Create and prepare dataset (OpenAI format)
🏋️♀️ Fine-tune LLM using TRL and the SFTTrainer
🥇 Test and evaluate the LLM
🚀 Deploy for production with TGI
👉 https://www.philschmid.de/fine-tune-llms-in-2024-with-trl
Coming soon: Advanced Guides for multi-GPU/multi-Node full fine-tuning and alignment using DPO & KTO. 🔜
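Step 3 above — preparing the dataset in the OpenAI messages format — amounts to converting each instruction/response pair into a list of role-tagged messages. A minimal sketch; the system prompt and field names are illustrative, not the guide's exact code:

```python
def to_messages(sample, system_prompt="You are a helpful assistant."):
    """Convert an instruction/response pair to the OpenAI messages format."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": sample["instruction"]},
            {"role": "assistant", "content": sample["response"]},
        ]
    }

example = to_messages({"instruction": "Name a prime number.", "response": "7"})
print(example["messages"][1]["content"])
```

Once every row has this shape, a chat template such as ChatML can render the messages into training text, and packing can concatenate the rendered samples for efficient batching.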

dvilasuero posted an update · about 1 year ago
🔥 Less is more for DPO, high quality matters!
📢 Dropping our first open dataset and LLM of the year:
💾Meet distilabel Orca Pairs DPO, an improved version of the now famous dataset from Intel:
argilla/distilabel-intel-orca-dpo-pairs
🏛️ And a new OpenHermes fine-tune outperforming baselines with 54% fewer DPO pairs:
https://huggingface.co/argilla/distilabeled-Hermes-2.5-Mistral-7B
You can use this new dataset for your DPO tuning, just like this:
from datasets import load_dataset

# Instead of this:
# dataset = load_dataset("Intel/orca_dpo_pairs", split="train")
# use this:
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")
dataset = dataset.filter(
    lambda r: r["status"] != "tie"
    and r["chosen_score"] >= 8
    and not r["in_gsm8k_train"]
)
This will reduce the size of the original by 54% while giving you better quality preferences!
What should we build next?

dvilasuero posted an update · about 1 year ago
👋 Hi there!
This is my very first post.
I'll use it to share some old news: a math preference dataset for DPO!
I created this dataset some time ago while we were developing distilabel (https://github.com/argilla-io/distilabel).
Some days ago we found out people are actually using it! So I'll use this post to explain how I built it in case it's useful for the community.
1. I used distilabel's SelfInstruct-inspired task to generate instructions about different math topics. I curated the instructions with Argilla (on Spaces!).
2. Then I used a distilabel Pipeline to build a preference dataset, using GPT-3.5 as generator and GPT-4 as labeller. If I recall correctly, I used our JudgeLM implementation (see https://distilabel.argilla.io/latest/technical-reference/tasks/#judgelmtask)
(see the screenshot with the dataset in the Argilla UI)
3. Then I just binarized into chosen, rejected pairs and voilà:
argilla/distilabel-math-preference-dpo
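Step 3's binarization can be sketched as picking the highest- and lowest-rated response per prompt — a toy version; the real distilabel output carries more fields (ratings, rationales, model names):

```python
def binarize(prompt, rated_responses):
    """Turn a list of (response, rating) into one chosen/rejected pair."""
    ranked = sorted(rated_responses, key=lambda pair: pair[1], reverse=True)
    return {
        "prompt": prompt,
        "chosen": ranked[0][0],     # highest-rated response
        "rejected": ranked[-1][0],  # lowest-rated response
    }

row = binarize("What is 2+2?", [("4", 10.0), ("5", 2.0), ("four", 7.5)])
print(row["chosen"], row["rejected"])
```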
The funny thing is that I used this to do a second DPO run over Notus-7B. I hoped to see an improvement in math/reasoning skills, but it actually improved in STEM and Humanities and did worse on Math 🤣.
In conclusion, this dataset was only a quick experiment. I'm happy to see the community found it useful. Data for DPO and fine-tuning are still a mystery; let's unveil these mysteries in 2024 together!
Follow me for the most exciting datasets for LLMs (and maybe some great, small, efficient models). I plan to announce all Argilla open-source work here!