argilla/alpaca-garbage-collector-multilingual

dvilasuero HF staff committed on Apr 4, 2023

Commit 70cd4d1 · 1 Parent(s): 442c8ce

Update README.md

Files changed (1)

README.md +5 -3

README.md CHANGED

@@ -11,15 +11,17 @@ datasets:
 # 🚮 🦙 Alpaca GarbageCollector
-A cross-lingual SetFit model to **detect bad instructions from Alpaca Datasets** and potentially other instruction-following datasets.
-`GarbageCollector` can greatly speed up the validation of Alpaca Datasets across many languages, flagging examples that need to be fixed or simply discarded.
 <div style="text-align:center">
     <img src="https://huggingface.co/argilla/alpaca-hallucihunter-multilingual/resolve/main/front-image.png" alt="Alpaca Cleaned">
 </div>
-The model has been fine-tuned with 1,000 labeled examples from the AlpacaCleaned dataset. It leverages a multilingual sentence transformer `paraphrase-multilingual-mpnet-base-v2`, inspired by the findings from the SetFit paper (Section 6. Multilingual experiments.), where they trained models in English that performed well across languages.
 It's a binary classifier with two labels:

 # 🚮 🦙 Alpaca GarbageCollector
+A cross-lingual SetFit model to **detect bad instructions from Alpaca Datasets** and other instruction-following datasets.
+`GarbageCollector` can greatly speed up the validation of these Datasets across many languages, flagging examples that need to be fixed or simply discarded.
+Data quality is key for LLMs, but open-source LLMs are being built with data of "unknown" quality. This model can help practitioners find and fix frequent issues (e.g., the model hallucinating stock prices, describing nonexistent images, etc.).
+The model has been fine-tuned with 1,000 labeled examples from the AlpacaCleaned dataset labeled with [Argilla](https://www.argilla.io/). It leverages a multilingual sentence transformer `paraphrase-multilingual-mpnet-base-v2`, inspired by the findings from the SetFit paper (Section 6. Multilingual experiments.), where they trained models in English that performed well across languages.
 <div style="text-align:center">
     <img src="https://huggingface.co/argilla/alpaca-hallucihunter-multilingual/resolve/main/front-image.png" alt="Alpaca Cleaned">
 </div>
 It's a binary classifier with two labels: