Commit · 70cd4d1
1 Parent(s): 442c8ce
Update README.md
README.md
CHANGED
@@ -11,15 +11,17 @@ datasets:
 
 # 🚮 🦙 Alpaca GarbageCollector
 
-A cross-lingual SetFit model to **detect bad instructions from Alpaca Datasets** and
-`GarbageCollector` can greatly speed up the validation of
+A cross-lingual SetFit model to **detect bad instructions from Alpaca Datasets** and other instruction-following datasets.
+`GarbageCollector` can greatly speed up the validation of these Datasets across many languages, flagging examples that need to be fixed or simply discarded.
 
+Data quality is key for LLMs, but open-source LLMs are being built with data of "unknown" quality. This model can help practitioners to find and fix frequent issues (e.g., the model hallucinating stock prices, describing non-existing images, etc.)
+
+The model has been fine-tuned with 1,000 labeled examples from the AlpacaCleaned dataset labeled with [Argilla](https://www.argilla.io/). It leverages a multilingual sentence transformer `paraphrase-multilingual-mpnet-base-v2`, inspired by the findings from the SetFit paper (Section 6. Multilingual experiments.), where they trained models in English that performed well across languages.
 
 <div style="text-align:center">
 <img src="https://huggingface.co/argilla/alpaca-hallucihunter-multilingual/resolve/main/front-image.png" alt="Alpaca Cleaned"">
 </div>
 
-The model has been fine-tuned with 1,000 labeled examples from the AlpacaCleaned dataset. It leverages a multilingual sentence transformer `paraphrase-multilingual-mpnet-base-v2`, inspired by the findings from the SetFit paper (Section 6. Multilingual experiments.), where they trained models in English that performed well across languages.
 
 It's a binary classifier with two labels:
 