dvilasuero HF staff committed on
Commit 70cd4d1 · 1 Parent(s): 442c8ce

Update README.md

Files changed (1)
  1. README.md +5 -3
README.md CHANGED
@@ -11,15 +11,17 @@ datasets:
 
  # 🚮 🦙 Alpaca GarbageCollector
 
- A cross-lingual SetFit model to **detect bad instructions from Alpaca Datasets** and potentially other instruction-following datasets.
- `GarbageCollector` can greatly speed up the validation of Alpaca Datasets across many languages, flagging examples that need to be fixed or simply discarded.
+ A cross-lingual SetFit model to **detect bad instructions from Alpaca Datasets** and other instruction-following datasets.
+ `GarbageCollector` can greatly speed up the validation of these Datasets across many languages, flagging examples that need to be fixed or simply discarded.
 
+ Data quality is key for LLMs, but open-source LLMs are being built with data of "unknown" quality. This model can help practitioners to find and fix frequent issues (e.g., the model hallucinating stock prices, describing non-existing images, etc.)
+
+ The model has been fine-tuned with 1,000 labeled examples from the AlpacaCleaned dataset labeled with [Argilla](https://www.argilla.io/). It leverages a multilingual sentence transformer `paraphrase-multilingual-mpnet-base-v2`, inspired by the findings from the SetFit paper (Section 6. Multilingual experiments.), where they trained models in English that performed well across languages.
 
  <div style="text-align:center">
  <img src="https://huggingface.co/argilla/alpaca-hallucihunter-multilingual/resolve/main/front-image.png" alt="Alpaca Cleaned"">
  </div>
 
- The model has been fine-tuned with 1,000 labeled examples from the AlpacaCleaned dataset. It leverages a multilingual sentence transformer `paraphrase-multilingual-mpnet-base-v2`, inspired by the findings from the SetFit paper (Section 6. Multilingual experiments.), where they trained models in English that performed well across languages.
 
  It's a binary classifier with two labels:
 