---
language: de
widget:
- text: "Guten morgen, meine Liebe"
example_title: "NOT TOXIC 1"
- text: "Ich scheiß drauf."
example_title: "TOXIC 1"
- text: "Ich liebe dich"
example_title: "NOT TOXIC 2"
- text: "Ich hab die Schnauze voll von diesen Irren."
example_title: "TOXIC 2"
- text: "Ich wünsche Ihnen einen schönen Tag!"
example_title: "NOT TOXIC 3"
- text: "Nigger"
example_title: "TOXIC 3"
- text: "Du bist schon wieder zu spät!"
example_title: "NOT TOXIC 4"
- text: "Beweg deinen AArschhh hier rüber"
example_title: "TOXIC 4"
license: other
---
## Description
NB: this model is an improved version of [EIStakovskii/german_toxicity_classifier_plus](https://huggingface.co/EIStakovskii/german_toxicity_classifier_plus).
For the training source code and data, see [the GitHub repository](https://github.com/eistakovskii/NLP_projects/tree/main/TEXT_CLASSIFICATION).
The model was trained for toxicity classification of German text.
It was fine-tuned from [dbmdz/bert-base-german-cased](https://huggingface.co/dbmdz/bert-base-german-cased).
To use the model:
```python
from transformers import pipeline

# Load the toxicity classifier from the Hugging Face Hub.
classifier = pipeline("text-classification", model="EIStakovskii/german_toxicity_classifier_plus_v2")
print(classifier("Verpiss dich von hier"))
```
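The pipeline returns a list of dictionaries with a `label` and a `score`. Below is a minimal sketch of batch classification (the exact label strings are not shown here; check the model's `id2label` mapping for the actual names):

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="EIStakovskii/german_toxicity_classifier_plus_v2",
)

# The pipeline also accepts a list of sentences and returns
# one {'label': ..., 'score': ...} dictionary per input.
sentences = [
    "Ich wünsche Ihnen einen schönen Tag!",
    "Verpiss dich von hier",
]
for sentence, result in zip(sentences, classifier(sentences)):
    print(f"{sentence!r} -> {result['label']} ({result['score']:.3f})")
```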
## Metrics (at validation)
epoch|step|eval_accuracy|eval_f1|eval_loss
-|-|-|-|-
0.8|1200|0.9132|0.9114|0.2414
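Validation metrics of this kind are typically produced by a `compute_metrics` callback passed to the `transformers` `Trainer`. The actual training code is in the GitHub repository linked above; a minimal sketch, assuming binary labels with 1 = toxic, could look like this:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # The Trainer passes (logits, labels) at evaluation time.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions),  # F1 on the positive (toxic) class
    }
```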
## Comparison against Perspective
This model was compared against Google's [Perspective API](https://developers.perspectiveapi.com/s/?language=en_US), which also detects toxicity.
Both models were tested on two datasets: one of [200 sentences](https://github.com/eistakovskii/NLP_projects/blob/main/TEXT_CLASSIFICATION/data/Toxicity_Classifiers/DE_FR/test/test_de_200.csv) and one of [400 sentences](https://github.com/eistakovskii/NLP_projects/blob/main/TEXT_CLASSIFICATION/data/Toxicity_Classifiers/DE_FR/test/test_de_400.csv).
The first (arguably harder) dataset was collected from sentences in the [Jigsaw](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/data) and [DeTox](https://github.com/hdaSprachtechnologie/detox) datasets.
The second (easier) dataset was drawn from a combination of sources: Jigsaw and DeTox, as well as [ParaDetox](https://github.com/s-nlp/multilingual_detox/tree/main/data) translations and sentences extracted from [Reverso Context](https://context.reverso.net/translation/) by keyword.
### german_toxicity_classifier_plus_v2
test set size|accuracy|f1
-|-|-
200|0.767|0.787
400|0.9650|0.9651
### Perspective
test set size|accuracy|f1
-|-|-
200|0.834|0.820
400|0.892|0.885
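For reference, the scores above can be reproduced along these lines. This is a minimal sketch that assumes the test CSVs expose a text column and a 0/1 toxicity label column; the actual column names and the model's label strings may differ:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="EIStakovskii/german_toxicity_classifier_plus_v2",
)

# Assumed CSV layout: a "text" column and a binary "label" column (1 = toxic).
df = pd.read_csv("test_de_200.csv")
texts, gold = df["text"].tolist(), df["label"].tolist()

# Map predicted label strings to 0/1; the exact strings depend on the model's id2label.
preds = [1 if result["label"].upper().startswith("TOX") else 0 for result in classifier(texts)]

print("accuracy:", round(accuracy_score(gold, preds), 3))
print("f1:", round(f1_score(gold, preds), 3))
```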