---
language: de
widget:
- text: "Guten morgen, meine Liebe"
  example_title: "NOT TOXIC 1"
- text: "Ich scheiß drauf."
  example_title: "TOXIC 1"
- text: "Ich liebe dich"
  example_title: "NOT TOXIC 2"
- text: "Ich hab die Schnauze voll von diesen Irren."
  example_title: "TOXIC 2"
- text: "Ich wünsche Ihnen einen schönen Tag!"
  example_title: "NOT TOXIC 3"
- text: "Nigger"
  example_title: "TOXIC 3"
- text: "Du bist schon wieder zu spät!"
  example_title: "NOT TOXIC 4"
- text: "Beweg deinen AArschhh hier rüber"
  example_title: "TOXIC 4"
license: other
---

## Description

NB: this model is an improved version of [EIStakovskii/german_toxicity_classifier_plus](https://huggingface.co/EIStakovskii/german_toxicity_classifier_plus).

The training source code and data are available in [the GitHub repository](https://github.com/eistakovskii/NLP_projects/tree/main/TEXT_CLASSIFICATION).

This model was trained to label German text for toxicity.

It was fine-tuned from [dbmdz/bert-base-german-cased](https://huggingface.co/dbmdz/bert-base-german-cased).

To use the model:

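```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
classifier = pipeline("text-classification", model="EIStakovskii/german_toxicity_classifier_plus_v2")

# Classify a single German sentence; the result is a list of dicts with a label and a score
print(classifier("Verpiss dich von hier"))
```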
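
The pipeline returns a list of dicts with `label` and `score` keys. A minimal sketch of turning that output into a boolean flag follows. This card does not state the model's label names, so `LABEL_1` as the toxic class and the `0.5` cutoff are assumptions, and `flag_toxic` is a hypothetical helper. Check `id2label` in the model's config before relying on it.

```python
from typing import Dict, List

# Assumption: the toxic class is named "LABEL_1"; the real name is in the model config.
TOXIC_LABEL = "LABEL_1"


def flag_toxic(results: List[Dict], toxic_label: str = TOXIC_LABEL,
               min_score: float = 0.5) -> List[bool]:
    """Return True for each result predicted as toxic with at least min_score."""
    return [r["label"] == toxic_label and r["score"] >= min_score for r in results]


# Hand-written stand-in for pipeline output, so the sketch runs without downloading the model.
sample = [
    {"label": "LABEL_1", "score": 0.98},
    {"label": "LABEL_0", "score": 0.95},
]
print(flag_toxic(sample))
```

With the real model, the output of `classifier([...])` can be passed in place of `sample`.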

## Metrics (at validation)

epoch|step|eval_accuracy|eval_f1|eval_loss
-|-|-|-|-
0.8|1200|0.9132176234979973|0.9113535629048755|0.24135465919971466
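
For readers unfamiliar with the two scores, here is a minimal sketch of how accuracy and F1 can be computed from gold and predicted labels. It assumes binary labels with `1` meaning toxic and a binary F1 on the toxic class. The card does not say which F1 averaging was used in training, and the sample labels below are made up for illustration.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Compute accuracy and binary F1 (positive class = 1) for two equal-length label lists."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # toxic, predicted toxic
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not toxic, predicted toxic
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # toxic, predicted not toxic
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Illustrative labels only, not taken from the validation set.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.4f} f1={f1:.4f}")
```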

## Comparison against Perspective

This model was compared against Google's [Perspective API](https://developers.perspectiveapi.com/s/?language=en_US), which also detects toxicity.

Both models were tested on two datasets: one of [200 sentences](https://github.com/eistakovskii/NLP_projects/blob/main/TEXT_CLASSIFICATION/data/Toxicity_Classifiers/DE_FR/test/test_de_200.csv) and one of [400 sentences](https://github.com/eistakovskii/NLP_projects/blob/main/TEXT_CLASSIFICATION/data/Toxicity_Classifiers/DE_FR/test/test_de_400.csv).

The first (arguably harder) was collected from the [Jigsaw](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/data) and [DeTox](https://github.com/hdaSprachtechnologie/detox) datasets.

The second (easier) combines several sources: Jigsaw and DeTox, [ParaDetox](https://github.com/s-nlp/multilingual_detox/tree/main/data) translations, and sentences extracted from [Reverso Context](https://context.reverso.net/translation/) by keyword.

### german_toxicity_classifier_plus_v2

size|accuracy|f1
-|-|-
200|0.767|0.787
400|0.9650|0.9651

### Perspective

size|accuracy|f1
-|-|-
200|0.834|0.820
400|0.892|0.885
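
To read the two tables side by side, the short sketch below computes the per-dataset gap between this model and Perspective. The scores are copied from the tables above; `score_gaps` is a hypothetical helper, and a positive gap means this model scored higher.

```python
from typing import Dict, Tuple

Scores = Dict[int, Tuple[float, float]]  # test-set size -> (accuracy, f1)

# Values copied from the two comparison tables.
classifier_v2: Scores = {200: (0.767, 0.787), 400: (0.9650, 0.9651)}
perspective: Scores = {200: (0.834, 0.820), 400: (0.892, 0.885)}


def score_gaps(ours: Scores, theirs: Scores) -> Scores:
    """Return (accuracy gap, f1 gap) per test set, rounded to 4 decimals."""
    return {
        size: tuple(round(o - t, 4) for o, t in zip(ours[size], theirs[size]))
        for size in ours
    }


gaps = score_gaps(classifier_v2, perspective)
for size, (acc_gap, f1_gap) in gaps.items():
    print(f"{size} sentences: accuracy {acc_gap:+.4f}, f1 {f1_gap:+.4f}")
```

On these numbers, Perspective scores higher on the 200-sentence set, while this model scores higher on the 400-sentence set.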