---
language: de
widget:
 - text: "Guten morgen, meine Liebe"
   example_title: "NOT TOXIC 1"
 - text: "Ich scheiß drauf."
   example_title: "TOXIC 1"
 - text: "Ich liebe dich"
   example_title: "NOT TOXIC 2" 
 - text: "Ich hab die Schnauze voll von diesen Irren."
   example_title: "TOXIC 2" 
 - text: "Ich wünsche Ihnen einen schönen Tag!"   
   example_title: "NOT TOXIC 3"
 - text: "Nigger"
   example_title: "TOXIC 3" 
 - text: "Du bist schon wieder zu spät!"   
   example_title: "NOT TOXIC 4"
 - text: "Beweg deinen AArschhh hier rüber"
   example_title: "TOXIC 4" 
   
license: other
---
## Description
NB: this model is an improved version of [EIStakovskii/german_toxicity_classifier_plus](https://huggingface.co/EIStakovskii/german_toxicity_classifier_plus).
For the training source code and the data, please follow [the GitHub link](https://github.com/eistakovskii/NLP_projects/tree/main/TEXT_CLASSIFICATION).

The model was trained to label German text as toxic or non-toxic.

The model was fine-tuned from [the dbmdz/bert-base-german-cased model](https://huggingface.co/dbmdz/bert-base-german-cased).

To use the model:

```python
from transformers import pipeline

# Load the fine-tuned German toxicity classifier from the Hugging Face Hub
classifier = pipeline("text-classification", model="EIStakovskii/german_toxicity_classifier_plus_v2")

# Returns a list with one dict per input, containing the predicted label and its score
print(classifier("Verpiss dich von hier"))
```
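
The pipeline only returns the top label and its score. If you need raw class probabilities (for example, to apply a custom decision threshold), the model can be loaded directly; the snippet below is a minimal sketch, assuming the two class names exposed in the model config (`id2label`) map to non-toxic and toxic:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "EIStakovskii/german_toxicity_classifier_plus_v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("Verpiss dich von hier", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Softmax over the two classes; label names are taken from the model config
probs = torch.softmax(logits, dim=-1)[0]
for idx, prob in enumerate(probs):
    print(model.config.id2label[idx], round(prob.item(), 3))
```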

## Metrics (at validation):

epoch|step|eval_accuracy|eval_f1|eval_loss
-|-|-|-|-
0.8|1200|0.9132|0.9114|0.2414
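
The `eval_accuracy` and `eval_f1` values above were computed on the validation split. A `compute_metrics` callback of roughly the following form, passed to the Hugging Face `Trainer`, produces these two metrics; this is only a sketch, the actual training code is in the GitHub repository linked above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair handed over by transformers.Trainer
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Trainer prefixes these keys with "eval_" in its logs
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),
    }
```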

## Comparison against Perspective

This model was compared against Google's [Perspective API](https://developers.perspectiveapi.com/s/?language=en_US), which similarly detects toxicity.
The two models were tested on two datasets: one of [200 sentences](https://github.com/eistakovskii/NLP_projects/blob/main/TEXT_CLASSIFICATION/data/Toxicity_Classifiers/DE_FR/test/test_de_200.csv) and one of [400 sentences](https://github.com/eistakovskii/NLP_projects/blob/main/TEXT_CLASSIFICATION/data/Toxicity_Classifiers/DE_FR/test/test_de_400.csv).
The first (arguably harder) dataset was drawn from sentences in the [JigSaw](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/data) and [DeTox](https://github.com/hdaSprachtechnologie/detox) datasets.
The second (easier) dataset was drawn from a combination of sources: JigSaw and DeTox, as well as [ParaDetox](https://github.com/s-nlp/multilingual_detox/tree/main/data) translations and sentences extracted from [Reverso Context](https://context.reverso.net/translation/) by keyword.

### german_toxicity_classifier_plus_v2
test set size|accuracy|f1
-|-|-
200|0.767|0.787
400|0.9650|0.9651

### Perspective
test set size|accuracy|f1
-|-|-
200|0.834|0.820
400|0.892|0.885
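
The numbers for this model in the tables above can be reproduced with a short script over the linked CSVs. The sketch below assumes a CSV with a `text` column and a binary `label` column (1 = toxic), and that `LABEL_1` is the toxic class; these column names and the label mapping are assumptions, so check the CSV files and the model config before relying on them:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="EIStakovskii/german_toxicity_classifier_plus_v2",
)

# Assumed CSV layout: a "text" column and a binary "label" column (1 = toxic)
df = pd.read_csv("test_de_200.csv")
outputs = classifier(df["text"].tolist(), truncation=True)

# Assumed label mapping: LABEL_1 = toxic (verify against model.config.id2label)
preds = [1 if out["label"] == "LABEL_1" else 0 for out in outputs]

print("accuracy:", round(accuracy_score(df["label"], preds), 4))
print("f1:", round(f1_score(df["label"], preds), 4))
```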