---
license: apache-2.0
language:
- de
pipeline_tag: text-generation
tags:
- german
- deutsch
- simplification
- vereinfachung
---
# Model Card for simba_best_092024

<!-- Provide a quick summary of what the model is/does. -->

We fine-tuned [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) on approximately 800 newspaper articles that were simplified by the Austrian Press Agency.
Our aim was a model that can simplify German-language text.

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->



- **Developed by:** Members of the [Public Interest AI research group](https://publicinterest.ai/), [HIIG Berlin](https://www.hiig.de/)
- **Model type:** simplification model, text generation
- **Language(s) (NLP):** German
- **License:** Apache 2.0
- **Finetuned from model:** meta-llama/Meta-Llama-3-8B-Instruct

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/fhewett/simba
<!-- - **Paper [optional]:** [More Information Needed] -->
- **Project website:** https://publicinterest.ai/tool/simba

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

This model works best for simplifying German-language newspaper articles (news items, not commentaries or editorials). It may also work for other text types, but we have not tested this extensively.

### Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
We fine-tuned the model using only newspaper articles. We have not yet performed extensive out-of-domain testing, but we believe the model's capabilities could be improved by fine-tuning on more diverse data. Contact us if you have a dataset that you think could work (parallel texts in standard German and simplified German).

<!-- ### Out-of-Scope Use -->

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

As with most text generation models, the model sometimes produces information that is incorrect. 

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Please check manually that the output text is consistent with the input text, as the model may introduce factual inconsistencies.
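As a possible aid to this manual check, the sketch below flags numbers that appear in the simplified text but not in the source. It is only a rough heuristic and not part of the model or the authors' tooling: the function name `unsupported_numbers` and the regular expression are assumptions made for illustration, and numbers written out as words are not detected.

```python
import re
from typing import List

# Hypothetical helper (not from the Simba repository): matches digit
# sequences, optionally with "." or "," separators, e.g. "1.200" or "3,5".
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def unsupported_numbers(source: str, simplified: str) -> List[str]:
    """Return numbers found in `simplified` that do not occur in `source`."""
    source_numbers = set(NUMBER.findall(source))
    return [n for n in NUMBER.findall(simplified) if n not in source_numbers]


source = "Am 3. Mai kamen 1.200 Menschen nach Wien."
simplified = "Im Mai kamen 1.500 Menschen nach Wien."
print(unsupported_numbers(source, simplified))  # ['1.500']
```

An empty result does not mean the output is correct; it only means that no new numbers were introduced.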

## How to Get Started with the Model

We offer two tools to interact with our model: an online app and a browser extension. They can be viewed and used [here](https://publicinterest.ai/tool/simba?lang=en).

Alternatively, load the model with the `transformers` library:

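```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hiig-piai/simba_best_092024"
# Use a GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# float16 halves memory use; on CPU, float32 may be the safer choice.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)
```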

We used the following prompt at inference to test our model:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Du bist ein hilfreicher Assistent und hilfst dem User, Texte besser zu verstehen.<|eot_id|><|start_header_id|>user<|end_header_id|>
Kannst du bitte den folgenden Text zusammenfassen und sprachlich auf ein A2-Niveau in Deutsch vereinfachen? Schreibe maximal 5 Sätze.
{input_text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```
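
The prompt above can be filled in and passed to a loaded model roughly as follows. This is a minimal sketch, not the authors' inference code: the helper names `build_prompt` and `simplify`, the greedy decoding and the `max_new_tokens` value are assumptions. The template text is copied from the prompt above, with line breaks where the prompt shows them. The `model` and `tokenizer` arguments are expected to be the objects returned by `from_pretrained`.

```python
# Prompt template copied from the model card; {input_text} is the placeholder.
PROMPT_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n"
    "Du bist ein hilfreicher Assistent und hilfst dem User, "
    "Texte besser zu verstehen.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n"
    "Kannst du bitte den folgenden Text zusammenfassen und sprachlich auf ein "
    "A2-Niveau in Deutsch vereinfachen? Schreibe maximal 5 Sätze.\n"
    "{input_text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
)


def build_prompt(input_text: str) -> str:
    """Insert the text to simplify into the prompt template."""
    return PROMPT_TEMPLATE.format(input_text=input_text.strip())


def simplify(text: str, model, tokenizer, max_new_tokens: int = 256) -> str:
    """Generate a simplified version of `text` (assumed decoding settings)."""
    prompt = build_prompt(text)
    # The template already starts with <|begin_of_text|>, so the tokenizer
    # must not add a second beginning-of-text token.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    inputs = inputs.to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,  # assumed limit, not from the card
        do_sample=False,                # greedy decoding, an assumption
        eos_token_id=tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    )
    # Keep only the newly generated tokens, dropping the prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

With a model and tokenizer loaded as shown earlier, `simplify(article_text, model, tokenizer)` returns the generated text, where `article_text` is a placeholder for your own input string.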

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

A sample of the data used to train our model can be found [here](https://github.com/fhewett/apa-rst/tree/main/original_texts).

#### Training Hyperparameters

- **Training regime:** Our training script can be found [here](https://github.com/fhewett/simba/blob/main/models/train_simba.py). <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

<!-- #### Speeds, Sizes, Times [optional]  -->

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

#### Summary 


<!-- ## Citation [optional]

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]-->

## Model Card Contact

simba -at- hiig.de