Northern Frisian translation model
This is an NLLB-200-600M model fine-tuned for translating between German and the Northern Frisian dialects Mooringer Frasch and Wiringhiirder Freesk, following this great blogpost.
While the additional data introduced with the new dialect has improved the model's performance for German <-> Mooring translation compared to nllb-deu-moo, the extended training has also degraded performance for other languages. For example, translating English to Mooring still works relatively well, whereas translating Mooring to English does not.
Data
Mooring <-> German:
The Mooring dataset for fine-tuning consisted of 9339 sentence pairs. Most examples (roughly 5100) were taken directly from "Rüm Hart", published by the Nordfriisk Instituut. The Python sentence-splitting library was used for sentence splitting. The splitting was imperfect, especially for direct speech, so manual re-alignment and further splitting were necessary. In addition, the texts about larks from Föögle önj Nordfraschlönj (Marie Tångeberg, 1992), a translation of the story Bulemanns Haus by Theodor Storm, and roughly 3000 examples from the Frasch Uurdebök (Friesisches Wörterbuch, Neumünster 1988) were added. Finally, a little under 180 very simple self-written examples were used as the evaluation data set.
Wiringhiirder <-> German:
The Wiringhiirder dataset consisted of 7529 sentence pairs taken from the books "Di muon fuon e halie" and "Di tofel" by Peter Jensen, published by the Nordfriisk Instituut. Similar splitting and re-alignment measures were taken as for "Rüm Hart" above. For evaluation, sentences were collected from Wikipedia; however, the evaluation set remains very small and is barely enough to detect overfitting.
Usage
How to use the model:
!pip install transformers==4.33
from transformers import AutoModelForSeq2SeqLM, NllbTokenizer
def create_tokenizer_with_new_langs(model_id, new_langs):
    # Load the tokenizer and register each new language code as a language token
    tokenizer = NllbTokenizer.from_pretrained(model_id)
    for new_lang in new_langs:
        old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
        new_token_id = old_len - 1
        if new_lang in tokenizer.added_tokens_encoder:
            new_token_id = tokenizer.added_tokens_encoder[new_lang] - 1
        tokenizer.lang_code_to_id[new_lang] = new_token_id
        tokenizer.id_to_lang_code[new_token_id] = new_lang
        # always move "mask" to the last position
        tokenizer.fairseq_tokens_to_ids["<mask>"] = len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset
        tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)
        tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}
        if new_lang not in tokenizer._additional_special_tokens:
            tokenizer._additional_special_tokens.append(new_lang)
    # clear the added token encoder; otherwise a new token may end up there by mistake
    tokenizer.added_tokens_encoder = {}
    tokenizer.added_tokens_decoder = {}
    return tokenizer

def translate(
    text,
    tokenizer,
    model,
    src_lang='moo_Latn',
    tgt_lang='deu_Latn',
    a=32,  # constant part of the output length cap
    b=3,  # additional output tokens allowed per input token
    max_input_length=1024,
    num_beams=4,
    **kwargs
):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        # force the decoder to start with the target language token
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        num_beams=num_beams,
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)

path = "CmdCody/nllb-deu-frr"
tokenizer = create_tokenizer_with_new_langs(path, ['moo_Latn', 'wir_Latn'])
model = AutoModelForSeq2SeqLM.from_pretrained(path)
translate("Momme booget önj Naibel", tokenizer=tokenizer, model=model)
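The `a` and `b` arguments of `translate` only affect the cap on output length: `max_new_tokens` is `a` plus `b` times the tokenized input length (the padded length when a batch of texts is passed). The minimal sketch below isolates that formula so its effect can be checked without loading the model. The helper name `max_new_tokens_budget` is hypothetical and not part of the code above.

```python
def max_new_tokens_budget(input_length, a=32, b=3):
    """Output length cap used by translate(): a + b * number of input tokens."""
    return int(a + b * input_length)


# Print the cap for a few input lengths (in tokens)
for n_tokens in (5, 20, 100):
    print(n_tokens, max_new_tokens_budget(n_tokens))
```

Other translation directions use the same `translate` call with different language codes, for example `src_lang='deu_Latn'` and `tgt_lang='wir_Latn'`.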
Training
The model was trained in a Google Colab notebook for 4 epochs with a batch size of 16, following the above-mentioned blog post with two notable adaptations:
- The data iteration was changed to make sure that the model sees each example in the dataset exactly once per epoch.
- After tokenization and batching, the complete data set was shuffled before each epoch so that all translation directions were mixed. However, each batch contained examples for only one direction.
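The two adaptations can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `make_epoch_batches`, the dictionary keys `src_lang` and `tgt_lang`, and the per-epoch seed are assumptions, and tokenization is left out.

```python
import random
from collections import defaultdict


def make_epoch_batches(examples, batch_size, seed):
    """Build single-direction batches that cover every example exactly once,
    then shuffle the batches so that directions are mixed within the epoch."""
    # Group examples by translation direction (assumed keys: src_lang, tgt_lang)
    by_direction = defaultdict(list)
    for example in examples:
        by_direction[(example["src_lang"], example["tgt_lang"])].append(example)

    # Cut each group into consecutive batches; the last one may be smaller
    batches = []
    for items in by_direction.values():
        for start in range(0, len(items), batch_size):
            batches.append(items[start:start + batch_size])

    # Shuffle whole batches, never individual examples across directions
    random.Random(seed).shuffle(batches)
    return batches


# Toy data: 5 German -> Mooring pairs and 4 Mooring -> German pairs
toy = [{"id": i, "src_lang": "deu_Latn", "tgt_lang": "moo_Latn"} for i in range(5)]
toy += [{"id": 5 + i, "src_lang": "moo_Latn", "tgt_lang": "deu_Latn"} for i in range(4)]
for epoch in range(2):
    order = make_epoch_batches(toy, batch_size=2, seed=epoch)
    print(epoch, [[ex["id"] for ex in batch] for batch in order])
```

Using a different seed per epoch changes the batch order between epochs while still visiting every example once.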
Evaluation
Metrics on the evaluation data sets:
| | BLEU | ChrF++ |
|---|---|---|
| Moo -> Deu | 55.78 | 70.73 |
| Deu -> Moo | 50.19 | 67.76 |
| Wir -> Deu | 67.22 | 80.16 |
| Deu -> Wir | 42.35 | 61.08 |
Note: As mentioned above, the Wiringhiirder evaluation set is very small, so the resulting metrics should not be compared with the Mooring metrics.
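To give a feel for what the ChrF++ column measures, here is a simplified, sentence-level character n-gram F-score. It is only an illustration under stated assumptions: it ignores whitespace, averages the per-order F-scores, and leaves out the word unigrams and bigrams that ChrF++ adds on top of ChrF. The function names are hypothetical, the model card does not state which tool produced the scores above, and this sketch will not reproduce them.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified chrF: F-beta over character n-grams, averaged over orders."""
    scores = []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the texts is too short for this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
            continue
        # beta = 2 weights recall more heavily than precision
        scores.append((1 + beta**2) * precision * recall / (beta**2 * precision + recall))
    return 100 * sum(scores) / len(scores) if scores else 0.0


# An output identical to the reference gets the maximum score
print(simple_chrf("Momme booget önj Naibel", "Momme booget önj Naibel"))
```

Because the score is built from character n-grams, a translation that differs from the reference only in a word ending still gets partial credit, which is one reason character-based metrics are often reported alongside BLEU.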
Base model: facebook/nllb-200-distilled-600M