opus-mt-tc-bible-big-afa-deu_eng_fra_por_spa

Model Details

Neural machine translation model for translating from Afro-Asiatic languages (afa) to German, English, French, Portuguese, and Spanish (deu+eng+fra+por+spa).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many of the world's languages. All models are originally trained with the Marian NMT framework, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and the training pipelines follow the procedures of OPUS-MT-train.

Model Description

This is a multilingual translation model with multiple target languages. A sentence-initial language token is required in the form >>id<< (where id is a valid target language ID), e.g. >>deu<<.
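The token requirement can be illustrated with a small helper. This is a minimal sketch, not part of the model's API: the function name add_target_token and the TARGET_LANGS set are hypothetical, and the set of IDs is assumed from the target languages listed in the model name.

```python
# Target language IDs assumed from the model name (deu+eng+fra+por+spa).
TARGET_LANGS = {"deu", "eng", "fra", "por", "spa"}


def add_target_token(sentence: str, target: str) -> str:
    """Prefix a source sentence with the >>id<< token (hypothetical helper)."""
    if target not in TARGET_LANGS:
        raise ValueError(f"unsupported target language ID: {target!r}")
    # The token goes at the very start, separated from the text by one space.
    return f">>{target}<< {sentence.strip()}"


print(add_target_token("Anta i ak-d-yennan ur yerbiḥ ara Tom?", "eng"))
# prints: >>eng<< Anta i ak-d-yennan ur yerbiḥ ara Tom?
```

Rejecting unknown IDs up front is a design choice of this sketch; it avoids silently sending the model a token it was not trained with.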

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing or offensive and that can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

# Each source sentence starts with the >>id<< token of its target language.
src_text = [
    ">>eng<< Anta i ak-d-yennan ur yerbiḥ ara Tom?",
    ">>fra<< Iselman d aɣbalu axatar i wučči n yemdanen."
]

# Hub ID of the model, matching the pipeline example below.
model_name = "Helsinki-NLP/opus-mt-tc-bible-big-afa-deu_eng_fra_por_spa"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding, then generate the translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     Who told you that he didn't?
#     L'eau est une source importante de nourriture pour les gens.

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-afa-deu_eng_fra_por_spa")
print(pipe(">>eng<< Anta i ak-d-yennan ur yerbiḥ ara Tom?"))

# expected output: Who told you that he didn't?
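Because the target language is selected per sentence, one source sentence can be sent to all five targets in a single call. The following is a minimal sketch under stated assumptions: translate_to_all and echo_pipe are hypothetical names, the pipe argument is assumed to behave like a transformers translation pipeline (a list of strings in, one dict with a "translation_text" key per input out), and echo_pipe is only a stand-in so the example runs without downloading the model.

```python
from typing import Callable, Dict, List, Sequence

# Target language IDs assumed from the model name.
TARGETS = ("deu", "eng", "fra", "por", "spa")


def translate_to_all(
    pipe: Callable[[List[str]], List[dict]],
    sentence: str,
    targets: Sequence[str] = TARGETS,
) -> Dict[str, str]:
    """Translate one sentence into every target language (hypothetical helper)."""
    # One input per target language, each carrying its own >>id<< token.
    inputs = [f">>{t}<< {sentence}" for t in targets]
    outputs = pipe(inputs)
    # Assumes one {"translation_text": ...} dict per input, in input order.
    return {t: out["translation_text"] for t, out in zip(targets, outputs)}


def echo_pipe(texts: List[str]) -> List[dict]:
    """Offline stand-in for a real pipeline; it only echoes its input."""
    return [{"translation_text": text} for text in texts]


result = translate_to_all(echo_pipe, "Anta i ak-d-yennan ur yerbiḥ ara Tom?")
print(list(result))
# prints: ['deu', 'eng', 'fra', 'por', 'spa']

# With the real model (downloads weights on first use):
# from transformers import pipeline
# pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-afa-deu_eng_fra_por_spa")
# print(translate_to_all(pipe, "Anta i ak-d-yennan ur yerbiḥ ara Tom?"))
```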

Training

Evaluation

langpair testset chr-F BLEU #sent #words
ara-deu tatoeba-test-v2021-08-07 0.61039 41.7 1209 8371
ara-eng tatoeba-test-v2021-08-07 5.430 0.0 10305 76975
ara-fra tatoeba-test-v2021-08-07 0.56120 38.8 1569 11066
ara-spa tatoeba-test-v2021-08-07 0.62567 43.7 1511 9708
heb-deu tatoeba-test-v2021-08-07 0.63131 42.4 3090 25101
heb-eng tatoeba-test-v2021-08-07 0.64960 49.2 10519 77427
heb-fra tatoeba-test-v2021-08-07 0.64348 46.3 3281 26123
heb-por tatoeba-test-v2021-08-07 0.63350 43.2 719 5335
mlt-eng tatoeba-test-v2021-08-07 0.66653 51.0 203 1165
amh-eng flores101-devtest 0.47357 21.0 1012 24721
amh-fra flores101-devtest 0.43155 16.2 1012 28343
amh-por flores101-devtest 0.42109 15.1 1012 26519
ara-deu flores101-devtest 0.51110 20.4 1012 25094
ara-fra flores101-devtest 0.56934 29.7 1012 28343
ara-por flores101-devtest 0.55727 28.2 1012 26519
ara-spa flores101-devtest 0.48350 19.5 1012 29199
hau-eng flores101-devtest 0.46804 21.6 1012 24721
hau-fra flores101-devtest 0.41827 15.9 1012 28343
heb-eng flores101-devtest 0.62422 36.6 1012 24721
mlt-eng flores101-devtest 0.72390 49.1 1012 24721
mlt-fra flores101-devtest 0.60840 34.7 1012 28343
mlt-por flores101-devtest 0.59863 31.8 1012 26519
acm-deu flores200-devtest 0.48947 17.6 1012 25094
acm-eng flores200-devtest 0.56799 28.5 1012 24721
acm-fra flores200-devtest 0.53577 26.1 1012 28343
acm-por flores200-devtest 0.52441 23.9 1012 26519
acm-spa flores200-devtest 0.46985 18.2 1012 29199
amh-deu flores200-devtest 0.41553 12.6 1012 25094
amh-eng flores200-devtest 0.49333 22.5 1012 24721
amh-fra flores200-devtest 0.44890 17.8 1012 28343
amh-por flores200-devtest 0.43771 16.5 1012 26519
apc-deu flores200-devtest 0.47480 16.0 1012 25094
apc-eng flores200-devtest 0.56075 28.1 1012 24721
apc-fra flores200-devtest 0.52325 24.6 1012 28343
apc-por flores200-devtest 0.51055 22.9 1012 26519
apc-spa flores200-devtest 0.45634 17.2 1012 29199
arz-deu flores200-devtest 0.45844 14.1 1012 25094
arz-eng flores200-devtest 0.52534 22.7 1012 24721
arz-fra flores200-devtest 0.50336 21.8 1012 28343
arz-por flores200-devtest 0.48741 20.0 1012 26519
arz-spa flores200-devtest 0.44516 15.8 1012 29199
hau-eng flores200-devtest 0.48137 23.4 1012 24721
hau-fra flores200-devtest 0.42981 17.2 1012 28343
hau-por flores200-devtest 0.41385 15.7 1012 26519
heb-deu flores200-devtest 0.53482 22.8 1012 25094
heb-eng flores200-devtest 0.63368 38.0 1012 24721
heb-fra flores200-devtest 0.58417 32.6 1012 28343
heb-por flores200-devtest 0.57140 30.7 1012 26519
mlt-eng flores200-devtest 0.73415 51.1 1012 24721
mlt-fra flores200-devtest 0.61626 35.8 1012 28343
mlt-spa flores200-devtest 0.50534 21.8 1012 29199
som-eng flores200-devtest 0.42764 17.7 1012 24721
tir-por flores200-devtest 2.931 0.0 1012 26519
hau-eng newstest2021 0.43744 15.5 997 27372
amh-eng ntrex128 0.42042 15.0 1997 47673
hau-eng ntrex128 0.50349 26.1 1997 47673
hau-fra ntrex128 0.41837 15.8 1997 53481
hau-por ntrex128 0.40851 15.3 1997 51631
hau-spa ntrex128 0.43376 18.5 1997 54107
heb-deu ntrex128 0.49482 17.7 1997 48761
heb-eng ntrex128 0.59241 31.3 1997 47673
heb-fra ntrex128 0.52180 24.0 1997 53481
heb-por ntrex128 0.51248 23.2 1997 51631
mlt-spa ntrex128 0.57078 30.9 1997 54107
som-eng ntrex128 0.49187 24.3 1997 47673
som-fra ntrex128 0.41236 15.1 1997 53481
som-por ntrex128 0.41550 15.2 1997 51631
som-spa ntrex128 0.43278 17.6 1997 54107
tir-eng tico19-test 2.655 0.0 2100 56824
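
The rows above are whitespace-separated, so they can be loaded for further analysis with a few lines of code. This is a minimal sketch: the Score type and parse_row function are hypothetical names, and the three sample rows are copied from the table.

```python
from typing import NamedTuple


class Score(NamedTuple):
    langpair: str
    testset: str
    chrf: float
    bleu: float
    n_sent: int
    n_words: int


def parse_row(line: str) -> Score:
    """Split one table row into typed fields (hypothetical helper)."""
    langpair, testset, chrf, bleu, n_sent, n_words = line.split()
    return Score(langpair, testset, float(chrf), float(bleu), int(n_sent), int(n_words))


# Sample rows copied verbatim from the table.
rows = [
    "heb-eng tatoeba-test-v2021-08-07 0.64960 49.2 10519 77427",
    "mlt-eng flores200-devtest 0.73415 51.1 1012 24721",
    "som-eng flores200-devtest 0.42764 17.7 1012 24721",
]
scores = [parse_row(r) for r in rows]

# Pick the sample row with the highest BLEU score.
best = max(scores, key=lambda s: s.bleu)
print(best.langpair, best.testset, best.bleu)
# prints: mlt-eng flores200-devtest 51.1
```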

Citation Information

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

  • transformers version: 4.45.1
  • OPUS-MT git hash: a0ea3b3
  • port time: Mon Oct 7 17:08:30 EEST 2024
  • port machine: LM0-400-22516.local

Model size: 239M params (Safetensors, tensor type F32)