language:
- cy
- en
license: apache-2.0
pipeline_tag: translation
tags:
- translation
- marian
metrics:
- bleu
widget:
- text: >-
Mae gan Lywodraeth Cymru targed i gyrraedd miliwn o siariadwyr Cymraeg
erbyn y flwyddyn 2020.
model-index:
- name: mt-general-cy-en
results:
- task:
name: Translation
type: translation
metrics:
- type: bleu
value: 54
mt-general-cy-en
A general language translation model for translating from Welsh to English.
This model was trained using a custom DVC pipeline employing Marian NMT. The datasets were generated from the following sources:
The data was split into train, validation and test sets, with the test set comprising a random slice of 20% of the total dataset. Segments were selected randomly from the text and TMX files in the datasets described above. The datasets were cleaned, without any pre-tokenisation, using a SentencePiece vocabulary model, and then fed into 10 separate Marian NMT training processes, the data having been split into 10 training and validation sets.
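The split described above can be sketched roughly as follows. This is an illustration only, not the pipeline's actual code: the function name `split_segments`, the fixed seed, the round-robin assignment to the 10 parts, and the 10% validation share within each part are all assumptions, since the source states only the 20% test slice and the 10 training and validation sets.

```python
import random


def split_segments(segments, test_fraction=0.2, n_parts=10,
                   valid_fraction=0.1, seed=0):
    """Hold out a random test slice, then divide the rest into n_parts
    (train, validation) pairs. valid_fraction is a hypothetical value."""
    rng = random.Random(seed)
    shuffled = list(segments)
    rng.shuffle(shuffled)

    # Random 20% slice of the whole dataset for testing.
    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    remainder = shuffled[n_test:]

    parts = []
    for i in range(n_parts):
        # Round-robin assignment keeps the parts close to equal in size.
        part = remainder[i::n_parts]
        n_valid = int(len(part) * valid_fraction)
        parts.append((part[n_valid:], part[:n_valid]))
    return test, parts


segments = [f"segment {i}" for i in range(1000)]
test, parts = split_segments(segments)
print(len(test), len(parts))
```

Each of the 10 `(train, validation)` pairs would then feed one training process.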
Evaluation
The BLEU evaluation score was produced using the Python library SacreBLEU.
Usage
Ensure you have the prerequisite Python libraries installed:
pip install transformers sentencepiece
import transformers

# Load the tokenizer and model from the Hugging Face Hub.
model_id = "mgrbyte/mt-general-cy-en"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Build a translation pipeline around the loaded model.
translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)
translated = translate(
    "Mae gan Lywodraeth Cymru targed i gyrraedd miliwn o siariadwyr Cymraeg erbyn y flwyddyn 2020."
)

# The pipeline returns a list with one dict per input text.
print(translated[0]["translation_text"])