---
language:
- cy
- en
license: apache-2.0
pipeline_tag: translation
tags:
- translation
- marian
metrics:
- bleu
widget:
- text: "Mae gan Lywodraeth Cymru targed i gyrraedd miliwn o siariadwyr Cymraeg erbyn y flwyddyn 2020."
model-index:
- name: mt-general-cy-en
  results:
  - task:
      name: Translation
      type: translation
    metrics:
    - type: bleu
      value: 54
---
# mt-general-cy-en
A general-purpose machine translation model for translating from Welsh (cy) to English (en).
This model was trained using a custom DVC pipeline employing [Marian NMT](https://marian-nmt.github.io/).
The training datasets were generated from the following sources:
- [UK Government Legislation data](https://www.legislation.gov.uk)
- [OPUS-cy-en](https://opus.nlpl.eu/)
- [Cofnod Y Cynulliad](https://record.assembly.wales/)
- [Cofion Techiaith Cymru](https://cofion.techiaith.cymru)
The data was split into training, validation and test sets, with the test set comprising a random 20% slice of the total dataset. Segments were selected randomly from the text and TMX files in the datasets described above.
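As a rough illustration of the split described above, the following is a minimal sketch using only the Python standard library. It is not the pipeline's actual code: the function name `split_segments`, the 10% validation fraction, the fixed seed and the toy corpus are all assumptions made for the example. Only the 20% test fraction comes from the description above.

```python
import random


def split_segments(segments, test_fraction=0.2, validation_fraction=0.1, seed=0):
    """Shuffle segments and return (train, validation, test) lists.

    test_fraction=0.2 mirrors the 20% test slice described in this README;
    validation_fraction and seed are assumptions for illustration.
    """
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


# Toy parallel corpus of (Welsh, English) pairs, for illustration only.
pairs = [(f"brawddeg {i}", f"sentence {i}") for i in range(100)]
train, validation, test = split_segments(pairs)
print(len(train), len(validation), len(test))
```

Shuffling source and target sentences together as pairs keeps each Welsh segment aligned with its English counterpart in all three sets.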
The datasets were cleaned without any pre-tokenisation, using a SentencePiece vocabulary model, and then fed into 10 separate Marian NMT training processes, the data having been split into 10 training and validation sets.
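This README does not state how the 10 training and validation sets were formed. One possible reading is sketched below with the Python standard library: the segments are shuffled, dealt into 10 disjoint shards, and each shard is divided into its own training and validation portion. The function name `make_shards`, the 10% validation fraction and the seed are hypothetical and are not taken from the actual DVC pipeline.

```python
import random


def make_shards(segments, n_shards=10, validation_fraction=0.1, seed=0):
    """Deal segments into n_shards disjoint shards, each with train/validation parts.

    This is an assumed scheme for illustration, not the pipeline's real logic.
    """
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)
    shards = []
    for i in range(n_shards):
        part = shuffled[i::n_shards]  # every n-th segment, so shards never overlap
        n_validation = max(1, int(len(part) * validation_fraction))
        shards.append({
            "validation": part[:n_validation],
            "train": part[n_validation:],
        })
    return shards


# Toy data, for illustration only.
segments = [f"segment {i}" for i in range(1000)]
shards = make_shards(segments)
print(len(shards), len(shards[0]["train"]), len(shards[0]["validation"]))
```

Each shard could then be handed to its own training process.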
## Evaluation
The BLEU score was computed using the Python library [SacreBLEU](https://github.com/mjpost/sacrebleu).
## Usage
Ensure you have the prerequisite Python libraries installed:
```bash
pip install transformers sentencepiece
```
```python
import transformers

model_id = "mgrbyte/mt-general-cy-en"

# Load the tokenizer and the sequence-to-sequence model from the Hub.
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Build a translation pipeline around the loaded model and tokenizer.
translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)
translated = translate(
    "Mae gan Lywodraeth Cymru targed i gyrraedd miliwn o siariadwyr Cymraeg erbyn y flwyddyn 2020."
)

# The pipeline returns a list with one dict per input sentence.
print(translated[0]["translation_text"])
```