---
language:
- cy
- en
license: apache-2.0
pipeline_tag: translation
tags:
- translation
- marian
metrics:
- bleu
widget:
- text: "Mae gan Lywodraeth Cymru targed i gyrraedd miliwn o siariadwyr Cymraeg erbyn y flwyddyn 2020."
model-index:
- name: mt-general-cy-en
  results:
  - task:
      name: Translation
      type: translation
    metrics:
    - type: bleu
      value: 54
---

# mt-general-cy-en

A general language translation model for translating between Welsh and English.

This model was trained using a custom DVC pipeline employing [Marian NMT](https://marian-nmt.github.io/).
The training datasets were generated from the following sources:

- [UK Government Legislation data](https://www.legislation.gov.uk)
- [OPUS-cy-en](https://opus.nlpl.eu/)
- [Cofnod Y Cynulliad](https://record.assembly.wales/)
- [Cofion Techiaith Cymru](https://cofion.techiaith.cymru)

The data was split into train, validation and test sets, with the test set comprising a random slice of 20% of the total dataset. Segments were selected randomly from the text and TMX data in the datasets described above.
The datasets were cleaned without any pre-tokenisation, using a SentencePiece vocabulary model, and then fed into 10 separate Marian NMT training processes, the data having been split into 10 training and validation sets.
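
The splitting described above can be sketched roughly as follows. This is a minimal illustration rather than the actual DVC pipeline: the function name `split_segments`, the fixed seed, and the 10% validation fraction inside each shard are all assumptions, since only the 20% test slice and the 10 training and validation sets are stated here.

```python
import random


def split_segments(pairs, test_fraction=0.2, n_shards=10, valid_fraction=0.1, seed=0):
    """Shuffle (cy, en) pairs, hold out a test set, then shard the remainder.

    Hypothetical helper: valid_fraction and seed are assumed values.
    """
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)

    # Random slice of the whole dataset reserved for testing.
    n_test = int(len(shuffled) * test_fraction)
    test, rest = shuffled[:n_test], shuffled[n_test:]

    # One training/validation pair per training process.
    shards = []
    for i in range(n_shards):
        shard = rest[i::n_shards]  # every n_shards-th pair, starting at offset i
        n_valid = max(1, int(len(shard) * valid_fraction))
        shards.append({"valid": shard[:n_valid], "train": shard[n_valid:]})
    return test, shards


# Placeholder data standing in for real Welsh/English segment pairs.
pairs = [(f"brawddeg {i}", f"sentence {i}") for i in range(1000)]
test, shards = split_segments(pairs)
print(len(test), len(shards), len(shards[0]["train"]), len(shards[0]["valid"]))
```

Fixing the seed keeps the held-out test slice stable between runs, so that scores from separate training processes stay comparable.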

## Evaluation

The BLEU evaluation score was produced using the Python library [SacreBLEU](https://github.com/mjpost/sacrebleu).

## Usage

Ensure you have the prerequisite Python libraries installed:

```bash
pip install transformers sentencepiece
```

```python
import transformers

model_id = "mgrbyte/mt-general-cy-en"

# Load the tokenizer and model, then wrap them in a translation pipeline.
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)
translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)

translated = translate(
    "Mae gan Lywodraeth Cymru targed i gyrraedd miliwn o siariadwyr Cymraeg erbyn y flwyddyn 2020."
)

# The pipeline returns a list with one dict per input sentence.
print(translated[0]["translation_text"])
```
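
To translate several sentences, the pipeline can also be called with a list of strings, in which case it returns one dict per input. The helper below is a small sketch of that pattern; `translate_all` and its `batch_size` default are hypothetical names chosen for illustration, and `fake_translate` is only a stand-in so the snippet runs without downloading the model.

```python
def translate_all(translate, sentences, batch_size=8):
    """Translate sentences in batches and return the translated strings."""
    outputs = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        # A translation pipeline called with a list returns one dict per input.
        results = translate(batch)
        outputs.extend(item["translation_text"] for item in results)
    return outputs


def fake_translate(batch):
    """Stand-in for the real pipeline; it only tags each input string."""
    return [{"translation_text": f"<en> {text}"} for text in batch]


print(translate_all(fake_translate, ["Bore da.", "Diolch yn fawr."]))
```

With the real model, pass the `translate` pipeline object from the usage example in place of `fake_translate`.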