---
language:
- cy
- en
license: apache-2.0
pipeline_tag: translation
tags:
- translation
- marian
metrics:
  - bleu
widget:
 - text: "Mae gan Lywodraeth Cymru targed i gyrraedd miliwn o siariadwyr Cymraeg erbyn y flwyddyn 2020."
model-index:
- name: mt-general-cy-en
  results:
  - task:
      name: Translation
      type: translation
    metrics:
      - type: bleu
        value: 54
---
# mt-general-cy-en
A general-purpose machine translation model for translating from Welsh to English.

This model was trained using a custom DVC pipeline employing [Marian NMT](https://marian-nmt.github.io/).
The training datasets were generated from the following sources:
 - [UK Government Legislation data](https://www.legislation.gov.uk)
 - [OPUS-cy-en](https://opus.nlpl.eu/)
 - [Cofnod Y Cynulliad](https://record.assembly.wales/)
 - [Cofion Techiaith Cymru](https://cofion.techiaith.cymru)

The data was split into train, validation and test sets, with the test set comprising a random slice of 20% of the total dataset. Segments were selected randomly from the text and TMX files in the datasets described above.
The datasets were cleaned without any pre-tokenisation, utilising a SentencePiece vocabulary model, and then fed into 10 separate Marian NMT training processes, the data having been split into 10 training and validation sets.
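
The split described above could be sketched roughly as follows. This is only an illustration, not the pipeline's actual code: the function name `split_segments`, the fixed seed, and the k-fold style rotation used to form the 10 training and validation sets are all assumptions, since the card does not say how those sets were built.

```python
import random


def split_segments(pairs, test_fraction=0.2, n_splits=10, seed=0):
    """Hold out a random test slice, then build n_splits train/validation sets.

    Hypothetical helper: the k-fold style rotation below is an assumption.
    """
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)

    # Random slice of the whole dataset reserved for testing.
    n_test = int(len(shuffled) * test_fraction)
    test, rest = shuffled[:n_test], shuffled[n_test:]

    # Deal the remaining segments into n_splits folds.
    folds = [rest[i::n_splits] for i in range(n_splits)]

    # Each split validates on one fold and trains on the others.
    splits = []
    for i in range(n_splits):
        validation = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        splits.append((train, validation))
    return test, splits


# Placeholder (Welsh, English) segment pairs, purely for demonstration.
pairs = [(f"cy segment {i}", f"en segment {i}") for i in range(100)]
test, splits = split_segments(pairs)
print(len(test), len(splits), len(splits[0][0]), len(splits[0][1]))
```

Seeding the generator keeps the held-out test slice stable between runs, so scores from the separate training processes stay comparable.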

## Evaluation

The BLEU evaluation score was produced using the Python library [SacreBLEU](https://github.com/mjpost/sacrebleu).

## Usage

Ensure you have the prerequisite Python libraries installed:

```bash
pip install transformers sentencepiece
```

```python
import transformers

model_id = "mgrbyte/mt-general-cy-en"
# Load the tokenizer and model weights from the Hugging Face Hub.
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)
translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)
translated = translate(
    "Mae gan Lywodraeth Cymru targed i gyrraedd miliwn o siariadwyr Cymraeg erbyn y flwyddyn 2020."
)
# The pipeline returns a list with one dict per input text.
print(translated[0]["translation_text"])
```