---
language:
- cy
- en
license: apache-2.0
pipeline_tag: translation
tags:
- translation
- marian
metrics:
- bleu
widget:
- text: "Mae gan Lywodraeth Cymru targed i gyrraedd miliwn o siariadwyr Cymraeg erbyn y flwyddyn 2020."
model-index:
- name: mt-general-cy-en
  results:
  - task:
      name: Translation
      type: translation
    metrics:
    - type: bleu
      value: 54
---

# mt-general-cy-en

A general language translation model for translating from Welsh to English.

This model was trained using a custom DVC pipeline employing [Marian NMT](https://marian-nmt.github.io/). The datasets were generated from the following sources:

- [UK Government Legislation data](https://www.legislation.gov.uk)
- [OPUS-cy-en](https://opus.nlpl.eu/)
- [Cofnod Y Cynulliad](https://record.assembly.wales/)
- [Cofion Techiaith Cymru](https://cofion.techiaith.cymru)

The data was split into train, validation and test sets; the test set comprises a random slice of 20% of the total dataset. Segments were selected randomly from the text and TMX files in the datasets described above.

The datasets were cleaned and, without any pre-tokenisation, processed using a SentencePiece vocabulary model. The data was then split into 10 training and validation sets and fed into 10 separate Marian NMT training processes.

## Evaluation

The BLEU evaluation score was produced using the Python library [SacreBLEU](https://github.com/mjpost/sacrebleu).

## Usage

Ensure you have the prerequisite Python libraries installed:

```bash
pip install transformers sentencepiece
```

```python
import transformers

model_id = "mgrbyte/mt-general-cy-en"

# Load the tokenizer and the sequence-to-sequence model from the Hugging Face Hub.
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Wrap both in a translation pipeline.
translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)

translated = translate(
    "Mae gan Lywodraeth Cymru targed i gyrraedd miliwn o siariadwyr Cymraeg erbyn y flwyddyn 2020."
)

# The pipeline returns a list with one dict per input text.
print(translated[0]["translation_text"])
```
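
## Illustrative data split

The data preparation described earlier (a random 20% test slice, with the remainder divided into 10 training and validation sets) could be sketched roughly as follows. This is not the project's actual DVC pipeline: the function name `split_segments`, the round-robin sharding, the fixed seed, and the 10% validation fraction within each shard are all assumptions made for illustration only.

```python
import random


def split_segments(pairs, test_fraction=0.2, n_shards=10,
                   validation_fraction=0.1, seed=0):
    """Hold out a random test slice, then split the rest into shards.

    Hypothetical helper: the validation fraction and the sharding
    strategy are assumptions, not taken from the original pipeline.
    """
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)

    # Random test slice (20% of the total by default).
    n_test = int(len(shuffled) * test_fraction)
    test, rest = shuffled[:n_test], shuffled[n_test:]

    # Deal the remaining segments round-robin into n_shards shards,
    # then carve a validation set out of each shard.
    shards = []
    for i in range(n_shards):
        shard = rest[i::n_shards]
        n_val = int(len(shard) * validation_fraction)
        shards.append({"validation": shard[:n_val], "train": shard[n_val:]})
    return test, shards


if __name__ == "__main__":
    # Placeholder segment pairs standing in for real Welsh/English data.
    pairs = [(f"cy segment {i}", f"en segment {i}") for i in range(1000)]
    test, shards = split_segments(pairs)
    print(len(test), len(shards), len(shards[0]["train"]))
```

Each entry in `shards` would then feed one of the separate training processes. Fixing the seed keeps the split reproducible between runs.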