metadata

language:
  - cy
  - en
license: apache-2.0
pipeline_tag: translation
tags:
  - translation
  - marian
metrics:
  - bleu
widget:
  - text: >-
      Mae gan Lywodraeth Cymru targed i gyrraedd miliwn o siaradwyr Cymraeg
      erbyn y flwyddyn 2020.
model-index:
  - name: mt-general-cy-en
    results:
      - task:
          name: Translation
          type: translation
        metrics:
          - type: bleu
            value: 54

mt-general-cy-en

A general-purpose machine translation model for translating Welsh into English.

This model was trained using a custom DVC pipeline employing Marian NMT. The datasets were prepared from the following sources:

The data was split into train, validation and test sets, with the test set comprising a random slice of 20% of the total dataset. Segments were selected randomly from the text and TMX files in the datasets described above. The datasets were cleaned without any pre-tokenisation, encoded using a SentencePiece vocabulary model, and then fed into 10 separate Marian NMT training processes, the data having been split into 10 training and validation sets.
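The splitting scheme described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the actual DVC pipeline: the function name `split_corpus`, the round-robin sharding, the 10% validation fraction per shard and the fixed seed are all hypothetical choices, since the source only states the 20% test slice and the 10 training and validation sets.

```python
import random


def split_corpus(pairs, test_fraction=0.2, n_splits=10, valid_fraction=0.1, seed=0):
    """Return (test, splits), where splits is a list of (train, valid) tuples.

    valid_fraction and seed are assumptions made for this sketch only.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(pairs)
    rng.shuffle(shuffled)

    # Hold out a random slice (20% by default) as the test set.
    n_test = int(len(shuffled) * test_fraction)
    test, rest = shuffled[:n_test], shuffled[n_test:]

    # Deal the remaining segments round-robin into n_splits disjoint shards,
    # then carve a validation slice off the front of each shard.
    splits = []
    for i in range(n_splits):
        shard = rest[i::n_splits]
        n_valid = max(1, int(len(shard) * valid_fraction))
        splits.append((shard[n_valid:], shard[:n_valid]))
    return test, splits


# Toy corpus of placeholder (Welsh, English) segment pairs.
corpus = [(f"cy-{i}", f"en-{i}") for i in range(1000)]
test, splits = split_corpus(corpus)
print(len(test), len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each `(train, valid)` pair would then feed one of the separate training processes, while `test` stays untouched until evaluation.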

Evaluation

The BLEU evaluation score was produced using the Python library SacreBLEU.

Usage

Ensure you have the prerequisite Python libraries installed:

pip install transformers sentencepiece

import transformers

model_id = "mgrbyte/mt-general-cy-en"
# Load the tokenizer and the Marian seq2seq model from the Hugging Face Hub.
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)
translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)
translated = translate(
    "Mae gan Lywodraeth Cymru targed i gyrraedd miliwn o siaradwyr Cymraeg erbyn y flwyddyn 2020."
)
# The pipeline returns a list with one dict per input sentence.
print(translated[0]["translation_text"])