metadata

license: apache-2.0
language:
  - fr
library_name: transformers
tags:
  - t5
  - orfeo
  - pytorch
  - pictograms
  - translation
metrics:
  - bleu
widget:
  - text: je mange une pomme
    example_title: A simple sentence
  - text: je ne pense pas à toi
    example_title: Sentence with a negation
  - text: il y a 2 jours, les gendarmes ont vérifié ma licence
    example_title: Sentence with a polylexical term

t2p-t5-large-orféo

t2p-t5-large-orféo is a text-to-pictograms translation model built by fine-tuning the t5-large model on a dataset of paired transcriptions and pictogram token sequences (each token is linked to a pictogram image from ARASAAC). The model is intended for inference only.

Training details

Datasets

The model was fine-tuned on the Propicto-orféo dataset, which was built from the CEFC-Orféo corpus. The dataset was presented in the paper "A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation" at LREC-Coling 2024. It is split into training, validation, and test sets.

Split	Number of utterances
train	231,374
valid	28,796
test	29,009
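
As a quick sanity check on the table above, the following sketch computes the share of each split from the reported utterance counts. Only the three counts come from this card; the variable names are illustrative.

```python
# Utterance counts as reported in the split table.
splits = {"train": 231_374, "valid": 28_796, "test": 29_009}

total = sum(splits.values())

# Fraction of the whole dataset that falls in each split.
shares = {name: count / total for name, count in splits.items()}

for name, share in shares.items():
    print(f"{name}: {splits[name]:,} utterances ({share:.1%})")
```

The counts correspond to roughly an 80/10/10 split.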

Parameters

A full list of the parameters is available in the config.json file. These are the arguments used in the training pipeline:

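from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="checkpoints_orfeo/",
    # Evaluate and save once per epoch (recent transformers releases
    # name the first argument eval_strategy).
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    weight_decay=0.01,
    # Keep at most three checkpoints on disk.
    save_total_limit=3,
    num_train_epochs=40,
    # Run generate() during evaluation so that text metrics can be computed.
    predict_with_generate=True,
    fp16=True,
    load_best_model_at_end=True
)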
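
To give a sense of scale, the sketch below estimates the number of optimizer steps these arguments imply. It assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; none of these are stated explicitly in the arguments above, so treat the result as an estimate.

```python
import math

train_utterances = 231_374   # size of the training split
batch_size = 32              # per_device_train_batch_size
epochs = 40                  # num_train_epochs

# Assumption: one device and no gradient accumulation, so one batch = one step.
steps_per_epoch = math.ceil(train_utterances / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 7231
print(total_steps)      # 289240
```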

Evaluation

The model was evaluated with sacreBLEU by comparing each model hypothesis against the reference pictogram translation.

Results

Comparison with other translation models (sacreBLEU scores):

Model	validation	test
t2p-t5-large-orféo	85.2	85.8
t2p-nmt-orféo	87.2	87.4
t2p-mbart-large-cc25-orfeo	75.2	75.6
t2p-nllb-200-distilled-600M-orfeo	86.3	86.9

Environmental Impact

Fine-tuning ran on a single Nvidia V100 GPU with 32 GB of memory and took 16 hours in total.

Using the t2p-t5-large-orféo model with Hugging Face transformers

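import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder: set this to the model's Hub id or to a local checkpoint directory.
checkpoint = "path/to/t2p-t5-large-orfeo"

# Language codes and length limits; not used by the minimal example below.
source_lang = "fr"
target_lang = "frp"
max_input_length = 128
max_target_length = 128

# Use the GPU when available, otherwise fall back to the CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).to(device)

# Tokenize the French sentence and sample a pictogram token sequence.
inputs = tokenizer("Je mange une pomme", return_tensors="pt").input_ids
outputs = model.generate(inputs.to(device), max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)
pred = tokenizer.decode(outputs[0], skip_special_tokens=True)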

Linking the predicted sequence of tokens to the corresponding ARASAAC pictograms and viewing them
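
One way to do this is sketched below, under several assumptions: the lexicon is a hypothetical in-memory dictionary, the pictogram ids are placeholders rather than real ARASAAC ids, the predicted sequence is illustrative, and the image URL pattern should be checked against the ARASAAC documentation. In practice the lexicon would map each token produced by the model to its ARASAAC pictogram id.

```python
from html import escape

# Hypothetical lexicon: pictogram token -> ARASAAC pictogram id.
# The ids are placeholders for illustration, not real ARASAAC ids.
LEXICON = {"je": 1001, "manger": 1002, "un": 1003, "pomme": 1004}

# Assumed URL pattern for ARASAAC static images.
ARASAAC_URL = "https://static.arasaac.org/pictograms/{id}/{id}_2500.png"


def tokens_to_pictograms(pred, lexicon):
    """Split a predicted sequence on whitespace and look up each token.

    Tokens missing from the lexicon are paired with None.
    """
    return [(token, lexicon.get(token)) for token in pred.split()]


def render_html(pairs):
    """Build an HTML fragment with one figure per token.

    Known tokens get an image and a caption; unknown tokens get a caption only.
    """
    cells = []
    for token, picto_id in pairs:
        label = escape(token)
        if picto_id is None:
            cells.append(f"<figure><figcaption>{label}</figcaption></figure>")
        else:
            url = ARASAAC_URL.format(id=picto_id)
            cells.append(
                f'<figure><img src="{url}" alt="{label}" width="150">'
                f"<figcaption>{label}</figcaption></figure>"
            )
    return "\n".join(cells)


pred = "je manger un pomme"  # illustrative model output
pairs = tokens_to_pictograms(pred, LEXICON)
page = render_html(pairs)
print(page)
```

The resulting fragment can be written to an `.html` file and opened in a browser to view the pictogram sequence.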

Information

Language(s): French
License: Apache-2.0
Developed by: Cécile Macaire
Funded by:
- GENCI-IDRIS (Grant 2023-AD011013625R1)
- PROPICTO ANR-20-CE93-0005
Authors:
- Cécile Macaire
- Chloé Dion
- Emmanuelle Esperança-Rodier
- Benjamin Lecouteux
- Didier Schwab

Citation

If you use this model for your own research work, please cite as follows:

@inproceedings{macaire_jeptaln2024,
  title = {{Approches cascade et de bout-en-bout pour la traduction automatique de la parole en pictogrammes}},
  author = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Schwab, Didier and Lecouteux, Benjamin and Esperan{\c c}a-Rodier, Emmanuelle},
  url = {https://inria.hal.science/hal-04623007},
  booktitle = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}},
  address = {Toulouse, France},
  publisher = {{ATALA \& AFPC}},
  volume = {1 : articles longs et prises de position},
  pages = {22-35},
  year = {2024}
}