---
license: apache-2.0
language:
- fr
library_name: transformers
tags:
- t5
- orfeo
- pytorch
- pictograms
- translation
metrics:
- bleu
widget:
- text: "je mange une pomme"
  example_title: "A simple sentence"
- text: "je ne pense pas à toi"
  example_title: "Sentence with a negation"
- text: "il y a 2 jours, les gendarmes ont vérifié ma licence"
  example_title: "Sentence with a polylexical term"
---
# t2p-t5-large-orféo
*t2p-t5-large-orféo* is a text-to-pictograms translation model built by fine-tuning the [t5-large](https://huggingface.co/google-t5/t5-large) model on a dataset of paired transcriptions and pictogram token sequences (each token is linked to a pictogram image from [ARASAAC](https://arasaac.org/)).
## Training details
### Datasets
The model was fine-tuned on the [Propicto-orféo dataset](https://www.ortolang.fr/market/corpora/propicto), which was created from the CEFC-Orféo corpus.
This dataset was presented in the research paper ["A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation"](https://aclanthology.org/2024.lrec-main.76/) at LREC-Coling 2024. The dataset was split into training, validation, and test sets.
| **Split** | **Number of utterances** |
|:-----------:|:-----------------------:|
| train | 231,374 |
| valid | 28,796 |
| test | 29,009 |
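The split sizes above correspond to roughly an 80/10/10 partition. The short sketch below only recomputes those proportions from the utterance counts in the table; the variable names are illustrative and not part of the original pipeline.

```python
# Utterance counts copied from the table above.
split_sizes = {"train": 231_374, "valid": 28_796, "test": 29_009}

total = sum(split_sizes.values())

# Share of each split, as a percentage of all utterances.
split_shares = {name: 100 * count / total for name, count in split_sizes.items()}

for name, share in split_shares.items():
    print(f"{name}: {split_sizes[name]:,} utterances ({share:.1f}%)")
print(f"total: {total:,} utterances")
```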
### Parameters
A full list of the parameters is available in the config.json file. These are the arguments used in the training pipeline:
```python
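from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="checkpoints_orfeo/",
    # Newer transformers releases name this argument `eval_strategy`.
    evaluation_strategy="epoch",
    save_strategy="epoch",  # must match the evaluation strategy for load_best_model_at_end
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    weight_decay=0.01,
    save_total_limit=3,  # keep only the three most recent checkpoints
    num_train_epochs=40,
    predict_with_generate=True,  # generate sequences during evaluation
    fp16=True,  # mixed-precision training
    load_best_model_at_end=True
)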
```
### Evaluation
The model was evaluated with [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu/blob/d94719691d29f7adf7151c8b1471de579a78a280/sacrebleu.py), where we compared the reference pictogram translation with the model hypothesis.
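To illustrate what this comparison measures, here is a minimal sketch of a simplified corpus-level BLEU in pure Python. It is not sacreBLEU: it assumes whitespace tokenization, a single reference per hypothesis, and no smoothing, so its scores will not match the reported ones exactly. The function names and the example token sequences are made up for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU on a 0-100 scale (no smoothing, one reference each)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_ngrams, ref_ngrams = ngrams(hyp_tok, n), ngrams(ref_tok, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())
    # Without smoothing, any n-gram order with zero matches gives a score of zero.
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses that are shorter than the references.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Made-up pictogram token sequences, for illustration only.
references = ["je manger un pomme rouge", "il aller à école demain"]
hypotheses = ["je manger un pomme rouge", "il aller à école hier"]
print(f"simplified BLEU: {simple_corpus_bleu(hypotheses, references):.1f}")
```

For reported results, the sacreBLEU implementation linked above should be used, since tokenization and smoothing choices affect the score.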
### Results
Comparison with other translation models (sacreBLEU scores, best result in bold):
| **Model** | **validation** | **test** |
|:-----------:|:-----------------------:|:-----------------------:|
| **t2p-t5-large-orféo** | 85.2 | 85.8 |
| t2p-nmt-orféo | **87.2** | **87.4** |
| t2p-mbart-large-cc25-orfeo | 75.2 | 75.6 |
| t2p-nllb-200-distilled-600M-orfeo | 86.3 | 86.9 |
### Environmental Impact
Fine-tuning was performed on a single Nvidia V100 GPU with 32 GB of memory and took 16 hours in total.
## Using the t2p-t5-large-orféo model with Hugging Face transformers
```python
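import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder: set this to the Hub ID or local path of the t2p-t5-large-orféo checkpoint.
checkpoint = "path/to/t2p-t5-large-orfeo"
source_lang = "fr"
target_lang = "frp"
max_input_length = 128
max_target_length = 128

# Use the GPU when available; the model and the inputs must be on the same device.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer("Je mange une pomme", return_tensors="pt").input_ids
# Sampling (do_sample=True) makes the output non-deterministic.
outputs = model.generate(inputs.to(device), max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)
pred = tokenizer.decode(outputs[0], skip_special_tokens=True)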
```
## Linking the predicted sequence of tokens to the corresponding ARASAAC pictograms and viewing them
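The following is a minimal sketch of one way to do this, not the authors' implementation. It assumes a lexicon that maps each predicted token to an ARASAAC pictogram ID; the lexicon contents, the IDs, the helper names, and the image URL pattern are all illustrative assumptions and should be checked against the ARASAAC resources before use.

```python
import html

# Hypothetical lexicon: predicted token -> ARASAAC pictogram ID.
# These IDs are placeholders for illustration, not real ARASAAC identifiers.
EXAMPLE_LEXICON = {"je": 1001, "manger": 1002, "pomme": 1003}

# Assumed URL pattern for pictogram images; verify it against the ARASAAC documentation.
ARASAAC_URL = "https://static.arasaac.org/pictograms/{picto_id}/{picto_id}_500.png"


def tokens_to_ids(prediction, lexicon):
    """Map each whitespace-separated token to (token, id); id is None when unknown."""
    return [(token, lexicon.get(token)) for token in prediction.split()]


def render_html(pairs):
    """Build a small HTML page showing one captioned image per known pictogram."""
    figures = []
    for token, picto_id in pairs:
        if picto_id is None:
            continue  # skip tokens that have no pictogram in the lexicon
        url = ARASAAC_URL.format(picto_id=picto_id)
        caption = html.escape(token)
        figures.append(
            f'<figure><img src="{url}" alt="{caption}" width="150">'
            f"<figcaption>{caption}</figcaption></figure>"
        )
    return "<html><body>\n" + "\n".join(figures) + "\n</body></html>"


# Example with a made-up prediction string.
pairs = tokens_to_ids("je manger pomme", EXAMPLE_LEXICON)
page = render_html(pairs)
print(page)
```

The resulting string can be written to an `.html` file and opened in a browser to view the pictogram sequence.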
## Information
- **Language(s):** French
- **License:** Apache-2.0
- **Developed by:** Cécile Macaire
- **Funded by**
  - GENCI-IDRIS (Grant 2023-AD011013625R1)
  - PROPICTO ANR-20-CE93-0005
- **Authors**
  - Cécile Macaire
  - Chloé Dion
  - Emmanuelle Esperança-Rodier
  - Benjamin Lecouteux
  - Didier Schwab
## Citation
If you use this model for your own research work, please cite as follows:
```bibtex
@inproceedings{macaire_jeptaln2024,
title = {{Approches cascade et de bout-en-bout pour la traduction automatique de la parole en pictogrammes}},
author = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Schwab, Didier and Lecouteux, Benjamin and Esperan{\c c}a-Rodier, Emmanuelle},
url = {https://inria.hal.science/hal-04623007},
booktitle = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}},
address = {Toulouse, France},
publisher = {{ATALA \& AFPC}},
volume = {1 : articles longs et prises de position},
pages = {22-35},
year = {2024}
}
```