---
license: apache-2.0
language:
- fr
library_name: transformers
tags:
- NMT
- commonvoice
- pytorch
- pictograms
- translation
metrics:
- bleu
inference: false
---
# t2p-nmt-commonvoice
*t2p-nmt-commonvoice* is a text-to-pictograms translation model built by training the [NMT](https://github.com/facebookresearch/fairseq/blob/main/examples/translation/README.md) model from scratch on a dataset of paired transcriptions and pictogram token sequences (each token is linked to a pictogram image from [ARASAAC](https://arasaac.org/)).
The model is used only for **inference**.
## Training details
The model was trained with [Fairseq](https://github.com/facebookresearch/fairseq/blob/main/examples/translation/README.md).
### Datasets
The model was trained on the [Propicto-commonvoice dataset](https://www.ortolang.fr/market/corpora/propicto), which was created from the CommonVoice v.15.0 corpus.
This dataset was built with the method described in the research paper ["A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation"](https://aclanthology.org/2024.lrec-main.76/), presented at LREC-Coling 2024. The dataset was split into training, validation, and test sets.
| **Split** | **Number of utterances** |
|:-----------:|:-----------------------:|
| train | 527,390 |
| valid | 16,124 |
| test | 16,120 |
### Parameters
These are the arguments used in the training pipeline:
```bash
fairseq-train \
data-bin/commonvoice.tokenized.fr-frp \
--arch transformer_iwslt_de_en --share-decoder-input-output-embed \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--dropout 0.3 --weight-decay 0.0001 \
--save-dir exp_commonvoice/checkpoints/nmt_fr_frp_commonvoice \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 4096 \
--eval-bleu \
--eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
--eval-bleu-detok moses \
--eval-bleu-remove-bpe \
--eval-bleu-print-samples \
--best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
--max-epoch 40 \
--keep-best-checkpoints 5 \
--keep-last-epochs 5
```
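To make the learning-rate settings above more concrete, here is a minimal Python sketch of an inverse square root schedule with linear warmup, using the `--lr 5e-4` and `--warmup-updates 4000` values from the command. The function name `inverse_sqrt_lr` and the zero initial warmup learning rate are assumptions made for illustration; this is not Fairseq's own code, only an approximation of the behaviour the flags describe.

```python
import math


def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_updates=4000, warmup_init_lr=0.0):
    """Return the learning rate at a given update number.

    Hypothetical helper: warmup_init_lr=0.0 is an assumption, the other
    defaults are taken from the training command above.
    """
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr up to peak_lr.
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # After warmup, decay proportionally to 1 / sqrt(update number).
    return peak_lr * math.sqrt(warmup_updates / step)


if __name__ == "__main__":
    for step in (1, 2000, 4000, 16000, 64000):
        print(f"update {step:>6}: lr = {inverse_sqrt_lr(step):.2e}")
```

Under these assumptions, the learning rate peaks at update 4000 and is halved each time the update count quadruples after that.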
### Evaluation
The model was evaluated with BLEU, comparing each model hypothesis against the reference pictogram translation.
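As an illustration of how a BLEU comparison between reference and hypothesis works, the sketch below computes a simplified corpus-level BLEU (up to 4-grams, uniform weights, brevity penalty, no smoothing) with the Python standard library only. The function names and the example sentences are hypothetical, and this is not the scorer used to produce the reported results, so scores from it may differ from those of an established tool.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Simplified corpus-level BLEU (0-100) with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tok, hyp_tok = ref.split(), hyp.split()
        ref_len += len(ref_tok)
        hyp_len += len(hyp_tok)
        for n in range(1, max_n + 1):
            ref_counts = ngrams(ref_tok, n)
            hyp_counts = ngrams(hyp_tok, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for hypotheses shorter than the references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    # Hypothetical pictogram token sequences, for illustration only.
    refs = ["le chat dormir sur le canapé"]
    hyps = ["le chat dormir sur un canapé"]
    print(f"BLEU = {corpus_bleu(refs, hyps):.1f}")
```

Because pictogram sequences are sequences of tokens, the same n-gram overlap measure used for text translation applies directly.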
### Results
BLEU scores compared with other translation models:
| **Model** | **validation** | **test** |
|:-----------:|:-----------------------:|:-----------------------:|
| **t2p-t5-large-commonvoice** | 86.3 | 86.5 |
| t2p-nmt-commonvoice | 86.0 | 82.6 |
| t2p-mbart-large-cc25-commonvoice | 72.3 | 72.3 |
| t2p-nllb-200-distilled-600M-commonvoice | **87.4** | **87.6** |
### Environmental Impact
Training was performed on a single Nvidia V100 GPU with 32 GB of memory and took around 2 hours in total.
## Using the t2p-nmt-commonvoice model
The scripts to use the *t2p-nmt-commonvoice* model are located in the [speech-to-pictograms GitHub repository](https://github.com/macairececile/speech-to-pictograms).
## Information
- **Language(s):** French
- **License:** Apache-2.0
- **Developed by:** Cécile Macaire
- **Funded by**
  - GENCI-IDRIS (Grant 2023-AD011013625R1)
  - PROPICTO ANR-20-CE93-0005
- **Authors**
  - Cécile Macaire
  - Chloé Dion
  - Emmanuelle Esperança-Rodier
  - Benjamin Lecouteux
  - Didier Schwab
## Citation
If you use this model for your own research work, please cite as follows:
```bibtex
@inproceedings{macaire_jeptaln2024,
title = {{Approches cascade et de bout-en-bout pour la traduction automatique de la parole en pictogrammes}},
author = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Schwab, Didier and Lecouteux, Benjamin and Esperan{\c c}a-Rodier, Emmanuelle},
url = {https://inria.hal.science/hal-04623007},
booktitle = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}},
address = {Toulouse, France},
publisher = {{ATALA \& AFPC}},
volume = {1 : articles longs et prises de position},
pages = {22-35},
year = {2024}
}
```