---
license: apache-2.0
language:
- fr
library_name: transformers
tags:
- NMT
- commonvoice
- pytorch
- pictograms
- translation
metrics:
- sacrebleu
inference: false
---

# t2p-nmt-commonvoice

*t2p-nmt-commonvoice* is a text-to-pictograms translation model obtained by training the [NMT](https://github.com/facebookresearch/fairseq/blob/main/examples/translation/README.md) model from scratch on a dataset of paired transcriptions and pictogram token sequences (each token is linked to a pictogram image from [ARASAAC](https://arasaac.org/)).
The model is used only for **inference**.

## Training details

The model was trained with [Fairseq](https://github.com/facebookresearch/fairseq/blob/main/examples/translation/README.md).

### Datasets

The model was trained on the [Propicto-commonvoice dataset](https://www.ortolang.fr/market/corpora/propicto), which was created from the CommonVoice v.15.0 corpus.
This dataset was built with the method presented in the research paper ["A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation"](https://aclanthology.org/2024.lrec-main.76/), published at LREC-COLING 2024. The dataset was split into training, validation, and test sets.

| **Split** | **Number of utterances** |
|:-----------:|:-----------------------:|
| train | 527,390 |
| valid | 16,124 |
| test | 16,120 |

### Parameters

The model was trained with the following arguments:

```bash
fairseq-train \
    data-bin/commonvoice.tokenized.fr-frp \
    --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --save-dir exp_commonvoice/checkpoints/nmt_fr_frp_commonvoice \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
    --max-epoch 40 \
    --keep-best-checkpoints 5 \
    --keep-last-epochs 5
```
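
The `--lr-scheduler inverse_sqrt`, `--lr 5e-4` and `--warmup-updates 4000` options describe a learning rate that rises linearly to its peak during warmup and then decays with the inverse square root of the update count. The following is a minimal Python sketch of that schedule, not Fairseq's own code: the function name is illustrative, and it assumes the warmup starts from 0 because the command does not set `--warmup-init-lr`.

```python
import math

PEAK_LR = 5e-4          # --lr
WARMUP_UPDATES = 4000   # --warmup-updates
WARMUP_INIT_LR = 0.0    # assumption: --warmup-init-lr is not set in the command


def inverse_sqrt_lr(num_updates: int) -> float:
    """Learning rate after `num_updates` optimizer steps (illustrative helper)."""
    if num_updates < WARMUP_UPDATES:
        # Linear warmup from WARMUP_INIT_LR up to PEAK_LR.
        step = (PEAK_LR - WARMUP_INIT_LR) / WARMUP_UPDATES
        return WARMUP_INIT_LR + num_updates * step
    # After warmup: decay proportional to 1 / sqrt(num_updates).
    return PEAK_LR * math.sqrt(WARMUP_UPDATES / num_updates)


if __name__ == "__main__":
    for n in (1000, 4000, 16000):
        print(f"update {n:>6}: lr = {inverse_sqrt_lr(n):.2e}")
```

Under these assumptions the rate peaks at update 4000 and is halved each time the update count quadruples.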

### Evaluation

The model was evaluated with sacreBLEU by comparing the model hypotheses against the reference pictogram translations.

```bash
fairseq-generate exp_commonvoice/data-bin/commonvoice.tokenized.fr-frp \
    --path exp_commonvoice/checkpoints/nmt_fr_frp_commonvoice/checkpoint.best_bleu_86.0600.pt \
    --batch-size 128 --beam 5 --remove-bpe > gen_cv.out
```
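The output file contains, for each sentence, the source (`S`), the reference (`T`), the hypothesis with its score (`H`), the detokenized hypothesis (`D`) and the per-token scores (`P`), followed by a corpus-level BLEU summary: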
```txt
S-2724	la planète terre
T-2724	le planète_terre
H-2724	-0.08702446520328522	le planète_terre
D-2724	-0.08702446520328522	le planète_terre
P-2724	-0.1058 -0.0340 -0.1213
Generate test with beam=5: BLEU4 = 82.60, 92.5/85.5/79.5/74.1 (BP=1.000, ratio=1.027, syslen=138507, reflen=134811)
```
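
To post-process `gen_cv.out`, the records can be grouped by sentence id and the summary line checked against the BLEU definition (brevity penalty times the geometric mean of the four n-gram precisions). The sketch below is a hedged Python illustration, not part of the released scripts: the function names and dictionary keys are hypothetical, the prefixes are read according to Fairseq's usual output convention, and the sample is the excerpt shown above with tab-separated fields.

```python
import math
import re

# Excerpt copied from the output above (fields are tab-separated).
SAMPLE = (
    "S-2724\tla planète terre\n"
    "T-2724\tle planète_terre\n"
    "H-2724\t-0.08702446520328522\tle planète_terre\n"
    "D-2724\t-0.08702446520328522\tle planète_terre\n"
    "P-2724\t-0.1058 -0.0340 -0.1213\n"
    "Generate test with beam=5: BLEU4 = 82.60, 92.5/85.5/79.5/74.1 "
    "(BP=1.000, ratio=1.027, syslen=138507, reflen=134811)\n"
)

LINE_RE = re.compile(r"^([STHDP])-(\d+)\t(.*)$")
SUMMARY_RE = re.compile(
    r"BLEU4 = ([\d.]+), ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) "
    r"\(BP=([\d.]+), ratio=([\d.]+), syslen=(\d+), reflen=(\d+)\)"
)


def parse_generate_output(text):
    """Group S/T/H/D/P lines by sentence id (hypothetical helper)."""
    records = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # e.g. the final BLEU summary line
        kind, sent_id, rest = match.group(1), int(match.group(2)), match.group(3)
        record = records.setdefault(sent_id, {})
        if kind == "S":
            record["source"] = rest
        elif kind == "T":
            record["reference"] = rest
        elif kind in ("H", "D"):
            score, _, tokens = rest.partition("\t")
            record["score"] = float(score)
            record["hypothesis" if kind == "H" else "detokenized"] = tokens
        else:  # "P": one score per generated token
            record["positional_scores"] = [float(x) for x in rest.split()]
    return records


def parse_bleu_summary(text):
    """Extract the numbers from the 'Generate test ...' summary line."""
    match = SUMMARY_RE.search(text)
    if match is None:
        raise ValueError("no BLEU summary line found")
    return {
        "bleu": float(match.group(1)),
        "precisions": [float(match.group(i)) for i in range(2, 6)],
        "syslen": int(match.group(8)),
        "reflen": int(match.group(9)),
    }


def recompute_bleu(summary):
    """BLEU = BP * geometric mean of the n-gram precisions (in percent)."""
    sys_len, ref_len = summary["syslen"], summary["reflen"]
    bp = 1.0 if sys_len >= ref_len else math.exp(1.0 - ref_len / sys_len)
    precisions = summary["precisions"]
    log_mean = sum(math.log(p / 100.0) for p in precisions) / len(precisions)
    return 100.0 * bp * math.exp(log_mean)


if __name__ == "__main__":
    print(parse_generate_output(SAMPLE)[2724])
    print(round(recompute_bleu(parse_bleu_summary(SAMPLE)), 2))
```

On this excerpt the recomputed value is about 82.6. It can differ slightly from the reported 82.60 because the precisions in the summary line are rounded to one decimal.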

### Results

Comparison with other translation models (BLEU scores):

| **Model** | **validation** | **test** |
|:-----------:|:-----------------------:|:-----------------------:|
| t2p-t5-large-commonvoice | 86.3 | 86.5 |
| **t2p-nmt-commonvoice** | 86.0 | 82.6 | 
| t2p-mbart-large-cc25-commonvoice | 72.3 | 72.3 |
| t2p-nllb-200-distilled-600M-commonvoice | **87.4** | **87.6** |

### Environmental Impact

Training was performed on a single NVIDIA V100 GPU with 32 GB of memory and took around 2 hours in total.

## Using the t2p-nmt-commonvoice model

The scripts to use the *t2p-nmt-commonvoice* model are located in the [speech-to-pictograms GitHub repository](https://github.com/macairececile/speech-to-pictograms).

## Information

- **Language(s):** French
- **License:** Apache-2.0
- **Developed by:** Cécile Macaire
- **Funded by:**
  - GENCI-IDRIS (Grant 2023-AD011013625R1)
  - PROPICTO ANR-20-CE93-0005
- **Authors:**
  - Cécile Macaire
  - Chloé Dion
  - Emmanuelle Esperança-Rodier
  - Benjamin Lecouteux
  - Didier Schwab


## Citation

If you use this model for your own research work, please cite as follows:

```bibtex
@inproceedings{macaire_jeptaln2024,
  title = {{Approches cascade et de bout-en-bout pour la traduction automatique de la parole en pictogrammes}},
  author = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Schwab, Didier and Lecouteux, Benjamin and Esperan{\c c}a-Rodier, Emmanuelle},
  url = {https://inria.hal.science/hal-04623007},
  booktitle = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}},
  address = {Toulouse, France},
  publisher = {{ATALA \& AFPC}},
  volume = {1 : articles longs et prises de position},
  pages = {22-35},
  year = {2024}
}
```