	---
	license: apache-2.0
	language:
	- fr
	library_name: transformers
	tags:
	- mbart
	- commonvoice
	- pytorch
	- pictograms
	- translation
	metrics:
	- sacrebleu
	inference: false
	---

	# t2p-mbart-large-cc25-commonvoice

	t2p-mbart-large-cc25-commonvoice is a text-to-pictograms translation model built by fine-tuning the [mbart-large-cc25](https://huggingface.co/facebook/mbart-large-cc25) model on a dataset of paired transcriptions and pictogram token sequences (each token is linked to a pictogram image from [ARASAAC](https://arasaac.org/)).
	The model is used only for inference.

	## Training details

	The model was trained with [Fairseq](https://github.com/facebookresearch/fairseq/blob/main/examples/mbart/README.md).

	### Datasets

	The model was trained on the [Propicto-commonvoice dataset](https://www.ortolang.fr/market/corpora/propicto), which was created from the CommonVoice v.15.0 corpus.
	This dataset was built with the method described in the research paper ["A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation"](https://aclanthology.org/2024.lrec-main.76/), presented at LREC-COLING 2024. The dataset was split into training, validation, and test sets.

	| Split | Number of utterances |
	|:-----------:|:-----------------------:|
	| train | 527,390 |
	| valid | 16,124 |
	| test | 16,120 |

	### Parameters

	The training pipeline was run with the following arguments:

	```bash
	fairseq-train $DATA \
	--encoder-normalize-before --decoder-normalize-before \
	--arch mbart_large --layernorm-embedding \
	--task translation_from_pretrained_bart \
	--source-lang fr --target-lang frp \
	--criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
	--optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
	--lr-scheduler polynomial_decay --lr 3e-05 --warmup-updates 2500 --total-num-update 40000 \
	--dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
	--max-tokens 1024 --update-freq 2 \
	--save-interval 1 --save-interval-updates 5000 --keep-interval-updates 5 \
	--seed 222 --log-format simple --log-interval 2 \
	--langs $langs \
	--ddp-backend legacy_ddp \
	--max-epoch 40 \
	--save-dir models/checkpoints/mt_mbart_fr_frp_commonvoice_langs \
	--keep-best-checkpoints 5 \
	--keep-last-epochs 5
	```
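The `--lr-scheduler polynomial_decay` arguments above describe a linear warmup to the peak learning rate over 2,500 updates, followed by a decay until update 40,000. The following is a minimal Python sketch of that curve, not Fairseq code. It assumes an end learning rate of 0 and a power of 1 (neither is set in the command, so both are assumptions here), and `polynomial_decay_lr` is a hypothetical helper name.

```python
def polynomial_decay_lr(step, peak_lr=3e-05, warmup_updates=2500,
                        total_num_update=40000, end_lr=0.0, power=1.0):
    """Learning rate at a given update for linear warmup + polynomial decay.

    peak_lr, warmup_updates and total_num_update mirror the training command;
    end_lr and power are assumed values, not taken from the command.
    """
    if warmup_updates > 0 and step <= warmup_updates:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / warmup_updates
    if step >= total_num_update:
        # After the last scheduled update the rate stays at its final value.
        return end_lr
    # Fraction of the decay phase that is still ahead.
    remaining = 1 - (step - warmup_updates) / (total_num_update - warmup_updates)
    return (peak_lr - end_lr) * remaining ** power + end_lr


if __name__ == "__main__":
    for step in (0, 1250, 2500, 21250, 40000):
        print(step, polynomial_decay_lr(step))
```

Under these assumptions the rate reaches 3e-05 at update 2,500, is back to half of that at update 21,250 (the midpoint of the decay phase), and reaches 0 at update 40,000.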

	### Evaluation

	The model was evaluated with sacreBLEU by comparing the model hypothesis against the reference pictogram translation.

	```bash
	fairseq-generate commonvoice_data/data/ \
	--path $model_dir/checkpoint_best.pt \
	--task translation_from_pretrained_bart \
	--gen-subset test \
	-t frp -s fr \
	--bpe 'sentencepiece' --sentencepiece-model mbart.cc25.v2/sentence.bpe.model \
	--sacrebleu \
	--batch-size 32 --langs $langs > out.txt
	```
	The output file contains the following information:
	```txt
	S-1071 cette collaboration dure trois ans<unk>
	T-1071 le collaboration durer 3 année
	H-1071 -0.2111533135175705 ▁le ▁collaboration ▁dur er ▁3 ▁année
	D-1071 -0.2111533135175705 le collaboration durer 3 année
	P-1071 -0.2783 -0.0584 -0.2309 -0.2009 -0.2145 -0.1210 -0.3330 -0.2523
	Generate test with beam=5: BLEU4 = 72.31, 84.3/77.4/72.3/67.7 (BP=0.962, ratio=0.963, syslen=227722, reflen=236545)
	```
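In this output, `S` is the source sentence, `T` the reference pictogram sequence, `H` the tokenized hypothesis with its score, `D` the detokenized hypothesis, and `P` the per-token scores. The Python sketch below shows one possible way to read such a file and to sanity-check the summary line. The helper names `parse_generate_output` and `recompute_bleu` are hypothetical, and the parser assumes whitespace-separated fields as displayed above.

```python
import math
import re
from collections import defaultdict

# Sample copied from the output shown above.
SAMPLE = """\
S-1071 cette collaboration dure trois ans<unk>
T-1071 le collaboration durer 3 année
H-1071 -0.2111533135175705 ▁le ▁collaboration ▁dur er ▁3 ▁année
D-1071 -0.2111533135175705 le collaboration durer 3 année
P-1071 -0.2783 -0.0584 -0.2309 -0.2009 -0.2145 -0.1210 -0.3330 -0.2523
Generate test with beam=5: BLEU4 = 72.31, 84.3/77.4/72.3/67.7 (BP=0.962, ratio=0.963, syslen=227722, reflen=236545)
"""

LINE_RE = re.compile(r"^([STHDP])-(\d+)\s+(.*)$")
BLEU_RE = re.compile(
    r"BLEU4 = ([\d.]+), ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) "
    r"\(BP=([\d.]+), ratio=([\d.]+), syslen=(\d+), reflen=(\d+)\)"
)


def parse_generate_output(text):
    """Return (sentences, summary) parsed from fairseq-generate output."""
    sentences = defaultdict(dict)
    summary = None
    for raw in text.splitlines():
        line = raw.strip()
        match = LINE_RE.match(line)
        if match:
            kind, idx, rest = match.group(1), int(match.group(2)), match.group(3)
            if kind in "HD":
                # Hypothesis lines carry a score before the text.
                parts = rest.split(None, 1)
                sentences[idx][kind + "_score"] = float(parts[0])
                sentences[idx][kind] = parts[1] if len(parts) > 1 else ""
            elif kind == "P":
                sentences[idx]["P"] = [float(x) for x in rest.split()]
            else:
                sentences[idx][kind] = rest
            continue
        bleu = BLEU_RE.search(line)
        if bleu:
            values = [float(v) for v in bleu.groups()]
            summary = {
                "bleu": values[0],
                "precisions": values[1:5],
                "bp": values[5],
                "ratio": values[6],
                "syslen": int(values[7]),
                "reflen": int(values[8]),
            }
    return dict(sentences), summary


def recompute_bleu(summary):
    """Recompute the brevity penalty and BLEU from the summary fields."""
    syslen, reflen = summary["syslen"], summary["reflen"]
    # The brevity penalty only applies when the system output is shorter.
    bp = 1.0 if syslen >= reflen else math.exp(1 - reflen / syslen)
    # BLEU is BP times the geometric mean of the 1- to 4-gram precisions.
    log_mean = sum(math.log(p / 100) for p in summary["precisions"]) / 4
    return bp, 100 * bp * math.exp(log_mean)


if __name__ == "__main__":
    sentences, summary = parse_generate_output(SAMPLE)
    bp, bleu = recompute_bleu(summary)
    print(sentences[1071]["D"])
    print(f"BP={bp:.3f}, BLEU~{bleu:.1f}")
```

Because the n-gram precisions in the summary line are rounded to one decimal, the recomputed BLEU only matches the reported 72.31 to within rounding.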

	### Results

	Comparison with other translation models (sacreBLEU scores):

	| Model | Validation | Test |
	|:-----------:|:-----------------------:|:-----------------------:|
	| t2p-t5-large-commonvoice | 86.3 | 86.5 |
	| t2p-nmt-commonvoice | 86.0 | 82.6 |
	| t2p-mbart-large-cc25-commonvoice | 72.3 | 72.3 |
	| t2p-nllb-200-distilled-600M-commonvoice | 87.4 | 87.6 |

	### Environmental Impact

	Training was performed on a single Nvidia V100 GPU with 32 GB of memory and took around 18 hours in total.

	## Using t2p-mbart-large-cc25-commonvoice

	The scripts to use the t2p-mbart-large-cc25-commonvoice model are located in the [speech-to-pictograms GitHub repository](https://github.com/macairececile/speech-to-pictograms).

	## Information

	- Language(s): French
	- License: Apache-2.0
	- Developed by: Cécile Macaire
	- Funded by:
	  - GENCI-IDRIS (Grant 2023-AD011013625R1)
	  - PROPICTO ANR-20-CE93-0005
	- Authors:
	  - Cécile Macaire
	  - Chloé Dion
	  - Emmanuelle Esperança-Rodier
	  - Benjamin Lecouteux
	  - Didier Schwab


	## Citation

	If you use this model for your own research work, please cite as follows:

	```bibtex
	@inproceedings{macaire_jeptaln2024,
	title = {{Approches cascade et de bout-en-bout pour la traduction automatique de la parole en pictogrammes}},
	author = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Schwab, Didier and Lecouteux, Benjamin and Esperan{\c c}a-Rodier, Emmanuelle},
	url = {https://inria.hal.science/hal-04623007},
	booktitle = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}},
	address = {Toulouse, France},
	publisher = {{ATALA \& AFPC}},
	volume = {1 : articles longs et prises de position},
	pages = {22-35},
	year = {2024}
	}
	```