Commit 0e1dea6 by cecilemacaire
Parent(s): 07920fc
Update README.md
README.md CHANGED

### Datasets

The [Propicto-orféo dataset](https://www.ortolang.fr/market/corpora/propicto) is used, which was created from the CEFC-Orféo corpus.
This dataset was presented in the research paper titled ["A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation"](https://aclanthology.org/2024.lrec-main.76/) at LREC-COLING 2024. The dataset was split into training, validation, and test sets.

| **Split** | **Number of utterances** |
|:---------:|:------------------------:|
| train     | 231,374                  |
| valid     | 28,796                   |
| test      | 29,009                   |
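
As a sketch only, here is one way the split files might be loaded; the file name and the tab-separated text/pictogram layout are assumptions, since the actual format is defined by the Propicto-orféo release on Ortolang, not by this commit:

```python
import csv

def load_split(path: str) -> list[tuple[str, str]]:
    """Read (French text, pictogram-token sequence) pairs from a TSV file."""
    # Assumed layout: column 0 = source utterance, column 1 = pictogram tokens.
    with open(path, newline="", encoding="utf-8") as f:
        return [(row[0], row[1]) for row in csv.reader(f, delimiter="\t")]

pairs = load_split("train.tsv")  # hypothetical file name
print(len(pairs))  # expected to match the table: 231,374 for train
```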

### Parameters

A full list of the parameters is available in the config.json file. These are the arguments used in the training pipeline:

```python
training_args = Seq2SeqTrainingArguments(
    # ... remaining arguments unchanged in this commit and elided here
)
```

[...]

The model was evaluated with [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu).

Comparison to other translation models:

| **Model** | **validation** | **test** |
|:---------:|:--------------:|:--------:|
| **t2p-t5-large-orféo** | 85.2 | 85.8 |
| t2p-nmt-orféo | **87.2** | **87.4** |
| t2p-mbart-large-cc25-orfeo | 75.2 | 75.6 |
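
Scores in this style can be computed with the `evaluate` implementation of sacreBLEU; a minimal sketch with placeholder inputs, since the actual prediction and reference files are not part of this commit:

```python
import evaluate

# Load the sacreBLEU metric from the Hugging Face evaluate library.
sacrebleu = evaluate.load("sacrebleu")

# Placeholder data: a real run would use the model's decoded outputs and
# the reference pictogram-token sequences from the valid/test splits.
predictions = ["je vouloir manger pomme"]
references = [["je vouloir manger pomme"]]

results = sacrebleu.compute(predictions=predictions, references=references)
print(round(results["score"], 1))  # corpus-level BLEU, 100.0 for this toy pair
```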

### Environmental Impact

Fine-tuning was performed on a single Nvidia V100 GPU with 32 GB of memory and took 16 hours in total.
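
For a rough sense of scale, the reported hardware and runtime can be turned into an energy estimate; a back-of-envelope sketch, assuming the V100's 300 W board power at full utilisation and an illustrative grid carbon intensity (neither figure is stated in the README):

```python
# Back-of-envelope energy/CO2 estimate for the reported fine-tuning run.
# Assumed: 300 W V100 board power at full load and 60 gCO2e/kWh grid
# intensity (roughly the French grid); only the 16 hours come from the README.
gpu_power_kw = 0.300
hours = 16
carbon_intensity_g_per_kwh = 60

energy_kwh = gpu_power_kw * hours                      # 4.8 kWh
emissions_g = energy_kwh * carbon_intensity_g_per_kwh  # 288 gCO2e
print(f"~{energy_kwh:.1f} kWh, ~{emissions_g:.0f} gCO2e")
```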
## Using the t2p-t5-large-orféo model with HuggingFace transformers

```python
# [...] model and tokenizer setup elided in this commit view
outputs = model.generate(
    inputs.to("cuda:0"), max_new_tokens=40, do_sample=True,
    # further generation arguments elided in this commit view
)
pred = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
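
For reference, a self-contained variant of the example above; the checkpoint identifier and the tokenizer/model setup lines are assumptions, since they are elided in this commit view:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical checkpoint name: substitute the actual repository id.
checkpoint = "Propicto/t2p-t5-large-orfeo"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).to("cuda:0")

inputs = tokenizer("je veux manger une pomme", return_tensors="pt").input_ids
outputs = model.generate(inputs.to("cuda:0"), max_new_tokens=40, do_sample=True)
pred = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(pred)
```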

## Linking and viewing the predicted sequence of tokens to the corresponding ARASAAC pictograms
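
The commit introduces this heading without a body yet. As a sketch only, one plausible approach is to map each predicted token to an ARASAAC pictogram ID and build an image URL from it; the lexicon and the static URL pattern below are assumptions, not part of this repository:

```python
# Hypothetical token-to-ID lexicon: the real mapping between pictogram
# tokens and ARASAAC IDs ships with the Propicto resources.
token_to_id = {"je": 6632, "vouloir": 31141, "manger": 6456}

def picto_urls(pred: str) -> list[str]:
    """Turn a predicted token sequence into ARASAAC image URLs."""
    urls = []
    for token in pred.split():
        picto_id = token_to_id.get(token)
        if picto_id is not None:
            # Assumed ARASAAC static-image URL pattern, keyed by pictogram ID.
            urls.append(f"https://static.arasaac.org/pictograms/{picto_id}/{picto_id}_300.png")
    return urls

print(picto_urls("je vouloir manger"))
```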

## Information

- **Language(s):** French

[...]

If you use this model for your own research work, please cite as follows:

```bibtex
@inproceedings{macaire_jeptaln2024,
  title     = {{Approches cascade et de bout-en-bout pour la traduction automatique de la parole en pictogrammes}},
  author    = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Schwab, Didier and Lecouteux, Benjamin and Esperan{\c c}a-Rodier, Emmanuelle},
  url       = {https://inria.hal.science/hal-04623007},
  booktitle = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}},
  address   = {Toulouse, France},
  publisher = {{ATALA \& AFPC}},
  volume    = {1 : articles longs et prises de position},
  pages     = {22-35},
  year      = {2024}
}
```