cecilemacaire committed
Commit 0e1dea6
1 parent: 07920fc

Update README.md

Files changed (1): README.md (+18, -6)
README.md CHANGED
@@ -28,18 +28,18 @@ widget:

### Datasets

- We used the [Propicto-orféo dataset](https://www.ortolang.fr/market/corpora/propicto), which was created from the CEFC-Orféo corpus.
+ The [Propicto-orféo dataset](https://www.ortolang.fr/market/corpora/propicto), created from the CEFC-Orféo corpus, is used.
This dataset was presented in the research paper titled ["A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation"](https://aclanthology.org/2024.lrec-main.76/) at LREC-Coling 2024. The dataset was split into training, validation, and test sets.

| **Split** | **Number of utterances** |
- |-----------|:-----------------------:|
+ |:-----------:|:-----------------------:|
| train | 231,374 |
| valid | 28,796 |
| test | 29,009 |

### Parameters

- A full list of the parameters is available in the config.json file. We specified the following arguments in the training pipeline :
+ A full list of the parameters is available in the config.json file. These are the arguments used in the training pipeline:

```python
training_args = Seq2SeqTrainingArguments(
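
The hunk stops at the opening line of the configuration, so the actual arguments are not visible here. Purely as a hedged sketch of what a `Seq2SeqTrainingArguments` setup for this kind of fine-tuning might look like (every value below is an assumption, not the published setting; the real ones live in config.json):

```python
# Editor's sketch only: the diff hunk truncates the real configuration.
# Every value here is an assumption; the published settings are in config.json.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="t2p-t5-large-orfeo",    # hypothetical output path
    learning_rate=2e-5,                 # assumed
    per_device_train_batch_size=8,      # assumed
    per_device_eval_batch_size=8,       # assumed
    num_train_epochs=40,                # assumed
    save_total_limit=2,                 # assumed
    predict_with_generate=True,         # typical for seq2seq fine-tuning
    fp16=True,                          # typical on a V100
)
```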
@@ -66,7 +66,7 @@ The model was evaluated with [sacreBLEU](https://huggingface.co/spaces/evaluate-

Comparison to other translation models:
| **Model** | **validation** | **test** |
- |-----------|:-----------------------:|:-----------------------:|
+ |:-----------:|:-----------------------:|:-----------------------:|
| **t2p-t5-large-orféo** | 85.2 | 85.8 |
| t2p-nmt-orféo | **87.2** | **87.4** |
| t2p-mbart-large-cc25-orfeo | 75.2 | 75.6 |
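
As context for the scores above: the card states the model was evaluated with sacreBLEU. A minimal sketch of that scoring step with the `evaluate` library, using invented placeholder strings:

```python
# Minimal sketch of the sacreBLEU evaluation step via the evaluate library.
# The prediction/reference strings are invented placeholders.
import evaluate

sacrebleu = evaluate.load("sacrebleu")
predictions = ["je vouloir manger pomme"]    # hypothetical model output
references = [["je vouloir manger pomme"]]   # hypothetical gold pictogram tokens
result = sacrebleu.compute(predictions=predictions, references=references)
print(round(result["score"], 1))
```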
@@ -74,7 +74,7 @@ Comparison to other translation models:

### Environmental Impact

-
+ Fine-tuning was performed on a single Nvidia V100 GPU with 32 GB of memory and took 16 hours in total.

## Using t2p-t5-large-orféo model with HuggingFace transformers

@@ -94,6 +94,8 @@ outputs = model.generate(inputs.to("cuda:0"), max_new_tokens=40, do_sample=True,
pred = tokenizer.decode(outputs[0], skip_special_tokens=True)
```

+ ## Linking and viewing the predicted sequence of tokens to the corresponding ARASAAC pictograms
+
## Information

- **Language(s):** French
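
The usage snippet is split across the two hunks above; assembled into a self-contained form it might look as follows. The checkpoint id and the sampling arguments (`top_k`, `top_p`) are assumptions, since the hunk truncates the `generate` call:

```python
# Self-contained version of the snippet shown in the hunks; the checkpoint id
# is assumed to be this repository, and top_k/top_p are illustrative.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Propicto/t2p-t5-large-orfeo")
model = AutoModelForSeq2SeqLM.from_pretrained("Propicto/t2p-t5-large-orfeo").to("cuda:0")

inputs = tokenizer("je veux manger", return_tensors="pt").input_ids
outputs = model.generate(inputs.to("cuda:0"), max_new_tokens=40, do_sample=True, top_k=50, top_p=0.95)
pred = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(pred)
```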
@@ -115,5 +117,15 @@ pred = tokenizer.decode(outputs[0], skip_special_tokens=True)

If you use this model for your own research work, please cite as follows:

```bibtex
-
+ @inproceedings{macaire_jeptaln2024,
+   title = {{Approches cascade et de bout-en-bout pour la traduction automatique de la parole en pictogrammes}},
+   author = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Schwab, Didier and Lecouteux, Benjamin and Esperan{\c c}a-Rodier, Emmanuelle},
+   url = {https://inria.hal.science/hal-04623007},
+   booktitle = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}},
+   address = {Toulouse, France},
+   publisher = {{ATALA \& AFPC}},
+   volume = {1 : articles longs et prises de position},
+   pages = {22-35},
+   year = {2024}
+ }
```
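
The commit also adds a "Linking and viewing the predicted sequence of tokens to the corresponding ARASAAC pictograms" heading with no body yet. A hedged sketch of what that linking step could look like; the ARASAAC endpoints below are assumptions to verify against the public API documentation:

```python
# Hedged sketch: map each predicted token to an ARASAAC pictogram image.
# Both URLs below are assumptions about ARASAAC's public API -- verify them.
import requests

pred = "je vouloir manger"  # hypothetical model output

for token in pred.split():
    # Search for a French pictogram matching the token (assumed endpoint).
    r = requests.get(f"https://api.arasaac.org/api/pictograms/fr/search/{token}", timeout=10)
    if r.ok and r.json():
        picto_id = r.json()[0]["_id"]
        # Assumed static-image URL pattern (300 px PNG).
        print(token, f"https://static.arasaac.org/pictograms/{picto_id}/{picto_id}_300.png")
    else:
        print(token, "-> no pictogram found")
```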
 