---
license: apache-2.0
language:
- fr
library_name: transformers
tags:
- nllb
- commonvoice
- orfeo
- tedx
- pytorch
- pictograms
- translation
metrics:
- sacrebleu
widget:
- text: "je mange une pomme"
  example_title: "A simple sentence"
- text: "je ne pense pas à toi"
  example_title: "Sentence with a negation"
- text: "il y a 2 jours, les gendarmes ont vérifié ma licence"
  example_title: "Sentence with a polylexical term"
---

# t2p-nllb-200-distilled-600M-all

*t2p-nllb-200-distilled-600M-all* is a text-to-pictogram translation model built by fine-tuning the [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model on a dataset of transcription / pictogram-token-sequence pairs, where each token is linked to a pictogram image from [ARASAAC](https://arasaac.org/).
The model is intended for **inference** only.

## Training details

### Datasets

The model was fine-tuned on a set of 4 training datasets:
- [Propicto-commonvoice dataset](https://www.ortolang.fr/market/corpora/propicto), created from the CommonVoice v15.0 corpus.
- [Propicto-orfeo dataset](https://www.ortolang.fr/market/corpora/propicto), created from the CEFC-orféo corpus.
- Propicto-tedx dataset, created from the French part of the Multilingual TEDx corpus.
- Propicto-polylexical, a dataset built from scratch with sentences and pictogram translations containing polylexical terms (used only during training, to augment the data).

All the datasets were built with the method presented in the research paper ["A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation"](https://aclanthology.org/2024.lrec-main.76/), published at LREC-COLING 2024. Each dataset was split into training, validation, and test sets.

| **Corpus** | **train** | **valid** | **test** |
|:--------------------:|:-----------:|:----------:|:----------:|
| Propicto-commonvoice | 527,390 | 16,124 | 16,120 |
| Propicto-orfeo | 231,374 | 28,796 | 29,009 |
| Propicto-tedx | 85,106 | 749 | 804 |
| Propicto-polylexical | 1,462 | - | - |
| **TOTAL** | **845,332** | **45,669** | **45,933** |

### Parameters

A full list of the parameters is available in the config.json file. These are the arguments used in the training pipeline:

```python
training_args = Seq2SeqTrainingArguments(
    output_dir="checkpoints_corpus_v2/",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=40,
    predict_with_generate=True,
    fp16=True,
    load_best_model_at_end=True,
)
```

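For reference, here is a minimal sketch of how these arguments plug into a `Seq2SeqTrainer`. The tokenized `train`/`valid` splits and the choice of base checkpoint are assumptions for illustration, not details taken from this model card:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
)

# Base checkpoint that was fine-tuned (per the model description above).
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,             # the Seq2SeqTrainingArguments above
    train_dataset=tokenized_train,  # hypothetical tokenized Propicto train split
    eval_dataset=tokenized_valid,   # hypothetical tokenized Propicto valid split
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```
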
### Evaluation

The model was evaluated with [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu/blob/d94719691d29f7adf7151c8b1471de579a78a280/sacrebleu.py), comparing the model hypotheses against the reference pictogram translations.

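A minimal sketch of this scoring with the Hugging Face `evaluate` library (the hypothesis and reference strings here are illustrative, not real model outputs):

```python
import evaluate

# Load the sacreBLEU metric.
sacrebleu = evaluate.load("sacrebleu")

# Illustrative pictogram-token sequences.
hypotheses = ["je manger pomme"]
references = [["je manger pomme"]]  # one list of references per hypothesis

results = sacrebleu.compute(predictions=hypotheses, references=references)
print(results["score"])  # corpus-level BLEU on a 0-100 scale
```
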
### Results

| **Model** | **validation** | **test** |
|:-------------------------------:|:--------------:|:--------:|
| t2p-nllb-200-distilled-600M-all | 92.4 | - |

### Environmental Impact

Fine-tuning took a total of 8.5 hours on a single Nvidia V100 GPU with 32 GB of memory.

## Using the t2p-nllb-200-distilled-600M-all model with Hugging Face transformers

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

source_lang = "fr"   # source language: French text
target_lang = "frp"  # target language code used for the pictogram token sequences
max_input_length = 128
max_target_length = 128

# Run on GPU when available (the original snippet assumed "cuda:0").
device = "cuda:0" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("Propicto/t2p-nllb-200-distilled-600M-all")
model = AutoModelForSeq2SeqLM.from_pretrained("Propicto/t2p-nllb-200-distilled-600M-all").to(device)

inputs = tokenizer("Je mange une pomme", return_tensors="pt", max_length=max_input_length, truncation=True).input_ids
outputs = model.generate(inputs.to(device), max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)
pred = tokenizer.decode(outputs[0], skip_special_tokens=True)
```


## Linking the predicted sequence of tokens to the corresponding ARASAAC pictograms

```python
import pandas as pd

def process_output_trad(pred):
    """Split the predicted sentence into pictogram tokens."""
    return pred.split()

def read_lexicon(lexicon):
    """Read the tab-separated lexicon and normalize lemmas for lookup."""
    df = pd.read_csv(lexicon, sep='\t')
    df['keyword_no_cat'] = df['lemma'].str.split(' #').str[0].str.strip().str.replace(' ', '_')
    return df

def get_id_picto_from_predicted_lemma(df_lexicon, lemma):
    """Return the ARASAAC pictogram ID for a lemma, or 0 if it is not in the lexicon."""
    id_picto = df_lexicon.loc[df_lexicon['keyword_no_cat'] == lemma, 'id_picto'].tolist()
    return (id_picto[0], lemma) if id_picto else (0, lemma)

lexicon = read_lexicon("lexicon.csv")
sentence_to_map = process_output_trad(pred)
pictogram_ids = [get_id_picto_from_predicted_lemma(lexicon, lemma) for lemma in sentence_to_map]
```
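
`pictogram_ids` then holds one `(id_picto, lemma)` tuple per predicted token; entries with `id_picto == 0` mark lemmas that were not found in the lexicon and are skipped when rendering the pictograms, as shown below.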

## Viewing the predicted sequence of ARASAAC pictograms in an HTML file

```python
def generate_html(ids):
    """Build an HTML page showing each pictogram image with its lemma as caption."""
    html_content = '<html><body>'
    for picto_id, lemma in ids:
        if picto_id != 0:  # skip lemmas with no pictogram in the lexicon
            img_url = f"https://static.arasaac.org/pictograms/{picto_id}/{picto_id}_500.png"
            html_content += f'''
            <figure style="display:inline-block; margin:1px;">
                <img src="{img_url}" alt="{lemma}" width="200" height="200" />
                <figcaption>{lemma}</figcaption>
            </figure>
            '''
    html_content += '</body></html>'
    return html_content

html = generate_html(pictogram_ids)
with open("pictograms.html", "w", encoding="utf-8") as file:
    file.write(html)
```

## Information

- **Language(s):** French
- **License:** Apache-2.0
- **Developed by:** Cécile Macaire
- **Funded by:**
  - GENCI-IDRIS (Grant 2023-AD011013625R1)
  - PROPICTO ANR-20-CE93-0005
- **Authors:**
  - Cécile Macaire
  - Chloé Dion
  - Emmanuelle Esperança-Rodier
  - Benjamin Lecouteux
  - Didier Schwab