+---
+license: apache-2.0
+language:
+- fr
+library_name: transformers
+tags:
+- nllb
+- commonvoice
+- orfeo
+- tedx
+- pytorch
+- pictograms
+- translation
+metrics:
+- sacrebleu
+widget:
+- text: "je mange une pomme"
+  example_title: "A simple sentence"
+- text: "je ne pense pas à toi"
+  example_title: "Sentence with a negation"
+- text: "il y a 2 jours, les gendarmes ont vérifié ma licence"
+  example_title: "Sentence with a polylexical term"
+---
+# t2p-nllb-200-distilled-600M-all
+*t2p-nllb-200-distilled-600M-all* is a text-to-pictogram translation model built by fine-tuning the [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model on a dataset of paired transcriptions and pictogram token sequences (each token is linked to a pictogram image from [ARASAAC](https://arasaac.org/)).
+The model is used only for **inference**.
+## Training details
+### Datasets
+The model was fine-tuned on four training datasets:
+- [Propicto-commonvoice dataset](https://www.ortolang.fr/market/corpora/propicto), which was created from the CommonVoice v.15.0 corpus.
+- [Propicto-orfeo dataset](https://www.ortolang.fr/market/corpora/propicto), which was created from the CEFC-orféo corpus.
+- Propicto-tedx dataset, which was created from the French part of the Multilingual TEDx corpus.
+- Propicto-polylexical, a dataset built from scratch with sentences and pictogram translations containing polylexical terms (only used for training to augment the data).
+All the datasets were built with the method presented in the research paper "[A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation](https://aclanthology.org/2024.lrec-main.76/)" (LREC-COLING 2024). The data was split into training, validation, and test sets:
+| **Corpus** | **train** |  **valid** |  **test** |
+|:-----------:|:-------:|:-------:|:-------:|
+| Propicto-commonvoice | 527,390 | 16,124 | 16,120 |
+| Propicto-orfeo | 231,374 | 28,796 | 29,009 |
+| Propicto-tedx | 85,106 | 749 | 804 |
+| Propicto-polylexical | 1,462 | - | - |
+|**TOTAL** | **845,332** | **45,669** | **45,933** |
+### Parameters
+A full list of the parameters is available in the config.json file. These are the arguments used in the training pipeline:
+```python
+from transformers import Seq2SeqTrainingArguments
+training_args = Seq2SeqTrainingArguments(
+    output_dir="checkpoints_corpus_v2/",
+    evaluation_strategy="epoch",
+    save_strategy="epoch",
+    learning_rate=2e-5,
+    per_device_train_batch_size=32,
+    per_device_eval_batch_size=32,
+    weight_decay=0.01,
+    save_total_limit=3,
+    num_train_epochs=40,
+    predict_with_generate=True,
+    fp16=True,
+    load_best_model_at_end=True
+)
+```
+### Evaluation
+The model was evaluated with [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu/blob/d94719691d29f7adf7151c8b1471de579a78a280/sacrebleu.py), where we compared the reference pictogram translation with the model hypothesis.
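To make the comparison of references and hypotheses concrete, here is a minimal pure-Python sketch of a corpus-level BLEU score. It is a simplified stand-in rather than sacreBLEU itself: it assumes whitespace-separated pictogram tokens, one reference per hypothesis, and no smoothing, so its scores may differ from those reported by sacreBLEU. The function names and the two example sequences are invented for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_order=4):
    """Simplified corpus-level BLEU on a 0-100 scale (single reference, no smoothing)."""
    matches = [0] * max_order
    totals = [0] * max_order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_order + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipped matches: an n-gram counts at most as often as it occurs in the reference
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, a missing n-gram order gives a score of zero
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_order
    # The brevity penalty lowers the score when hypotheses are shorter than references
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Invented token sequences, not actual model output
references = ["hier je manger un pomme rouge"]
hypotheses = ["hier je manger un pomme"]
score = corpus_bleu(hypotheses, references)
```

In this example every n-gram of the hypothesis appears in the reference, so the score is reduced only by the brevity penalty for the missing final token.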
+### Results
+| **Model** | **validation** | **test** |
+|:-----------:|:-----------------------:|:-----------------------:|
+| t2p-nllb-200-distilled-600M-all | 92.4 | - |
+### Environmental Impact
+Fine-tuning was performed using a single Nvidia V100 GPU with 32 GB of memory, which took 8.5 hours in total.
+## Using the t2p-nllb-200-distilled-600M-all model with Hugging Face Transformers
+```python
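+import torch
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+# Defined for reference; the calls below do not use these four values
+source_lang = "fr"
+target_lang = "frp"
+max_input_length = 128
+max_target_length = 128
+# Run on the GPU when one is available, otherwise on the CPU
+device = "cuda:0" if torch.cuda.is_available() else "cpu"
+tokenizer = AutoTokenizer.from_pretrained("Propicto/t2p-nllb-200-distilled-600M-all")
+model = AutoModelForSeq2SeqLM.from_pretrained("Propicto/t2p-nllb-200-distilled-600M-all").to(device)
+# Tokenize the French sentence and keep the model and inputs on the same device
+inputs = tokenizer("Je mange une pomme", return_tensors="pt").input_ids.to(device)
+# Sampling (do_sample=True) makes the output vary between runs
+outputs = model.generate(inputs, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)
+# Decode the generated IDs into a space-separated sequence of pictogram tokens
+pred = tokenizer.decode(outputs[0], skip_special_tokens=True)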
+```
+## Linking the predicted sequence of tokens to the corresponding ARASAAC pictograms
+```python
+import pandas as pd
+def process_output_trad(pred):
+    # Split the decoded prediction into individual pictogram tokens
+    return pred.split()
+def read_lexicon(lexicon):
+    # The lexicon is a tab-separated file with at least 'lemma' and 'id_picto' columns
+    df = pd.read_csv(lexicon, sep='\t')
+    # Drop the ' #...' suffix from each lemma and join multiword lemmas with underscores
+    df['keyword_no_cat'] = df['lemma'].str.split(' #').str[0].str.strip().str.replace(' ', '_')
+    return df
+def get_id_picto_from_predicted_lemma(df_lexicon, lemma):
+    id_picto = df_lexicon.loc[df_lexicon['keyword_no_cat'] == lemma, 'id_picto'].tolist()
+    # Take the first matching ID; 0 marks a lemma that is not in the lexicon
+    return (id_picto[0], lemma) if id_picto else (0, lemma)
+lexicon = read_lexicon("lexicon.csv")
+sentence_to_map = process_output_trad(pred)
+pictogram_ids = [get_id_picto_from_predicted_lemma(lexicon, lemma) for lemma in sentence_to_map]
+```
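The lookup above scans the whole lexicon for every predicted token. When many sentences are mapped, a dictionary built once may be more convenient. The following is a minimal standard-library sketch of that idea, assuming the same tab-separated layout and the same lemma normalisation as `read_lexicon`. The helper names, the sample rows, and the ID values are hypothetical placeholders, not real ARASAAC identifiers.

```python
import csv
import io

# Hypothetical two-row lexicon in the tab-separated layout read_lexicon expects;
# the id_picto values are placeholders, not real ARASAAC identifiers.
SAMPLE_LEXICON = "lemma\tid_picto\nmanger #V\t111\npomme de terre #N\t222\n"


def build_lemma_index(tsv_file):
    """Map each normalised lemma to the first pictogram ID listed for it."""
    index = {}
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        # Same normalisation as read_lexicon: drop the ' #...' suffix, underscores for spaces
        key = row["lemma"].split(" #")[0].strip().replace(" ", "_")
        # setdefault keeps the first ID seen, mirroring id_picto[0] in the pandas version
        index.setdefault(key, int(row["id_picto"]))
    return index


def map_prediction(index, pred):
    """Return (id, lemma) pairs, using 0 for lemmas missing from the lexicon."""
    return [(index.get(lemma, 0), lemma) for lemma in pred.split()]


index = build_lemma_index(io.StringIO(SAMPLE_LEXICON))
pairs = map_prediction(index, "manger pomme_de_terre inconnu")
```

The resulting `pairs` list has the same `(id, lemma)` shape as `pictogram_ids`, so it can be passed to the HTML step unchanged.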
+## Viewing the predicted sequence of ARASAAC pictograms in an HTML file
+```python
+def generate_html(ids):
+    html_content = '<html><body>'
+    for picto_id, lemma in ids:
+        if picto_id != 0:  # ignore invalid IDs
+            img_url = f"https://static.arasaac.org/pictograms/{picto_id}/{picto_id}_500.png"
+            html_content += f'''
+            <figure style="display:inline-block; margin:1px;">
+                <img src="{img_url}" alt="{lemma}" width="200" height="200" />
+                <figcaption>{lemma}</figcaption>
+            </figure>
+            '''
+    html_content += '</body></html>'
+    return html_content
+html = generate_html(pictogram_ids)
+with open("pictograms.html", "w", encoding="utf-8") as file:
+    file.write(html)
+```
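Because the lemmas come from model output and are written straight into the page, it can be safer to escape them first. The sketch below is one possible variant of `generate_html`, not part of the original pipeline: it reuses the same ARASAAC URL pattern and `(id, lemma)` input, escapes each lemma with the standard-library `html.escape`, and declares the page encoding. The function names and the sample IDs are hypothetical.

```python
from html import escape

# Same URL pattern as generate_html above
ARASAAC_URL = "https://static.arasaac.org/pictograms/{id}/{id}_500.png"


def figure_for(picto_id, lemma):
    """Build one <figure> element, escaping the lemma for use in HTML."""
    safe_lemma = escape(lemma, quote=True)
    img_url = ARASAAC_URL.format(id=picto_id)
    return (
        '<figure style="display:inline-block; margin:1px;">'
        f'<img src="{img_url}" alt="{safe_lemma}" width="200" height="200" />'
        f"<figcaption>{safe_lemma}</figcaption>"
        "</figure>"
    )


def generate_html_escaped(ids):
    """Render (id, lemma) pairs as an HTML page, skipping pairs whose ID is 0."""
    figures = [figure_for(picto_id, lemma) for picto_id, lemma in ids if picto_id != 0]
    return (
        '<html><head><meta charset="utf-8"></head><body>'
        + "".join(figures)
        + "</body></html>"
    )


# Placeholder IDs for illustration only, not real ARASAAC identifiers
page = generate_html_escaped([(111, "pomme"), (0, "inconnu"), (222, "a<b")])
```

The returned string can be written to disk in the same way as the output of `generate_html`.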
+## Information
+- **Language(s):** French
+- **License:** Apache-2.0
+- **Developed by:** Cécile Macaire
+- **Funded by**
+  - GENCI-IDRIS (Grant 2023-AD011013625R1)
+  - PROPICTO ANR-20-CE93-0005
+- **Authors**
+  - Cécile Macaire
+  - Chloé Dion
+  - Emmanuelle Esperança-Rodier
+  - Benjamin Lecouteux
+  - Didier Schwab