# Transferring Monolingual Model to Low-Resource Language: The Case of Tigrinya:
## Proposed Method:
The proposed method transfers a monolingual Transformer model to a new target language at the lexical level by learning new token embeddings. All implementations in this repo use XLNet as the source Transformer model; other Transformer models can be used in the same way.
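The idea of lexical-level transfer can be illustrated with a minimal PyTorch sketch. It is not the code from the notebooks: `TinyClassifier` is a toy stand-in for a pretrained Transformer, `swap_token_embeddings` is a hypothetical helper, and the random `target_vectors` stand in for token embeddings trained on the target language. Freezing the new embeddings is one possible choice made here for illustration.

```python
import torch
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Toy stand-in for a pretrained Transformer with a classification head."""

    def __init__(self, vocab_size: int, dim: int = 32, num_labels: int = 2):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=64, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(dim, num_labels)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(self.embeddings(input_ids))
        return self.classifier(hidden.mean(dim=1))  # mean-pool over tokens


def swap_token_embeddings(model, target_vectors, freeze=True):
    """Replace the source-language embedding table with target-language vectors."""
    if target_vectors.shape[1] != model.embeddings.embedding_dim:
        raise ValueError("target vectors must match the model's hidden size")
    model.embeddings = nn.Embedding.from_pretrained(
        target_vectors.clone(), freeze=freeze
    )
    return model


# Pretend this model was pretrained on the source language (100-token vocabulary).
source_model = TinyClassifier(vocab_size=100)

# Assumption: stand-in for target-language vectors (random here for illustration).
target_vectors = torch.randn(50, 32)

model = swap_token_embeddings(source_model, target_vectors, freeze=True)

# Only parameters that still require gradients are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

# Batch of 4 sequences with 7 target-language token ids each.
logits = model(torch.randint(0, 50, (4, 7)))
```

The encoder and classifier weights are carried over unchanged, so only the vocabulary-specific part of the model is replaced before fine-tuning on the target-language task.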
## Main files:
All files are IPython notebooks that can be run directly in Google Colab.
- train.ipynb : Fine-tunes XLNet (a monolingual Transformer) on the sentiment analysis dataset for the new target language (Tigrinya). [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bSSrKE-TSphUyrNB2UWhFI-Bkoz0a5l0?usp=sharing)
- test.ipynb : Evaluates the fine-tuned model on test data. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17R1lvRjxILVNk971vzZT79o2OodwaNIX?usp=sharing)
- token_embeddings.ipynb : Trains word2vec token embeddings for the Tigrinya language. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1hCtetAllAjBw28EVQkJFpiKdFtXmuxV7?usp=sharing)
- process_Tigrinya_comments.ipynb : Extracts Tigrinya comments from mixed-language content. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1-ndLlBV-iLZNBW3Z8OfKAqUUCjvGbdZU?usp=sharing)
- extract_YouTube_comments.ipynb : Downloads available comments from a YouTube channel ID. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1b7G85wHKe18y45JIDtvDJdO5dOkRmDdp?usp=sharing)
- auto_labelling.ipynb : Automatically labels Tigrinya comments as positive or negative based on [emoji sentiment](http://kt.ijs.si/data/Emoji_sentiment_ranking/). [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1wnZf7CBBCIr966vRUITlxKCrANsMPpV7?usp=sharing)
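The comment filtering and emoji-based labelling steps listed above can be sketched roughly as follows. This is not the notebooks' code: the script-ratio heuristic, the 0.5 threshold, the function names, and the emoji scores in `EMOJI_SENTIMENT` are all assumptions made for illustration (real scores would come from the Emoji Sentiment Ranking linked above).

```python
# Made-up emoji scores in [-1, 1], for illustration only.
EMOJI_SENTIMENT = {
    "😍": 0.7,
    "👍": 0.5,
    "😡": -0.6,
    "👎": -0.5,
}


def is_ethiopic(ch: str) -> bool:
    # Main Ethiopic block (U+1200 to U+137F). The script is shared with
    # Amharic and other languages, so this filters by script, not by language.
    return "\u1200" <= ch <= "\u137f"


def ethiopic_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are Ethiopic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_ethiopic(ch) for ch in letters) / len(letters)


def label_comment(text: str, min_ratio: float = 0.5):
    """Return 'positive', 'negative', or None if the comment is skipped."""
    if ethiopic_ratio(text) < min_ratio:
        return None  # mostly non-Ethiopic text: discard
    score = sum(EMOJI_SENTIMENT.get(ch, 0.0) for ch in text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return None  # no emoji signal: leave unlabelled


comments = ["ብጣዕሚ ኢና ነመስግን 😍", "እዛ ፍሊም 👎", "great video 😍", "ብጣዕሚ ኢና ነመስግን"]
labels = [label_comment(c) for c in comments]
```

Comments without any scored emoji are left unlabelled in this sketch rather than being forced into a class.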
## Tigrinya Tokenizer:
A [sentencepiece](https://github.com/google/sentencepiece)-based tokenizer for Tigrinya is publicly available and can be loaded as follows:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("abryee/TigXLNet")
tokenizer.tokenize("ዋዋዋው እዛ ፍሊም ካብተን ዘድንቀን ሓንቲ ኢያ ሞ ብጣዕሚ ኢና ነመስግን ሓንቲ ክብላ ደልየ ዘሎኹ ሓደራኣኹም ኣብ ጊዜኹም ተረክቡ")
```
## TigXLNet:
A new general-purpose Transformer model for the low-resource language Tigrinya has also been released and can be loaded as follows:
```python
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("abryee/TigXLNet")
config.d_head = 64
model = AutoModel.from_pretrained("abryee/TigXLNet", config=config)
```
## Evaluation:
The proposed method is evaluated using two datasets:
- A newly created sentiment analysis dataset for a low-resource language (Tigrinya).

| Models | Configuration | F1-Score |
|---|---|---|
| BERT | +Frozen BERT weights | 54.91 |
| BERT | +Random embeddings | 74.26 |
| BERT | +Frozen token embeddings | 76.35 |
| mBERT | +Frozen mBERT weights | 57.32 |
| mBERT | +Random embeddings | 76.01 |
| mBERT | +Frozen token embeddings | 77.51 |
| XLNet | +Frozen XLNet weights | 68.14 |
| XLNet | +Random embeddings | 77.83 |
| XLNet | +Frozen token embeddings | 81.62 |

- Cross-lingual Sentiment dataset ([CLS](https://zenodo.org/record/3251672#.Xs65VzozbIU)).

| Models | English Books | English DVD | English Music | German Books | German DVD | German Music | French Books | French DVD | French Music | Japanese Books | Japanese DVD | Japanese Music | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| XLNet | 92.90 | 93.31 | 92.02 | 85.23 | 83.30 | 83.89 | 73.05 | 69.80 | 70.12 | 83.20 | 86.07 | 85.24 | 83.08 |
| mBERT | 92.78 | 90.30 | 91.88 | 88.65 | 85.85 | 90.38 | 91.09 | 88.57 | 93.67 | 84.35 | 81.77 | 87.53 | 88.90 |

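
For readers unfamiliar with the F1-Score reported for the Tigrinya sentiment dataset, the following minimal sketch shows how a binary F1 score is computed from predictions. The README does not state which F1 variant was used, so treating one class as "positive" and the helper name `binary_f1` are assumptions for illustration; the labels below are toy data, not results from the paper.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 1 = positive sentiment, 0 = negative sentiment.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]

f1 = binary_f1(y_true, y_pred)
print(f"{100 * f1:.2f}")  # 66.67
```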
## Dataset used for this paper:
We constructed a new sentiment analysis dataset for the Tigrinya language; it is available in the zip file (Tigrinya Sentiment Analysis Dataset).
## Citing our paper:
Our paper is available on [arXiv](https://arxiv.org/pdf/2006.07698.pdf). Please consider citing our work:
```bibtex
@misc{tela2020transferring,
  title={Transferring Monolingual Model to Low-Resource Language: The Case of Tigrinya},
  author={Abrhalei Tela and Abraham Woubie and Ville Hautamaki},
  year={2020},
  eprint={2006.07698},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
## Questions, comments, and feedback are appreciated and can be sent to: abrhalei.tela@gmail.com