metadata

title: Arabic Dialect Classifier
emoji: 🐪
colorFrom: yellow
colorTo: yellow
sdk: docker
app_port: 8080
license: mit
pinned: false

Arabic Dialect Classifier

This project classifies Arabic dialects at the country level:
given some Arabic text, the goal is to predict the country of the text's dialect.

Link to the Demo

Run the app locally with Docker:

Clone the repository with Git:

git clone https://github.com/zaidmehdi/arabic-dialect-classifier.git

Build the Docker image:

sudo docker build -t adc .

Run the Docker container:

sudo docker run -p 8080:8080 adc

Now you can access the demo locally at:

http://localhost:8080

How I built this project:

The data used to train the classifier comes from the NADI 2021 dataset for Arabic Dialect Identification (Abdul-Mageed et al., 2021).
It is a corpus of tweets collected using Twitter's API and labeled with a country and region based on the users' locations.

In the current version, I fine-tuned the language model https://huggingface.co/moussaKam/AraBART by attaching a classification head to it and freezing the weights of the base model (due to compute constraints), so only the head is trained:

(classification_head): MBartClassificationHead(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.0, inplace=False)
    (out_proj): Linear(in_features=768, out_features=21, bias=True)
)
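As a rough illustration of this setup, the sketch below freezes a base model and trains only a head with the same layers as the one printed above. The `TinyEncoder` stand-in, the mean pooling, the learning rate, and the random batch are assumptions made so the snippet runs without downloading AraBART; the actual training code in this repository may differ.

```python
import torch
from torch import nn

HIDDEN_SIZE = 768  # matches in_features of the printed head
NUM_LABELS = 21    # one label per country


class ClassificationHead(nn.Module):
    """Head with the same layers as the printed MBartClassificationHead."""

    def __init__(self, hidden_size: int, num_labels: int, dropout: float = 0.0):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(p=dropout)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        x = self.dropout(features)
        x = torch.tanh(self.dense(x))  # tanh between the two linear layers
        x = self.dropout(x)
        return self.out_proj(x)


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for the pretrained base model (not AraBART)."""

    def __init__(self, vocab_size: int = 100, hidden_size: int = HIDDEN_SIZE):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Mean-pool token embeddings into one vector per text (an assumption).
        return self.embed(input_ids).mean(dim=1)


encoder = TinyEncoder()
head = ClassificationHead(HIDDEN_SIZE, NUM_LABELS)

# Freeze the base model: its weights receive no gradient updates.
for param in encoder.parameters():
    param.requires_grad = False

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

# One training step on a random toy batch (4 texts, 16 token ids each).
input_ids = torch.randint(0, 100, (4, 16))
labels = torch.randint(0, NUM_LABELS, (4,))
with torch.no_grad():
    features = encoder(input_ids)
logits = head(features)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()

trainable = sum(p.numel() for p in head.parameters() if p.requires_grad)
print(f"trainable head parameters: {trainable}")
```

With these layer sizes, the head has 768*768 + 768 + 768*21 + 21 = 606,741 trainable parameters, which is all that gets updated when the base model is frozen.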

The model classifies any input text into one of the 21 countries in the dialects dataset. Currently, it achieves an accuracy of 0.3466 on the test set.
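To put that number in context, here is a minimal sketch of how accuracy is computed and how it compares with picking one of the 21 countries uniformly at random. The `accuracy` helper and the toy labels are illustrative assumptions, not code or data from this repository. If the classes are imbalanced, a majority-class baseline would be higher than the uniform one shown here.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


NUM_COUNTRIES = 21
uniform_chance = 1 / NUM_COUNTRIES  # expected accuracy of a uniform random guess
reported_accuracy = 0.3466          # test accuracy stated in this README

# Toy example (made-up labels, only to show the computation).
toy_labels = ["Egypt", "Iraq", "Morocco", "Egypt"]
toy_predictions = ["Egypt", "Iraq", "Egypt", "Syria"]

print(accuracy(toy_predictions, toy_labels))          # 0.5
print(round(uniform_chance, 4))                       # 0.0476
print(round(reported_accuracy / uniform_chance, 1))   # 7.3
```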

For more details, you can refer to the docs directory.

Releases

v0.0.2

In the second release, I fine-tuned the language model https://huggingface.co/moussaKam/AraBART by attaching a classification head to it and freezing the weights of the base model (due to compute constraints):

(classification_head): MBartClassificationHead(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.0, inplace=False)
    (out_proj): Linear(in_features=768, out_features=21, bias=True)
)

Accuracy achieved on test set: 0.3466

v0.0.1

In the first release, I used the language model https://huggingface.co/moussaKam/AraBART to extract features from the input text by taking the output of its last hidden layer. I used these vector embeddings as the input to a multinomial logistic regression that classifies the input text into one of the 21 dialects (countries).

Accuracy achieved on test set: 0.2324
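The prediction step of a multinomial logistic regression on top of fixed embeddings can be sketched as follows. This is a toy illustration under stated assumptions: the 3-dimensional embedding, the weights, and the function names are made up, whereas the v0.0.1 pipeline would use embeddings from AraBART's last hidden layer and learned weights over 21 classes.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(embedding, weights, biases):
    """One linear score per class, followed by a softmax."""
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Toy values (hypothetical): a 3-dimensional embedding and 3 classes.
embedding = [0.2, -1.0, 0.5]
weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
biases = [0.0, 0.0, 0.0]

probs = predict_proba(embedding, weights, biases)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(predicted_class)  # 2, the class with the largest score
```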

References:

Abdul-Mageed et al., 2021
Title: NADI 2021: The Second Nuanced Arabic Dialect Identification Shared Task
Authors: Abdul-Mageed, Muhammad; Zhang, Chiyu; Elmadany, AbdelRahim; Bouamor, Houda; Habash, Nizar
Year: 2021
Conference/Book Title: Proceedings of the Sixth Arabic Natural Language Processing Workshop (WANLP 2021)