Given some arabic text, the goal is to classify it into one of 21 labels:
*(table of the 21 country labels; only the row for Sudan is visible in this excerpt)*
Some countries don't have many observations, which means that their dialects might be harder to detect. We need to take this into consideration when training and evaluating the model (by assigning class weights or oversampling, and by choosing appropriate evaluation metrics; see the sketch after the plots below):
![distribution of train labels](images/train_labels.png)
![distribution of test labels](images/test_labels.png)
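To make the weighting option concrete, here is a minimal sketch using scikit-learn's `LogisticRegression` with `class_weight="balanced"`, scored with a macro-averaged F1. The data and variable names are synthetic stand-ins for illustration, not the project's actual training code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic stand-in data: in the real project these would be the tweet
# embeddings and the 21 country labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 64)), rng.integers(0, 21, 1000)
X_test, y_test = rng.normal(size=(200, 64)), rng.integers(0, 21, 200)

# class_weight="balanced" reweights each class inversely to its frequency,
# so the rare dialects contribute to the loss as much as the frequent ones.
# (Oversampling the rare classes would be the alternative route.)
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Macro-averaged F1 gives every dialect equal weight in the score, so a model
# that only gets the frequent dialects right cannot hide behind accuracy.
print(f1_score(y_test, clf.predict(X_test), average="macro"))
```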
For the first iteration, we will convert the tweets into vector embeddings using …
We get the following results:

**Logistic Regression**

- Train set:
    - Accuracy: 0.3448095238095238
    - F1 macro average: 0.30283202516650803
    - F1 weighted average: 0.35980803167526537
- Test set:
    - Accuracy: 0.2324
    - F1 macro average: 0.15894661492139023
    - F1 weighted average: 0.2680459740545796
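The six numbers above correspond to scikit-learn's standard `accuracy_score` and `f1_score`. A sketch of how they could be computed, reusing the hypothetical `clf` and data splits from the sketch further up:

```python
from sklearn.metrics import accuracy_score, f1_score

# Assumes clf, X_train, y_train, X_test, y_test from the earlier sketch.
for name, X, y in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    pred = clf.predict(X)
    print(f"{name} set:")
    print(f"  Accuracy: {accuracy_score(y, pred)}")
    # Macro: unweighted mean of the per-dialect F1 scores; weighted: mean
    # weighted by each dialect's support in this split.
    print(f"  F1 macro average: {f1_score(y, pred, average='macro')}")
    print(f"  F1 weighted average: {f1_score(y, pred, average='weighted')}")
```

The gap between the macro and weighted F1 on the test set is itself a symptom of the imbalance: the frequent dialects are classified better than the rare ones.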
We see that the model struggles to correctly classify the different dialects (which makes sense, because everything is in arabic at the end of the day). Let's have a look at the confusion matrix.

![Confusion Matrix](images/iteration1_cm.png)

From the confusion matrix, we see that the model is only really able to detect Egyptian arabic, and to a lesser extent Iraqi and Algerian.
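A plot like this can be reproduced with scikit-learn's built-in display. Again just a sketch, reusing the hypothetical `clf` and test split from above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Assumes clf, X_test, y_test from the earlier sketch. Normalizing each row
# shows per-dialect recall, so rare dialects stay readable next to frequent ones.
ConfusionMatrixDisplay.from_estimator(
    clf, X_test, y_test, normalize="true", xticks_rotation="vertical"
)
plt.show()
```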
## 3. Conclusion

It is hard to classify the arabic dialects with a simple approach such as a multinomial logistic regression trained on top of those vector embeddings. One potential reason could be a limitation of the training dataset: because of the way it was collected, each tweet is labeled according to the location it was posted from rather than the actual content of the text. As a result, it is possible that many of the tweets labeled as the dialect of a given country in reality contain Modern Standard Arabic, another dialect, or even a mix of dialects, which are all common ways arabic is used on social media.