zaidmehdi committed on
Commit f9a4b3a
1 Parent(s): 9d26e62

update docs

Files changed (1)
  1. docs/classifier_model.md +15 -15
docs/classifier_model.md CHANGED
@@ -30,7 +30,8 @@ Given some arabic text, the goal is to classify it into one of 21 labels:
 - Sudan
 
 
-Some countries don't have a lot of observations, which means that it might be harder to detect their dialects. We need to take this into consideration when training and evaluating the model (by assigning weights/oversampling, and by choosing appropriate evaluation metrics):
+Some countries don't have a lot of observations, which means that it might be harder to detect their dialects. We need to take this into consideration when training and evaluating the model (by assigning weights/oversampling, and by choosing appropriate evaluation metrics):
+
 ![distribution of train labels](images/train_labels.png)
 ![distribution of test labels](images/test_labels.png)
 
@@ -39,24 +40,23 @@ For the first iteration, we will convert the tweets into vector embeddings using
 
 We get the following results:
 
-Logistic Regression
---------------------------------------------------
-Train set:
-Accuracy: 0.3448095238095238
-F1 macro average: 0.30283202516650803
-F1 weighted average: 0.35980803167526537
---------------------------------------------------
-Test set:
-Accuracy: 0.2324
-F1 macro average: 0.15894661492139023
-F1 weighted average: 0.2680459740545796
-
-We see that the model is struggling to correctly classify the different dialects, (which makes sense because everything is in arabic at the end of the day). Let's have a look at the confusion matrix.
+**Logistic Regression**
+- Train set:
+- Accuracy: 0.3448095238095238
+- F1 macro average: 0.30283202516650803
+- F1 weighted average: 0.35980803167526537
+- Test set:
+- Accuracy: 0.2324
+- F1 macro average: 0.15894661492139023
+- F1 weighted average: 0.2680459740545796
+
+We see that the model is struggling to correctly classify the different dialects, (which makes sense because everything is in arabic at the end of the day). Let's have a look at the confusion matrix.
+
 ![Confusion Matrix](images/iteration1_cm.png)
 
 From the confusion matrix, we see that the model is only really able to detect Egyptian arabic, and to a lesser extent Iraqi and Algerian.
 
 ## 3. Conclusion
 
-It is hard to classify the arabic dialects with a simple approach such as a multinomial logistic regression trained on top of those vector embeddings. One potential reason could be related to the limitations of the training dataset:
+It is hard to classify the arabic dialects with a simple approach such as a multinomial logistic regression trained on top of those vector embeddings. One potential reason could be related to the limitations of the training dataset:
 Due to the way it was collected, it is labeling the text because of the location of the tweet, instead of the actual content of the text. As a result, it is possible that a lot of the tweets labeled as a dialect of a given country, contain in reality some text in Modern Standard Arabic, or in another dialect, or also possibly in a mix of dialects, which are all common ways in which arabic is used on social media.
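
The changed doc asks for evaluation metrics that account for the label imbalance, and it reports both macro and weighted F1. The following is a minimal sketch of how the two averages differ, written in plain Python so it runs without dependencies. The toy labels and the helper names (`per_class_f1`, `f1_macro`, `f1_weighted`) are illustrative assumptions, not code or data from this repository.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def f1_weighted(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's count in y_true."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Hypothetical imbalanced toy set: a frequent label and a rare one.
y_true = ["Egypt"] * 8 + ["Sudan"] * 2
# A classifier that always predicts the frequent label.
y_pred = ["Egypt"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = f1_macro(y_true, y_pred)
weighted = f1_weighted(y_true, y_pred)

print(f"accuracy={accuracy:.3f} macro_f1={macro:.3f} weighted_f1={weighted:.3f}")
```

In this toy case accuracy and weighted F1 stay high because the frequent label dominates the support, while macro F1 drops because the rare label is never predicted. The reported test scores show a gap in the same direction (macro about 0.159 versus weighted about 0.268), although the sketch does not reproduce those numbers.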