Given some arabic text, the goal is to classify it into one of 21 labels:
*(table of the 21 country labels; only the row for Sudan is visible in this excerpt)*
Some countries don't have many observations, which means that their dialects might be harder to detect. We need to take this into consideration when training and evaluating the model (by assigning class weights or oversampling, and by choosing appropriate evaluation metrics; see the sketch after the plots below):
![distribution of train labels](images/train_labels.png)
![distribution of test labels](images/test_labels.png)
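To make the weighting option concrete, here is a minimal sketch using scikit-learn's `LogisticRegression` with `class_weight="balanced"`, scored with a macro-averaged F1. The data and variable names are synthetic stand-ins for illustration, not the project's actual training code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic stand-in data: in the real project these would be the tweet
# embeddings and the 21 country labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 64)), rng.integers(0, 21, 1000)
X_test, y_test = rng.normal(size=(200, 64)), rng.integers(0, 21, 200)

# class_weight="balanced" reweights each class inversely to its frequency,
# so the rare dialects contribute to the loss as much as the frequent ones.
# (Oversampling the rare classes would be the alternative route.)
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Macro-averaged F1 gives every dialect equal weight in the score, so a model
# that only gets the frequent dialects right cannot hide behind accuracy.
print(f1_score(y_test, clf.predict(X_test), average="macro"))
```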
For the first iteration, we will convert the tweets into vector embeddings using …
We get the following results:

**Logistic Regression**

- Train set:
    - Accuracy: 0.3448095238095238
    - F1 macro average: 0.30283202516650803
    - F1 weighted average: 0.35980803167526537
- Test set:
    - Accuracy: 0.2324
    - F1 macro average: 0.15894661492139023
    - F1 weighted average: 0.2680459740545796
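The six numbers above correspond to scikit-learn's standard `accuracy_score` and `f1_score`. A sketch of how they could be computed, reusing the hypothetical `clf` and data splits from the sketch further up:

```python
from sklearn.metrics import accuracy_score, f1_score

# Assumes clf, X_train, y_train, X_test, y_test from the earlier sketch.
for name, X, y in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    pred = clf.predict(X)
    print(f"{name} set:")
    print(f"  Accuracy: {accuracy_score(y, pred)}")
    # Macro: unweighted mean of the per-dialect F1 scores; weighted: mean
    # weighted by each dialect's support in this split.
    print(f"  F1 macro average: {f1_score(y, pred, average='macro')}")
    print(f"  F1 weighted average: {f1_score(y, pred, average='weighted')}")
```

The gap between the macro and weighted F1 on the test set is itself a symptom of the imbalance: the frequent dialects are classified better than the rare ones.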
We see that the model struggles to correctly classify the different dialects (which makes sense, because everything is in arabic at the end of the day). Let's have a look at the confusion matrix.

![Confusion Matrix](images/iteration1_cm.png)

From the confusion matrix, we see that the model is only really able to detect Egyptian arabic, and to a lesser extent Iraqi and Algerian.
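A plot like this can be reproduced with scikit-learn's built-in display. Again just a sketch, reusing the hypothetical `clf` and test split from above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Assumes clf, X_test, y_test from the earlier sketch. Normalizing each row
# shows per-dialect recall, so rare dialects stay readable next to frequent ones.
ConfusionMatrixDisplay.from_estimator(
    clf, X_test, y_test, normalize="true", xticks_rotation="vertical"
)
plt.show()
```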
## 3. Conclusion

It is hard to classify the arabic dialects with a simple approach such as a multinomial logistic regression trained on top of those vector embeddings. One potential reason could be a limitation of the training dataset: because of the way it was collected, each tweet is labeled according to the location it was posted from rather than the actual content of the text. As a result, it is possible that many of the tweets labeled as the dialect of a given country in reality contain Modern Standard Arabic, another dialect, or even a mix of dialects, which are all common ways arabic is used on social media.