zaidmehdi committed
Commit 226fd48
1 Parent(s): a0089f9

update readme

Files changed (1)
  1. README.md +24 -2
README.md CHANGED
@@ -26,14 +26,36 @@ http://localhost:8080
The data used to train the classifier comes from the NADI 2021 dataset for Arabic Dialect Identification [(Abdul-Mageed et al., 2021)](#cite-mageed-2021).
It is a corpus of tweets collected using Twitter's API and labeled with the country and region based on the users' locations.

- In the current version, I used the language model `https://huggingface.co/moussaKam/AraBART` to extract features from the input text by taking the output of its last hidden layer. I used these vector embeddings as the input for a multinomial logistic regression to classify the input text into one of the 21 dialects (countries).
+ In the current version, I finetuned the language model `https://huggingface.co/moussaKam/AraBART` by attaching a classification head to it and freezing the weights of the base model (due to compute constraints):
+ ```
+ (classification_head): MBartClassificationHead(
+   (dense): Linear(in_features=768, out_features=768, bias=True)
+   (dropout): Dropout(p=0.0, inplace=False)
+   (out_proj): Linear(in_features=768, out_features=21, bias=True)
+ )
+ ```
+ The model classifies any input text into one of the 21 countries present in the dialects dataset.
+ Currently, it achieves an accuracy of 0.3466 on the test set.
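
As an illustration, here is a minimal sketch of what this setup could look like with the `transformers` library; the specific classes (`AutoModelForSequenceClassification`) and the manual `requires_grad` freezing are assumptions, not necessarily the exact code used in this repository:

```python
# Sketch only (assumed API usage): load AraBART with a 21-way classification
# head and freeze every parameter outside the head before training.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "moussaKam/AraBART"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=21)

# Freeze the base model; only the classification head stays trainable.
for name, param in model.named_parameters():
    if "classification_head" not in name:
        param.requires_grad = False

# Quick check: run a forward pass and count the trainable parameters.
inputs = tokenizer("مثال على تغريدة بالعربية", return_tensors="pt")
print(model(**inputs).logits.shape)  # torch.Size([1, 21])
print(sum(p.numel() for p in model.parameters() if p.requires_grad))
```

Freezing the base model leaves only the head's two linear layers (plus dropout) to train, which is consistent with the compute-constraint rationale above.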

For more details, you can refer to the docs directory.

## Releases
+ ### v0.0.2
+ In the second release, I finetuned the language model `https://huggingface.co/moussaKam/AraBART` by attaching a classification head to it and freezing the weights of the base model (due to compute constraints):
+ ```
+ (classification_head): MBartClassificationHead(
+   (dense): Linear(in_features=768, out_features=768, bias=True)
+   (dropout): Dropout(p=0.0, inplace=False)
+   (out_proj): Linear(in_features=768, out_features=21, bias=True)
+ )
+ ```
+ **Accuracy achieved on the test set: 0.3466**
+
+
### v0.0.1
In the first release, I used the language model `https://huggingface.co/moussaKam/AraBART` to extract features from the input text by taking the output of its last hidden layer. I used these vector embeddings as the input for a multinomial logistic regression to classify the input text into one of the 21 dialects (countries).
- ### v0.0.2
+
+ **Accuracy achieved on the test set: 0.2324**
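
For comparison, a rough sketch of that first-release pipeline, assuming the encoder's last hidden layer is mean-pooled into one vector per tweet and scikit-learn's `LogisticRegression` is used as the classifier (both are assumptions about details the README does not spell out):

```python
# Sketch only: AraBART embeddings -> multinomial logistic regression.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

model_name = "moussaKam/AraBART"
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name).get_encoder()  # encoder half of the seq2seq model

def embed(texts):
    """Mean-pool the encoder's last hidden layer into one 768-d vector per text."""
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state   # (batch, seq_len, 768)
        mask = batch["attention_mask"].unsqueeze(-1)  # zero out padding positions
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Toy stand-in for the NADI tweets and their country labels.
texts = ["تغريدة أولى", "تغريدة ثانية", "تغريدة ثالثة"]
labels = ["Morocco", "Egypt", "Saudi_Arabia"]

clf = LogisticRegression(max_iter=1000)  # fits a softmax (multinomial) model for >2 classes
clf.fit(embed(texts), labels)
print(clf.predict(embed(["تغريدة جديدة"])))
```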

## References:
  - <a name="cite-mageed-2021"></a>