update readme
README.md
The data used to train the classifier comes from the NADI 2021 dataset for Arabic Dialect Identification [(Abdul-Mageed et al., 2021)](#cite-mageed-2021).
It is a corpus of tweets collected through Twitter's API and labeled with the country and region inferred from each user's location.

In the current version, I finetuned the language model `https://huggingface.co/moussaKam/AraBART` by attaching a classification head to it and freezing the weights of the base model (due to compute constraints):
```
(classification_head): MBartClassificationHead(
  (dense): Linear(in_features=768, out_features=768, bias=True)
  (dropout): Dropout(p=0.0, inplace=False)
  (out_proj): Linear(in_features=768, out_features=21, bias=True)
)
```
The model classifies any input text into one of the 21 countries represented in the dialects dataset.
Currently, it achieves an accuracy of 0.3466 on the test set.
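
Concretely, here is a minimal sketch of this setup, assuming the Hugging Face `transformers` library: load AraBART with a sequence-classification head, freeze the base model, and leave only the head trainable. The training loop, optimizer, and data handling are omitted; this is an assumed reconstruction, not the exact code behind the release.

```python
# Minimal sketch (assumed reconstruction, not the release's exact code):
# AraBART with a classification head on top, base weights frozen.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "moussaKam/AraBART"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=21)  # one label per country in the dataset

# Freeze the base MBart model; only the classification head
# (dense -> dropout -> out_proj) keeps requires_grad=True.
for param in model.model.parameters():
    param.requires_grad = False

print([n for n, p in model.named_parameters() if p.requires_grad])
# -> only classification_head.dense.* and classification_head.out_proj.*
```

With the base frozen, only the head's parameters (roughly 0.6M weights, from the `Linear` shapes above) receive gradient updates, which is what makes training feasible under the compute constraints mentioned above.
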
For more details, you can refer to the docs directory.
## Releases
### v0.0.2
In the second release, I finetuned the language model `https://huggingface.co/moussaKam/AraBART` by attaching a classification head to it and freezing the weights of the base model (due to compute constraints):
```
(classification_head): MBartClassificationHead(
  (dense): Linear(in_features=768, out_features=768, bias=True)
  (dropout): Dropout(p=0.0, inplace=False)
  (out_proj): Linear(in_features=768, out_features=21, bias=True)
)
```
**Accuracy achieved on test set: 0.3466**
### v0.0.1
In the first release, I used the language model `https://huggingface.co/moussaKam/AraBART` to extract features from the input text by taking the output of its last hidden layer. I used these vector embeddings as the input to a multinomial logistic regression that classifies the input text into one of the 21 dialects (countries).

**Accuracy achieved on test set: 0.2324**
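
For comparison with the current finetuning approach, here is a minimal sketch of this first pipeline, again an assumed reconstruction: AraBART acts as a frozen feature extractor, its last hidden state is pooled into one vector per text (mean pooling is an assumption; the release only states that the last hidden layer's output was used), and a scikit-learn logistic regression is fit on top. `train_texts` and `train_labels` are hypothetical placeholders for the NADI training split.

```python
# Minimal sketch of the v0.0.1 pipeline (assumed reconstruction).
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("moussaKam/AraBART")
encoder = AutoModel.from_pretrained("moussaKam/AraBART").eval()

def embed(texts):
    # Pool the last hidden state over non-padding tokens,
    # giving one 768-dim embedding per input text.
    batch = tokenizer(texts, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state   # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1)
    return ((hidden * mask).sum(dim=1) / mask.sum(dim=1)).numpy()

# Hypothetical placeholders for the NADI training split.
train_texts = ["مرحبا كيف الحال", "وش اخبارك اليوم"]  # example tweets
train_labels = [0, 1]  # country ids in [0, 20]

# With the real 21-class labels, the default lbfgs solver fits a
# multinomial model over the classes.
clf = LogisticRegression(max_iter=1000)
clf.fit(embed(train_texts), train_labels)
print(clf.predict(embed(["شلونك اليوم"])))
```
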
## References:
- <a name="cite-mageed-2021"></a>