---
language:
- om
- am
- rw
- rn
- ha
- ig
- pcm
- so
- sw
- ti
- yo
- multilingual
datasets:
---
# AfriBERTa_small

## Model description

AfriBERTa small is a pretrained multilingual language model with around 97 million parameters.

The model has 4 layers, 6 attention heads, 768 hidden units, and a feed-forward size of 3072.

The model was pretrained on 11 African languages, namely: Afaan Oromoo (also called Oromo), Amharic, Gahuza (a mixed language containing Kinyarwanda and Kirundi), Hausa, Igbo, Nigerian Pidgin, Somali, Swahili, Tigrinya and Yorùbá.

The model has been shown to obtain competitive downstream performance on text classification and Named Entity Recognition on several African languages, including languages it was not pretrained on.
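
These architecture details can be read directly from the checkpoint's configuration. The snippet below is a minimal sketch; it assumes the standard Transformers config attribute names, and the printed values should correspond to the numbers quoted above:

```python
>>> from transformers import AutoConfig

>>> config = AutoConfig.from_pretrained("castorini/afriberta_small")
>>> # layers, attention heads, hidden size, feed-forward (intermediate) size
>>> print(config.num_hidden_layers, config.num_attention_heads)
>>> print(config.hidden_size, config.intermediate_size)
```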
## Intended uses & limitations

#### How to use

You can use this model with Transformers for any downstream task.

For example, assuming we want to finetune this model on a token classification task, we do the following:
```python
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification

>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriberta_small")
>>> model = AutoModelForTokenClassification.from_pretrained("castorini/afriberta_small")
```
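Loading this base checkpoint with `AutoModelForTokenClassification` attaches a newly initialized classification head, so the model must still be finetuned on labelled data before its predictions are meaningful. As a minimal, purely illustrative sketch of running the loaded model on raw text (the example sentence and the default two-label head are assumptions, not something prescribed by this card):

```python
>>> import torch

>>> text = "Habari za asubuhi"  # illustrative Swahili input
>>> inputs = tokenizer(text, return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> predicted_label_ids = logits.argmax(dim=-1)  # one (still untrained) label id per subword token
```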
#### Limitations and bias

This model is possibly limited by its training dataset, which consists mainly of news articles from a specific span of time. Thus, it may not generalize well.
## Training data

The model was trained on an aggregation of datasets from the BBC news website and Common Crawl.
## Training procedure

For information on training procedures, please refer to the AfriBERTa [paper]() or [repository](https://github.com/keleog/afriberta).
### BibTeX entry and citation info

```bibtex
@inproceedings{ogueji-etal-2021-small,
  title     = {Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages},
  author    = {Ogueji, Kelechi and Zhu, Yuxin and Lin, Jimmy},
  booktitle = {Proceedings of the 1st Workshop on Multilingual Representation Learning at EMNLP 2021},
  year      = {2021},
}
```