---
library_name: transformers
license: cc-by-4.0
datasets:
- HuggingFaceFW/fineweb
- castorini/wura
language:
- am
- en
---

# AmhT5 Tokenizer

A T5 tokenizer trained for the Amharic language.

The tokenizer has a fertility rate of 1.8328 (fertility is the average number of subword tokens produced per word; lower means text is encoded more compactly).
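
Fertility is typically measured by tokenizing a held-out corpus and dividing the total number of subword tokens by the total number of words. A minimal sketch of such a computation follows; the corpus behind the 1.8328 figure is not specified in this card, and the sample sentence below is purely illustrative:

```python
from transformers import MT5TokenizerFast

def fertility(tokenizer, texts):
    """Average number of subword tokens per whitespace-separated word."""
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

tok = MT5TokenizerFast.from_pretrained("yonas/AmhT5-tokenizer", legacy=False)
print(fertility(tok, ["ሰላም ለዓለም"]))  # tiny demo input, not the real evaluation set
```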

## Model Details

### Model Description

An MT5Tokenizer-based Amharic and English tokenizer trained on the [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [Wura](https://huggingface.co/datasets/castorini/wura) datasets.

The goal is a tokenizer that represents Amharic text more efficiently while remaining just as capable on English.

To balance the training data, I used only 3 million document samples. The vocabulary size of this tokenizer is the same as that of `google/mt5-small`.
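
The training script is not included in this card. The sketch below shows one plausible way to build such a tokenizer with `train_new_from_iterator`, which retrains the subword vocabulary while inheriting mT5's special tokens and processing pipeline; the Wura config name, the text column, and the per-dataset sample counts are assumptions, not the exact recipe used here.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Start from the original mT5 tokenizer so the vocabulary size and
# special tokens carry over.
base = AutoTokenizer.from_pretrained("google/mt5-small")

# Stream English and Amharic documents (the "amh" config name is an assumption).
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
wura = load_dataset("castorini/wura", "amh", split="train", streaming=True)

def texts(dataset, limit):
    for i, example in enumerate(dataset):
        if i >= limit:
            break
        yield example["text"]

def corpus():
    # 3M documents total, assumed to be split evenly across the two sources.
    yield from texts(fineweb, 1_500_000)
    yield from texts(wura, 1_500_000)

new_tok = base.train_new_from_iterator(corpus(), vocab_size=base.vocab_size)
new_tok.save_pretrained("AmhT5-tokenizer")
```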

### MT5 Tokenizer vs. AmhT5 Tokenizer

```python
from transformers import MT5TokenizerFast

# Baseline: the stock multilingual mT5 tokenizer
mt5 = "google/mt5-small"
TOKENIZER = MT5TokenizerFast.from_pretrained(mt5, legacy=False)

# Amharic example sentence
tokens = TOKENIZER.tokenize("α¨αα²αα α ααα₯ ααα΅ αα α αα΅ααα α¨α°α")

print(len(tokens))  # 20
print(tokens)
# ['βα¨α', 'α²', 'α', 'α', 'βα ', 'αα', 'α₯', 'β', 'α', 'α', 'α΅', 'β', 'αα', 'βα α', 'α΅', 'α', 'α', 'α', 'βα¨α°', 'α']

# English example sentence
tokens = TOKENIZER.tokenize("A Tokenizer trained for Amharic language.")

print(len(tokens))  # 11
print(tokens)
# ['▁A', '▁', 'Token', 'izer', '▁train', 'ed', '▁for', '▁Am', 'haric', '▁language', '.']

# The same sentences with the AmhT5 tokenizer
amhT5 = "yonas/AmhT5-tokenizer"
TOKENIZER = MT5TokenizerFast.from_pretrained(amhT5, legacy=False)
tokens = TOKENIZER.tokenize("α¨αα²αα α ααα₯ ααα΅ αα α αα΅ααα α¨α°α")

print(len(tokens))  # 11
print(tokens)
# ['βα¨', 'αα²α', 'α', 'βα ', 'ααα₯', 'β', 'ααα΅', 'βαα', 'βα αα΅', 'ααα', 'βα¨α°α']

tokens = TOKENIZER.tokenize("A Tokenizer trained for Amharic language.")

print(len(tokens))  # 7
print(tokens)
# ['▁A', '▁Token', 'izer', '▁trained', '▁for', '▁Amharic', '▁language.']
```
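
The tokenizer can be loaded and used like any other Hub tokenizer. A minimal usage sketch follows; the Amharic sample string is illustrative only:

```python
from transformers import MT5TokenizerFast

tokenizer = MT5TokenizerFast.from_pretrained("yonas/AmhT5-tokenizer", legacy=False)

# Encode a small batch to padded input IDs (T5 tokenizers append the </s> EOS token).
batch = tokenizer(["ሰላም ለዓለም", "Hello world"], padding=True, return_tensors="pt")
print(batch.input_ids)
print(tokenizer.batch_decode(batch.input_ids, skip_special_tokens=True))
```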