---
library_name: transformers
license: cc-by-4.0
datasets:
- HuggingFaceFW/fineweb
- castorini/wura
language:
- am
- en
---
# AmhT5 Tokenizer
A T5 tokenizer trained for the Amharic language.
The tokenizer has a fertility rate of 1.8328 (average number of subword tokens per word).
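A minimal sketch of how fertility can be measured; the whitespace word splitting and the toy sample below are illustrative assumptions, not the evaluation setup behind the reported 1.8328:

```python
from transformers import MT5TokenizerFast

tokenizer = MT5TokenizerFast.from_pretrained("yonas/AmhT5-tokenizer", legacy=False)

def fertility(texts):
    """Average number of subword tokens per whitespace-separated word."""
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

# Toy sample; the reported 1.8328 comes from a much larger evaluation set.
sample = [
    "αŠ¨αˆ˜α‹²αŠ“α‹‹ α‰ α‰…αˆ­α‰₯ αˆ­α‰€α‰΅ αˆ‹α‹­ α‰ αˆα‰΅αŒˆαŠ˜α‹ αŠ¨α‰°αˆ›",
    "A Tokenizer trained for Amharic language.",
]
print(f"fertility: {fertility(sample):.4f}")
```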
## Model Details
### Model Description
An MT5Tokenizer-based Amharic and English tokenizer trained on the [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [Wura](https://huggingface.co/datasets/castorini/wura) datasets.
The goal is a tokenizer that represents Amharic better than the stock mT5 tokenizer while staying just as effective for English.
To keep the data balanced, I used only 3 million document samples. The vocabulary size is the same as that of `google/mt5-small`.
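For reference, a rough, untested sketch of how a tokenizer like this could be retrained with `train_new_from_iterator`; the dataset config, column names, and the 1.5M-documents-per-source split are assumptions, since this card does not specify the exact recipe:

```python
from datasets import load_dataset
from transformers import MT5TokenizerFast

base = MT5TokenizerFast.from_pretrained("google/mt5-small", legacy=False)

# Stream both corpora; the "amh" config and column names are assumptions.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
wura = load_dataset("castorini/wura", "amh", split="train", streaming=True)

def iter_texts(ds, limit):
    for i, row in enumerate(ds):
        if i >= limit:
            break
        yield row.get("text") or row.get("content", "")

def training_corpus(n_per_source=1_500_000):  # ~3M documents total; even split assumed
    yield from iter_texts(fineweb, n_per_source)
    yield from iter_texts(wura, n_per_source)

# Retrain with the same vocabulary size as google/mt5-small.
new_tok = base.train_new_from_iterator(training_corpus(), vocab_size=base.vocab_size)
new_tok.save_pretrained("AmhT5-tokenizer")
```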
### MT5 Tokenizer vs. AmhT5 Tokenizer
```python
from transformers import MT5TokenizerFast

# Stock multilingual mT5 tokenizer, for comparison
mt5 = "google/mt5-small"
TOKENIZER = MT5TokenizerFast.from_pretrained(mt5, legacy=False)
tokens = TOKENIZER.tokenize("αŠ¨αˆ˜α‹²αŠ“α‹‹ α‰ α‰…αˆ­α‰₯ αˆ­α‰€α‰΅ αˆ‹α‹­ α‰ αˆα‰΅αŒˆαŠ˜α‹ αŠ¨α‰°αˆ›")
print(len(tokens)) # 20
print(tokens)
# ['β–αŠ¨αˆ˜', 'α‹²', 'αŠ“', 'α‹‹', '▁በ', 'α‰…αˆ­', 'α‰₯', '▁', 'ር', 'ቀ', 'ቡ', '▁', 'αˆ‹α‹­', 'β–α‰ αˆ', 'ቡ', 'ገ', 'ኘ', 'ው', 'β–αŠ¨α‰°', 'αˆ›']
tokens = TOKENIZER.tokenize("A Tokenizer trained for Amharic language.")
print(len(tokens)) # 11
print(tokens)
# ['▁A', '▁', 'Token', 'izer', '▁train', 'ed', '▁for', '▁Am', 'haric', '▁language', '.']
# AmhT5 tokenizer, trained on Amharic and English text
amhT5 = "yonas/AmhT5-tokenizer"
TOKENIZER = MT5TokenizerFast.from_pretrained(amhT5, legacy=False)
tokens = TOKENIZER.tokenize("αŠ¨αˆ˜α‹²αŠ“α‹‹ α‰ α‰…αˆ­α‰₯ αˆ­α‰€α‰΅ αˆ‹α‹­ α‰ αˆα‰΅αŒˆαŠ˜α‹ αŠ¨α‰°αˆ›")
print(len(tokens)) # 11
print(tokens)
# ['β–αŠ¨', 'αˆ˜α‹²αŠ“', 'α‹‹', '▁በ', 'α‰…αˆ­α‰₯', '▁', 'αˆ­α‰€α‰΅', 'β–αˆ‹α‹­', 'β–α‰ αˆα‰΅', 'αŒˆαŠ˜α‹', 'β–αŠ¨α‰°αˆ›']
tokens = TOKENIZER.tokenize("A Tokenizer trained for Amharic language.")
print(len(tokens)) # 7
print(tokens)
# ['▁A', '▁Token', 'izer', '▁trained', '▁for', '▁Amharic', '▁language.']
```
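On these two examples, the AmhT5 tokenizer cuts the Amharic sentence from 20 tokens to 11 and the English sentence from 11 tokens to 7, producing longer, more word-like Amharic pieces than the stock mT5 tokenizer.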