AmhT5 Tokenizer

A T5 Tokenizer trained for the Amharic language.

The tokenizer has a fertility rate (average number of tokens per word) of 1.8328.
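As a rough illustration of what the fertility figure measures, here is a minimal sketch that divides the total token count by the number of whitespace-separated words. This card does not state the exact corpus or word-splitting rule behind 1.8328, so the whitespace split is an assumption, and `fertility` and `toy_tokenize` are hypothetical names used only for this example.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts each word into chunks of up to 3 characters."""
    return [word[i:i + 3] for word in text.split() for i in range(0, len(word), 3)]


sample = ["tokenizer trained for Amharic"]
print(fertility(sample, toy_tokenize))
```

To measure a real tokenizer, pass its `tokenize` method in place of `toy_tokenize`. A lower value means fewer tokens are needed per word.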

Notebook used for training: https://colab.research.google.com/drive/1B-pca9jpadTHz9WYTWXzPM-A1cTaltYo#scrollTo=wLslLc0D6TnA

Model Details

Model Description

An MT5Tokenizer-based tokenizer for Amharic and English, trained on the Fineweb and Wura datasets. It aims to represent Amharic better while still handling English well. To balance the dataset, I used only 3 million document samples from the dataset. The vocabulary size is the same as that of google/mt5-small.

MT5 Tokenizer Vs AmhT5 Tokenizer

from transformers import MT5TokenizerFast

mt5 = "google/mt5-small"

TOKENIZER = MT5TokenizerFast.from_pretrained(mt5, legacy=False)
tokens = TOKENIZER.tokenize("ከመዲናዋ በቅርብ ርቀት ላይ በምትገኘው ከተማ")

print(len(tokens)) # 20
print(tokens)
# ['▁ከመ', 'ዲ', 'ና', 'ዋ', '▁በ', 'ቅር', 'ብ', '▁', 'ር', 'ቀ', 'ት', '▁', 'ላይ', '▁በም', 'ት', 'ገ', 'ኘ', 'ው', '▁ከተ', 'ማ']


tokens = TOKENIZER.tokenize("A Tokenizer trained for Amharic language.")

print(len(tokens)) # 11
print(tokens)
# ['▁A', '▁', 'Token', 'izer', '▁train', 'ed', '▁for', '▁Am', 'haric', '▁language', '.']


amhT5 = "yonas/AmhT5-tokenizer"
TOKENIZER = MT5TokenizerFast.from_pretrained(amhT5, legacy=False)
tokens = TOKENIZER.tokenize("ከመዲናዋ በቅርብ ርቀት ላይ በምትገኘው ከተማ")

print(len(tokens)) # 11
print(tokens)
# ['▁ከ', 'መዲና', 'ዋ', '▁በ', 'ቅርብ', '▁', 'ርቀት', '▁ላይ', '▁በምት', 'ገኘው', '▁ከተማ']


tokens = TOKENIZER.tokenize("A Tokenizer trained for Amharic language.")

print(len(tokens)) # 7
print(tokens)
# ['▁A', '▁Token', 'izer', '▁trained', '▁for', '▁Amharic', '▁language.']
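The counts printed in the comparison can be turned into a relative saving per sentence. The sketch below uses only the four token counts reported in the snippets (20 vs 11 for the Amharic sentence, 11 vs 7 for the English one). The helper name `relative_reduction` is hypothetical, and two sentences are far too few to generalize from.

```python
# Token counts reported in the comparison: (google/mt5-small, AmhT5)
reported = {
    "Amharic sentence": (20, 11),
    "English sentence": (11, 7),
}


def relative_reduction(before: int, after: int) -> float:
    """Fraction of tokens saved when going from `before` to `after` tokens."""
    return (before - after) / before


for name, (before, after) in reported.items():
    saved = relative_reduction(before, after)
    print(f"{name}: {before} -> {after} tokens ({saved:.0%} fewer)")
```

On these two examples the AmhT5 tokenizer needs roughly 45% fewer tokens for the Amharic sentence and roughly 36% fewer for the English one.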