Trained on 20x More Tokens than Previous Iterations
Byte Fallback BPE Tokenizer
- Trained using google/sentencepiece (SPM)
- Vocab Size: 72808
Training Args
```python
import sentencepiece as spm
from datasets import load_dataset

def get_corpus_iterator():
    dataset = load_dataset("fhai50032/pds-tk-specific-2", split="train")
    shuffled = dataset.shuffle(seed=42)
    for text in shuffled["text"]:
        stripped = text.strip()
        if stripped:
            # Split long documents into 8192-char chunks so every yielded
            # piece fits under max_sentence_length below.
            for i in range(0, len(stripped), 8192):
                yield stripped[i : i + 8192]

# tokenizer_name, vocab_size and num_threads are defined elsewhere.
spm.SentencePieceTrainer.train(
    sentence_iterator=get_corpus_iterator(),
    model_prefix=tokenizer_name,
    vocab_size=vocab_size,
    num_threads=num_threads,
    model_type="bpe",
    max_sentence_length=8192,
    character_coverage=1.0,
    byte_fallback=True,
    shuffle_input_sentence=True,
    remove_extra_whitespaces=False,
    normalization_rule_name="identity",
)
```
Special Tokens
{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}
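SentencePiece reserves the lowest IDs for these specials by default (unk=0, bos=1, eos=2), which is why every encoded sequence in the examples below starts with ID 1 for `<s>`. A minimal sketch (the helper name is hypothetical, and it assumes the SPM defaults were kept):

```python
# Default SentencePiece special-token IDs (assumption: the defaults
# unk=0, bos=1, eos=2 were not overridden at training time).
SPECIAL_IDS = {"<unk>": 0, "<s>": 1, "</s>": 2}

def wrap_with_specials(piece_ids, add_eos=False):
    """Hypothetical helper: prepend <s>, optionally append </s>."""
    ids = [SPECIAL_IDS["<s>"]] + list(piece_ids)
    if add_eos:
        ids.append(SPECIAL_IDS["</s>"])
    return ids
```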
Training Composition:
- Maths: 550 M (aluncstokes/mathpile_arxiv_subset)
- Code: 800 M (codeparrot/github-code)
- Hinglish: 250 M (Abhishekcr448/Hinglish-Everyday-Conversations-1M, Maihar/hinglish-80k)
- English: 2,000 M (allenai/c4, "en" config)
- Hindi: 2,200 M (aloobun/dhpileIN, data_dir='hi')
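The budgets above sum to 5.8 B tokens. If the mix were sampled proportionally to these budgets, the per-source probabilities would work out as below (a sketch derived from the stated numbers, not the actual sampling code):

```python
# Token budgets in millions, taken from the composition list above.
budgets_m = {
    "maths": 550,
    "code": 800,
    "hinglish": 250,
    "english": 2000,
    "hindi": 2200,
}
total_m = sum(budgets_m.values())  # 5800 M, i.e. 5.8 B tokens
probs = {k: v / total_m for k, v in budgets_m.items()}
# Hindi makes up about 38% of the mix, English about 34%.
```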
Evals
Tokenization Efficiency (total token count per test set; lower is better)
| Tokenizer | English | Hindi | Tamil | Bengali | Malayalam | Telugu | Gujarati | Punjabi | Code_Python | Code_Java | C++ | Math |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-R1 (128k) | 338874 | 22855 | 48957 | 39617 | 73928 | 40345 | 101020 | 79172 | 5231 | 2224 | 7055 | 5376 |
| unsloth/phi-4 (100k) | 308645 | 40456 | 59750 | 116122 | 149889 | 48689 | 118335 | 87413 | 4809 | 2110 | 6529 | 5573 |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B (128k) | 308512 | 21110 | 59625 | 115138 | 149883 | 48661 | 118061 | 86765 | 4809 | 2111 | 6530 | 5574 |
| unsloth/gemma-2-9b-it (256k) | 323335 | 15916 | 53913 | 53402 | 57219 | 47610 | 107925 | 87222 | 5948 | 2569 | 8639 | 5871 |
| Ornaments/72k-Bilingual-BBPE-TK-SPM (72k) (Old) | 366710 | 11447 | 61408 | 94191 | 97207 | 50229 | 117874 | 90045 | 8201 | 4000 | 13706 | 5585 |
| **Ornaments/72k-Bilingual-BBPE-TK-SPM-Identity (72k)** | 330830 | 10318 | 59089 | 93740 | 92655 | 44975 | 109411 | 87922 | 7819 | 3743 | 12953 | 5253 |
| Ornaments/72k-TK-BBPE-HF (72k) | 321274 | 10813 | 67585 | 159985 | 193813 | 55654 | 134397 | 97063 | 5225 | 2263 | 7090 | 5150 |
| nvidia/Nemotron-4-Mini-Hindi-4B-Instruct (256k) | 332271 | 14327 | 55473 | 36615 | 45783 | 48270 | 160115 | 117174 | 6186 | 2732 | 8861 | 6136 |
| sarvamai/OpenHathi-7B-Hi-v0.1-Base (48k) | 370133 | 15633 | 67845 | 120340 | 105953 | 68315 | 159122 | 113817 | 6595 | 2792 | 9233 | 6223 |
| sarvamai/sarvam-1 (68k) | 385386 | 11257 | 61396 | 27348 | 31822 | 51463 | 119666 | 103344 | 7331 | 3068 | 9724 | 6864 |
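Each cell above is a total token count over a fixed per-language test file. A comparison like this can be reproduced with a small harness; in the sketch below the whitespace tokenizer and sample texts are stand-ins (assumptions) for the real tokenizers' `.encode()` and the actual test sets:

```python
def count_tokens(tokenize, samples):
    """Total token count per test sample; lower means the tokenizer
    compresses that language/domain into fewer pieces."""
    return {name: len(tokenize(text)) for name, text in samples.items()}

# Stand-in tokenizer and texts, for illustration only.
samples = {"English": "the cat sat", "Code_Python": "def f(x): return x"}
counts = count_tokens(str.split, samples)
```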
Encode-Decode
- Hindi
Input : ऋतुराज गायकवाड़ (कप्तान), डेवोन कॉनवे, रचिन रविंद्र, राहुल त्रिपाठी, शिवम दुबे, रविंद्र जडेजा, एमएस धोनी (विकेटकीपर), आर अश्विन, मीथाशा पथिराना, खलील अहमद, नूर अहमद।
Tokens: ['▁ऋतुराज', '▁गायकवाड़', '▁(', 'क', 'प्तान', '),', '▁डे', 'व', 'ोन', '▁कॉ', 'न', 'वे', ',', '▁रच', 'िन', '▁रविंद्र', ',', '▁राहुल', '▁त्रिपाठी', ',', '▁शिवम', '▁दुबे', ',', '▁रविंद्र', '▁जडेजा', ',', '▁एमएस', '▁धोनी', '▁(', 'व', 'िकेट', 'कीपर', '),', '▁आर', '▁अश्विन', ',', '▁मी', 'था', 'शा', '▁पथ', 'िर', 'ाना', ',', '▁ख', 'लील', '▁अहमद', ',', '▁नूर', '▁अहमद', '।']
Encoded: [1, 34862, 26967, 435, 61725, 29148, 1099, 1945, 61754, 1769, 1777, 61735, 981, 61750, 18114, 465, 19049, 61750, 4310, 11042, 61750, 21132, 13133, 61750, 19049, 13624, 61750, 18436, 12473, 435, 61754, 1956, 14572, 1099, 2208, 17618, 61750, 3352, 2063, 2500, 12020, 731, 781, 61750, 429, 13121, 9490, 61750, 26786, 9490, 61770]
Len Tokens 51
Decoded: <s> ऋतुराज गायकवाड़ (कप्तान), डेवोन कॉनवे, रचिन रविंद्र, राहुल त्रिपाठी, शिवम दुबे, रविंद्र जडेजा, एमएस धोनी (विकेटकीपर), आर अश्विन, मीथाशा पथिराना, खलील अहमद, नूर अहमद।
- English
Input : Bangalore and Chennai have faced each other in 33 matches in IPL. Out of these 33 games, Bangalore have won 11 whereas Chennai have come out victorious on 21 occasion. 1 match ended without a result.
Tokens: ['▁Bangalore', '▁and', '▁Chennai', '▁have', '▁faced', '▁each', '▁other', '▁in', '▁33', '▁matches', '▁in', '▁IPL', '.', '▁Out', '▁of', '▁these', '▁33', '▁games', ',', '▁Bangalore', '▁have', '▁won', '▁11', '▁whereas', '▁Chennai', '▁have', '▁come', '▁out', '▁vict', 'orious', '▁on', '▁21', '▁occasion', '.', '▁1', '▁match', '▁ended', '▁without', '▁a', '▁result', '.']
Encoded: [1, 43579, 317, 42140, 607, 21626, 1872, 1022, 313, 7736, 14838, 313, 9863, 61751, 7363, 319, 1517, 7736, 4837, 61750, 43579, 607, 4817, 1730, 22734, 42140, 607, 2968, 811, 9594, 30189, 395, 3209, 13423, 61751, 385, 5083, 13623, 2675, 262, 1773, 61751]
Len Tokens 42
Decoded: <s> Bangalore and Chennai have faced each other in 33 matches in IPL. Out of these 33 games, Bangalore have won 11 whereas Chennai have come out victorious on 21 occasion. 1 match ended without a result.
- Math
Input : % Change the font if you want to, depending on whether
% you're using pdflatex or xelatex/lualatex
% WHEN COMPILING WITH XELATEX PLEASE USE
% xelatex -shell-escape -output-driver="xdvipdfmx -z 0" sample.tex
\iftutex
% If using xelatex or lualatex:
\setmainfont{Roboto Slab}
\setsansfont{Lato}
\renewcommand{\familydefault}{\sfdefault}
\else
% If using pdflatex:
\usepackage[rm]{roboto}
\usepackage[defaultsans]{lato}
% \usepackage{sourcesanspro}
\renewcommand{\familydefault}{\sfdefault}
\fi
Tokens: ['▁%', '▁Change', '▁the', '▁font', '▁if', '▁you', '▁want', '▁to', ',', '▁depending', '▁on', '▁whether', '\n', '%', '▁you', "'", 're', '▁using', '▁pd', 'fl', 'ate', 'x', '▁or', '▁x', 'el', 'ate', 'x', '/', 'l', 'ual', 'ate', 'x', '\n', '%', '▁WH', 'EN', '▁COMP', 'IL', 'ING', '▁WITH', '▁X', 'EL', 'ATE', 'X', '▁PLEASE', '▁USE', '\n', '%', '▁x', 'el', 'ate', 'x', '▁-', 'shell', '-', 'escape', '▁-', 'output', '-', 'driver', '="', 'xd', 'v', 'ip', 'df', 'mx', '▁-', 'z', '▁0', '"', '▁sample', '.', 'tex', '\n\\', 'ift', 'ut', 'ex', '\n', '▁', '▁%', '▁If', '▁using', '▁x', 'el', 'ate', 'x', '▁or', '▁l', 'ual', 'ate', 'x', ':\n', '▁', '▁\\', 'set', 'main', 'font', '{', 'Rob', 'oto', '▁Sl', 'ab', '}\n', '▁', '▁\\', 'sets', 'ans', 'font', '{', 'L', 'ato', '}\n', '▁', '▁\\', 'renew', 'command', '{\\', 'family', 'default', '}{\\', 'sf', 'default', '}\n\\', 'else', '\n', '▁', '▁%', '▁If', '▁using', '▁pd', 'fl', 'ate', 'x', ':\n', '▁', '▁\\', 'us', 'ep', 'ack', 'age', '[', 'rm', ']{', 'rob', 'oto', '}\n', '▁', '▁\\', 'us', 'ep', 'ack', 'age', '[', 'defaults', 'ans', ']{', 'l', 'ato', '}\n', '▁', '▁%', '▁\\', 'us', 'ep', 'ack', 'age', '{', 'sources', 'ans', 'pro', '}\n', '▁', '▁\\', 'renew', 'command', '{\\', 'family', 'default', '}{\\', 'sf', 'default', '}\n\\', 'fi', '\n']
Encoded: [1, 2920, 20717, 273, 9731, 686, 380, 1570, 306, 61750, 11224, 395, 4910, 61755, 61863, 380, 61809, 265, 2138, 32887, 2673, 442, 61792, 506, 1894, 335, 442, 61792, 61804, 61729, 1204, 442, 61792, 61755, 61863, 19877, 1920, 27670, 4809, 3922, 25404, 2470, 8534, 6586, 61859, 61046, 22326, 61755, 61863, 1894, 335, 442, 61792, 777, 34320, 61780, 35727, 777, 9020, 61780, 16819, 696, 25014, 61762, 947, 7497, 35801, 777, 61831, 612, 61798, 10079, 61751, 8032, 1207, 2865, 388, 1096, 61755, 61715, 2920, 1608, 2138, 1894, 335, 442, 61792, 506, 334, 1204, 442, 61792, 1025, 61715, 426, 1106, 6972, 5295, 61782, 32606, 5896, 11751, 403, 499, 61715, 426, 8105, 770, 5295, 61782, 61811, 10464, 499, 61715, 426, 48033, 10843, 557, 16861, 6694, 4661, 5157, 6694, 1495, 7019, 61755, 61715, 2920, 1608, 2138, 32887, 2673, 442, 61792, 1025, 61715, 426, 379, 953, 697, 626, 61846, 1628, 6219, 35451, 5896, 499, 61715, 426, 379, 953, 697, 626, 61846, 41970, 770, 6219, 61729, 10464, 499, 61715, 2920, 426, 379, 953, 697, 626, 61782, 43733, 770, 1194, 499, 61715, 426, 48033, 10843, 557, 16861, 6694, 4661, 5157, 6694, 1495, 12885, 61755]
Len Tokens 185
Decoded: <s> % Change the font if you want to, depending on whether
% you're using pdflatex or xelatex/lualatex
% WHEN COMPILING WITH XELATEX PLEASE USE
% xelatex -shell-escape -output-driver="xdvipdfmx -z 0" sample.tex
\iftutex
% If using xelatex or lualatex:
\setmainfont{Roboto Slab}
\setsansfont{Lato}
\renewcommand{\familydefault}{\sfdefault}
\else
% If using pdflatex:
\usepackage[rm]{roboto}
\usepackage[defaultsans]{lato}
% \usepackage{sourcesanspro}
\renewcommand{\familydefault}{\sfdefault}
\fi
- Code
Input : class SentencePieceUnigramTokenizer(BaseTokenizer):
"""SentencePiece Unigram Tokenizer
Represents the Unigram algorithm, with the pretokenization used by SentencePiece
"""
def __init__(
self,
vocab: Optional[List[Tuple[str, float]]] = None,
replacement: str = "▁",
add_prefix_space: bool = True,
):
if vocab is not None:
# Let Unigram(..) fail if only one of them is None
tokenizer = Tokenizer(Unigram(vocab))
else:
tokenizer = Tokenizer(Unigram())
Tokens: ['▁class', '▁Sentence', 'P', 'iece', 'Un', 'ig', 'ram', 'Token', 'izer', '(', 'Base', 'Token', 'izer', '):\n', '▁', '▁', '▁', '▁"""', 'Sentence', 'P', 'iece', '▁Un', 'ig', 'ram', '▁Token', 'izer', '\n\n', '▁', '▁', '▁', '▁Rep', 'resents', '▁the', '▁Un', 'ig', 'ram', '▁algorithm', ',', '▁with', '▁the', '▁pre', 'token', 'ization', '▁used', '▁by', '▁Sentence', 'P', 'iece', '\n', '▁', '▁', '▁', '▁"""\n\n', '▁', '▁', '▁', '▁def', '▁__', 'init', '__', '(\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁self', ',\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁voc', 'ab', ':', '▁Optional', '[', 'List', '[', 'Tuple', '[', 'str', ',', '▁float', ']]', ']', '▁=', '▁None', ',\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁replacement', ':', '▁str', '▁=', '▁"', '▁"', ',\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁add', '_', 'prefix', '_', 'space', ':', '▁bool', '▁=', '▁True', ',\n', '▁', '▁', '▁', '▁):\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁if', '▁voc', 'ab', '▁is', '▁not', '▁None', ':\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁#', '▁Let', '▁Un', 'ig', 'ram', '(', '..', ')', '▁fail', '▁if', '▁only', '▁one', '▁of', '▁them', '▁is', '▁None', '\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁token', 'izer', '▁=', '▁Token', 'izer', '(', 'Un', 'ig', 'ram', '(', 'voc', 'ab', '))\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁else', ':\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁token', 'izer', '▁=', '▁Token', 'izer', '(', 'Un', 'ig', 'ram', '())']
Encoded: [1, 946, 40517, 61803, 15343, 4952, 367, 1163, 12331, 7159, 61776, 9859, 12331, 7159, 2454, 61715, 61715, 61715, 7606, 59192, 61803, 15343, 2426, 367, 1163, 35304, 7159, 962, 61715, 61715, 61715, 4784, 13312, 273, 2426, 367, 1163, 10857, 61750, 437, 273, 1184, 13584, 2854, 1815, 597, 40517, 61803, 15343, 61755, 61715, 61715, 61715, 23853, 61715, 61715, 61715, 1178, 3771, 2999, 1390, 3488, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 1349, 622, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 26775, 403, 61799, 21329, 61846, 3412, 61846, 39253, 61846, 2572, 61750, 10030, 17241, 61847, 440, 4673, 622, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 13564, 61799, 1944, 440, 591, 591, 622, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 1173, 61768, 17279, 61768, 5529, 61799, 7165, 440, 8620, 622, 61715, 61715, 61715, 39708, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 686, 26775, 403, 366, 618, 4673, 1025, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 1478, 4593, 2426, 367, 1163, 61776, 786, 61775, 6272, 686, 1323, 882, 319, 1136, 366, 4673, 61755, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 14205, 7159, 440, 35304, 7159, 61776, 4952, 367, 1163, 61776, 44969, 403, 3630, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 2335, 1025, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 14205, 7159, 440, 35304, 7159, 61776, 4952, 367, 1163, 8170]
Len Tokens 226
Decoded: <s> class SentencePieceUnigramTokenizer(BaseTokenizer):
"""SentencePiece Unigram Tokenizer
Represents the Unigram algorithm, with the pretokenization used by SentencePiece
"""
def __init__(
self,
vocab: Optional[List[Tuple[str, float]]] = None,
replacement: str = " ",
add_prefix_space: bool = True,
):
if vocab is not None:
# Let Unigram(..) fail if only one of them is None
tokenizer = Tokenizer(Unigram(vocab))
else:
tokenizer = Tokenizer(Unigram())
- Emoji
Input : 😜🫤☹️😖🤢🤮😇🐻❄️🦄🐾🐽🐍🦞🦐🦿🤴🧑🦲👨🚒👨🚀
Tokens: ['▁', '😜', '<0xF0>', '<0x9F>', '<0xAB>', '<0xA4>', '☹', '️', '😖', '🤢', '🤮', '😇', '<0xF0>', '<0x9F>', '<0x90>', '<0xBB>', '\u200d', '❄', '️', '🦄', '🐾', '<0xF0>', '<0x9F>', '<0x90>', '<0xBD>', '🐍', '<0xF0>', '<0x9F>', '<0xA6>', '<0x9E>', '<0xF0>', '<0x9F>', '<0xA6>', '<0x90>', '<0xF0>', '<0x9F>', '<0xA6>', '<0xBF>', '<0xF0>', '<0x9F>', '<0xA4>', '<0xB4>', '<0xF0>', '<0x9F>', '<0xA7>', '<0x91>', '\u200d', '<0xF0>', '<0x9F>', '<0xA6>', '<0xB2>', '👨', '\u200d', '<0xF0>', '<0x9F>', '<0x9A>', '<0x92>', '👨', '\u200d', '🚀']
Encoded: [1, 61715, 64694, 243, 162, 174, 167, 66250, 62096, 68719, 68725, 70665, 68209, 243, 162, 147, 190, 62658, 66107, 62096, 70672, 69452, 243, 162, 147, 192, 66921, 243, 162, 169, 161, 243, 162, 169, 147, 243, 162, 169, 194, 243, 162, 167, 183, 243, 162, 170, 148, 62658, 243, 162, 169, 181, 66362, 62658, 243, 162, 157, 149, 66362, 62658, 62748]
Len Tokens 61
Decoded: <s> 😜🫤☹️😖🤢🤮😇🐻❄️🦄🐾🐽🐍🦞🦐🦿🤴🧑🦲👨🚒👨🚀
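The `<0xNN>` pieces above are byte fallback in action: a character absent from the 72k vocabulary (e.g. 🫤) is emitted as its raw UTF-8 bytes instead of `<unk>`, so decoding remains lossless. A minimal sketch of that decomposition:

```python
def byte_fallback_pieces(ch: str):
    """Render a character the way SentencePiece byte fallback does:
    one <0xNN> piece per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]

# 🫤 (U+1FAE4) is out-of-vocabulary, so it falls back to four byte
# pieces, matching the Tokens line above.
pieces = byte_fallback_pieces("\U0001FAE4")
```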
- Sanskrit
Input : ॐ त्र्यम्बकं यजामहे सुगन्धिं पुष्टिवर्धनम् उर्वारुकमिव बन्धनान्मृत्योर्मुक्षीय मामृतात् ॐ.
Tokens: ['▁ॐ', '▁त्र', '्यम', '्ब', 'क', 'ं', '▁य', 'जाम', 'हे', '▁सुग', 'न्', 'धि', 'ं', '▁पुष्ट', 'िव', 'र्', 'धन', 'म्', '▁उर्', 'वार', 'ुक', 'म', 'िव', '▁ब', 'न्', 'धन', 'ान', '्म', 'ृत', '्यो', 'र्म', 'ु', 'क्ष', 'ीय', '▁माम', 'ृता', 'त्', '▁ॐ', '.', '▁']
Encoded: [1, 29916, 5202, 1347, 9931, 61725, 61738, 411, 8036, 11065, 38930, 1063, 3255, 61738, 61196, 816, 326, 4408, 3315, 10236, 527, 1908, 61742, 816, 289, 1063, 4408, 325, 640, 1803, 4364, 911, 61763, 536, 2770, 940, 27338, 592, 29916, 61751, 61715]
Len Tokens 41
Decoded: <s> ॐ त्र्यम्बकं यजामहे सुगन्धिं पुष्टिवर्धनम् उर्वारुकमिव बन्धनान्मृत्योर्मुक्षीय मामृतात् ॐ.