Trained on 20x More Tokens than Previous Iterations

Byte Fallback BPE Tokenizer

  • Trained using Google's SentencePiece (google/SPM)
  • Vocab size: 72808

Training Args



import sentencepiece as spm
from datasets import load_dataset

def get_corpus_iterator():
    dataset = load_dataset("fhai50032/pds-tk-specific-2", split="train")
    shuffled = dataset.shuffle(seed=42)
    for text in shuffled["text"]:
        stripped = text.strip()
        if stripped:
            # Split long documents into 8192-character chunks
            for i in range(0, len(stripped), 8192):
                yield stripped[i : i + 8192]

spm.SentencePieceTrainer.train(
    sentence_iterator=get_corpus_iterator(),
    model_prefix=tokenizer_name,         # output file prefix
    vocab_size=vocab_size,               # 72808
    num_threads=num_threads,
    model_type="bpe",
    max_sentence_length=8192,
    character_coverage=1.0,
    byte_fallback=True,
    shuffle_input_sentence=True,
    remove_extra_whitespaces=False,
    normalization_rule_name="identity",  # no NFKC normalization
)

Special Tokens

{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}
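With SentencePiece's default settings these special tokens get fixed ids at the head of the vocabulary (unk=0, bos=1, eos=2), which is consistent with every encoded sequence in this card starting with id 1 and every decoded string starting with `<s>`. A minimal sketch (the helper name is illustrative, not from the training script):

```python
# Default SentencePiece special-token ids (assumption from SPM defaults;
# consistent with all encoded sequences below starting with id 1 = <s>).
SPECIAL_TOKEN_IDS = {"<unk>": 0, "<s>": 1, "</s>": 2}

def prepend_bos(ids, bos_id=SPECIAL_TOKEN_IDS["<s>"]):
    """Prepend the BOS id, as the encoder does for every sequence below."""
    return [bos_id] + list(ids)
```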

Training Composition:

  • Maths: 550 M tokens (aluncstokes/mathpile_arxiv_subset)

  • Code: 800 M tokens (codeparrot/github-code)

  • Hinglish: 250 M tokens (Abhishekcr448/Hinglish-Everyday-Conversations-1M, Maihar/hinglish-80k)

  • English: 2000 M tokens (allenai/c4, "en")

  • Hindi: 2200 M tokens (aloobun/dhpileIN, data_dir="hi")
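The mixture above totals roughly 5.8 B tokens. Normalizing the per-source budgets into sampling weights can be sketched as follows (the names and helper are illustrative, not taken from the training script):

```python
# Token budgets per source, in millions (from the composition list above).
BUDGETS_M = {"maths": 550, "code": 800, "hinglish": 250, "english": 2000, "hindi": 2200}

def sampling_weights(budgets):
    """Normalize token budgets into sampling probabilities summing to 1."""
    total = sum(budgets.values())
    return {name: n / total for name, n in budgets.items()}
```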

Evals

Tokenization Efficiency (lower is better)

| Tokenizer | English | Hindi | Tamil | Bengali | Malayalam | Telugu | Gujarati | Punjabi | Python | Java | C++ | Math |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-R1 (128k) | 338874 | 22855 | 48957 | 39617 | 73928 | 40345 | 101020 | 79172 | 5231 | 2224 | 7055 | 5376 |
| unsloth/phi-4 (100k) | 308645 | 40456 | 59750 | 116122 | 149889 | 48689 | 118335 | 87413 | 4809 | 2110 | 6529 | 5573 |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B (128k) | 308512 | 21110 | 59625 | 115138 | 149883 | 48661 | 118061 | 86765 | 4809 | 2111 | 6530 | 5574 |
| unsloth/gemma-2-9b-it (256k) | 323335 | 15916 | 53913 | 53402 | 57219 | 47610 | 107925 | 87222 | 5948 | 2569 | 8639 | 5871 |
| Ornaments/72k-Bilingual-BBPE-TK-SPM (72k) (old) | 366710 | 11447 | 61408 | 94191 | 97207 | 50229 | 117874 | 90045 | 8201 | 4000 | 13706 | 5585 |
| **Ornaments/72k-Bilingual-BBPE-TK-SPM-Identity (72k) (this tokenizer)** | 330830 | 10318 | 59089 | 93740 | 92655 | 44975 | 109411 | 87922 | 7819 | 3743 | 12953 | 5253 |
| Ornaments/72k-TK-BBPE-HF (72k) | 321274 | 10813 | 67585 | 159985 | 193813 | 55654 | 134397 | 97063 | 5225 | 2263 | 7090 | 5150 |
| nvidia/Nemotron-4-Mini-Hindi-4B-Instruct (256k) | 332271 | 14327 | 55473 | 36615 | 45783 | 48270 | 160115 | 117174 | 6186 | 2732 | 8861 | 6136 |
| sarvamai/OpenHathi-7B-Hi-v0.1-Base (48k) | 370133 | 15633 | 67845 | 120340 | 105953 | 68315 | 159122 | 113817 | 6595 | 2792 | 9233 | 6223 |
| sarvamai/sarvam-1 (68k) | 385386 | 11257 | 61396 | 27348 | 31822 | 51463 | 119666 | 103344 | 7331 | 3068 | 9724 | 6864 |
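Each cell is the total token count a tokenizer produces on a fixed per-language corpus, so fewer tokens means better compression. A minimal sketch of the measurement, assuming Hugging Face tokenizers (the corpus variable and tokenizer choice are illustrative):

```python
def count_tokens(tokenize, texts):
    """Total token count over a corpus for one tokenizer; lower is better."""
    return sum(len(tokenize(t)) for t in texts)

# With transformers installed, this would be used as e.g.:
#   tok = AutoTokenizer.from_pretrained("Ornaments/72k-Bilingual-BBPE-TK-SPM-Identity")
#   count_tokens(tok.tokenize, hindi_corpus)
```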

Encode-Decode

  • Hindi
Input  : ऋतुराज गायकवाड़ (कप्तान), डेवोन कॉनवे, रचिन रविंद्र, राहुल त्रिपाठी, शिवम दुबे, रविंद्र जडेजा, एमएस धोनी (विकेटकीपर), आर अश्विन, मीथाशा पथिराना, खलील अहमद, नूर अहमद।
Tokens: ['▁ऋतुराज', '▁गायकवाड़', '▁(', 'क', 'प्तान', '),', '▁डे', 'व', 'ोन', '▁कॉ', 'न', 'वे', ',', '▁रच', 'िन', '▁रविंद्र', ',', '▁राहुल', '▁त्रिपाठी', ',', '▁शिवम', '▁दुबे', ',', '▁रविंद्र', '▁जडेजा', ',', '▁एमएस', '▁धोनी', '▁(', 'व', 'िकेट', 'कीपर', '),', '▁आर', '▁अश्विन', ',', '▁मी', 'था', 'शा', '▁पथ', 'िर', 'ाना', ',', '▁ख', 'लील', '▁अहमद', ',', '▁नूर', '▁अहमद', '।']
Encoded: [1, 34862, 26967, 435, 61725, 29148, 1099, 1945, 61754, 1769, 1777, 61735, 981, 61750, 18114, 465, 19049, 61750, 4310, 11042, 61750, 21132, 13133, 61750, 19049, 13624, 61750, 18436, 12473, 435, 61754, 1956, 14572, 1099, 2208, 17618, 61750, 3352, 2063, 2500, 12020, 731, 781, 61750, 429, 13121, 9490, 61750, 26786, 9490, 61770]
Len Tokens 51
Decoded: <s> ऋतुराज गायकवाड़ (कप्तान), डेवोन कॉनवे, रचिन रविंद्र, राहुल त्रिपाठी, शिवम दुबे, रविंद्र जडेजा, एमएस धोनी (विकेटकीपर), आर अश्विन, मीथाशा पथिराना, खलील अहमद, नूर अहमद।
  • English
Input  : Bangalore and Chennai have faced each other in 33 matches in IPL. Out of these 33 games, Bangalore have won 11 whereas Chennai have come out victorious on 21 occasion. 1 match ended without a result.
Tokens: ['▁Bangalore', '▁and', '▁Chennai', '▁have', '▁faced', '▁each', '▁other', '▁in', '▁33', '▁matches', '▁in', '▁IPL', '.', '▁Out', '▁of', '▁these', '▁33', '▁games', ',', '▁Bangalore', '▁have', '▁won', '▁11', '▁whereas', '▁Chennai', '▁have', '▁come', '▁out', '▁vict', 'orious', '▁on', '▁21', '▁occasion', '.', '▁1', '▁match', '▁ended', '▁without', '▁a', '▁result', '.']
Encoded: [1, 43579, 317, 42140, 607, 21626, 1872, 1022, 313, 7736, 14838, 313, 9863, 61751, 7363, 319, 1517, 7736, 4837, 61750, 43579, 607, 4817, 1730, 22734, 42140, 607, 2968, 811, 9594, 30189, 395, 3209, 13423, 61751, 385, 5083, 13623, 2675, 262, 1773, 61751]
Len Tokens 42
Decoded: <s> Bangalore and Chennai have faced each other in 33 matches in IPL. Out of these 33 games, Bangalore have won 11 whereas Chennai have come out victorious on 21 occasion. 1 match ended without a result.
  • Math
Input  : % Change the font if you want to, depending on whether
% you're using pdflatex or xelatex/lualatex
% WHEN COMPILING WITH XELATEX PLEASE USE
% xelatex -shell-escape -output-driver="xdvipdfmx -z 0" sample.tex
\iftutex
  % If using xelatex or lualatex:
  \setmainfont{Roboto Slab}
  \setsansfont{Lato}
  \renewcommand{\familydefault}{\sfdefault}
\else
  % If using pdflatex:
  \usepackage[rm]{roboto}
  \usepackage[defaultsans]{lato}
  % \usepackage{sourcesanspro}
  \renewcommand{\familydefault}{\sfdefault}
\fi

Tokens: ['▁%', '▁Change', '▁the', '▁font', '▁if', '▁you', '▁want', '▁to', ',', '▁depending', '▁on', '▁whether', '\n', '%', '▁you', "'", 're', '▁using', '▁pd', 'fl', 'ate', 'x', '▁or', '▁x', 'el', 'ate', 'x', '/', 'l', 'ual', 'ate', 'x', '\n', '%', '▁WH', 'EN', '▁COMP', 'IL', 'ING', '▁WITH', '▁X', 'EL', 'ATE', 'X', '▁PLEASE', '▁USE', '\n', '%', '▁x', 'el', 'ate', 'x', '▁-', 'shell', '-', 'escape', '▁-', 'output', '-', 'driver', '="', 'xd', 'v', 'ip', 'df', 'mx', '▁-', 'z', '▁0', '"', '▁sample', '.', 'tex', '\n\\', 'ift', 'ut', 'ex', '\n', '▁', '▁%', '▁If', '▁using', '▁x', 'el', 'ate', 'x', '▁or', '▁l', 'ual', 'ate', 'x', ':\n', '▁', '▁\\', 'set', 'main', 'font', '{', 'Rob', 'oto', '▁Sl', 'ab', '}\n', '▁', '▁\\', 'sets', 'ans', 'font', '{', 'L', 'ato', '}\n', '▁', '▁\\', 'renew', 'command', '{\\', 'family', 'default', '}{\\', 'sf', 'default', '}\n\\', 'else', '\n', '▁', '▁%', '▁If', '▁using', '▁pd', 'fl', 'ate', 'x', ':\n', '▁', '▁\\', 'us', 'ep', 'ack', 'age', '[', 'rm', ']{', 'rob', 'oto', '}\n', '▁', '▁\\', 'us', 'ep', 'ack', 'age', '[', 'defaults', 'ans', ']{', 'l', 'ato', '}\n', '▁', '▁%', '▁\\', 'us', 'ep', 'ack', 'age', '{', 'sources', 'ans', 'pro', '}\n', '▁', '▁\\', 'renew', 'command', '{\\', 'family', 'default', '}{\\', 'sf', 'default', '}\n\\', 'fi', '\n']
Encoded: [1, 2920, 20717, 273, 9731, 686, 380, 1570, 306, 61750, 11224, 395, 4910, 61755, 61863, 380, 61809, 265, 2138, 32887, 2673, 442, 61792, 506, 1894, 335, 442, 61792, 61804, 61729, 1204, 442, 61792, 61755, 61863, 19877, 1920, 27670, 4809, 3922, 25404, 2470, 8534, 6586, 61859, 61046, 22326, 61755, 61863, 1894, 335, 442, 61792, 777, 34320, 61780, 35727, 777, 9020, 61780, 16819, 696, 25014, 61762, 947, 7497, 35801, 777, 61831, 612, 61798, 10079, 61751, 8032, 1207, 2865, 388, 1096, 61755, 61715, 2920, 1608, 2138, 1894, 335, 442, 61792, 506, 334, 1204, 442, 61792, 1025, 61715, 426, 1106, 6972, 5295, 61782, 32606, 5896, 11751, 403, 499, 61715, 426, 8105, 770, 5295, 61782, 61811, 10464, 499, 61715, 426, 48033, 10843, 557, 16861, 6694, 4661, 5157, 6694, 1495, 7019, 61755, 61715, 2920, 1608, 2138, 32887, 2673, 442, 61792, 1025, 61715, 426, 379, 953, 697, 626, 61846, 1628, 6219, 35451, 5896, 499, 61715, 426, 379, 953, 697, 626, 61846, 41970, 770, 6219, 61729, 10464, 499, 61715, 2920, 426, 379, 953, 697, 626, 61782, 43733, 770, 1194, 499, 61715, 426, 48033, 10843, 557, 16861, 6694, 4661, 5157, 6694, 1495, 12885, 61755]
Len Tokens 185
Decoded: <s> % Change the font if you want to, depending on whether
% you're using pdflatex or xelatex/lualatex
% WHEN COMPILING WITH XELATEX PLEASE USE
% xelatex -shell-escape -output-driver="xdvipdfmx -z 0" sample.tex
\iftutex
  % If using xelatex or lualatex:
  \setmainfont{Roboto Slab}
  \setsansfont{Lato}
  \renewcommand{\familydefault}{\sfdefault}
\else
  % If using pdflatex:
  \usepackage[rm]{roboto}
  \usepackage[defaultsans]{lato}
  % \usepackage{sourcesanspro}
  \renewcommand{\familydefault}{\sfdefault}
\fi
  • Code
Input  : class SentencePieceUnigramTokenizer(BaseTokenizer):
    """SentencePiece Unigram Tokenizer

    Represents the Unigram algorithm, with the pretokenization used by SentencePiece
    """

    def __init__(
        self,
        vocab: Optional[List[Tuple[str, float]]] = None,
        replacement: str = "▁",
        add_prefix_space: bool = True,
    ):
        if vocab is not None:
            # Let Unigram(..) fail if only one of them is None
            tokenizer = Tokenizer(Unigram(vocab))
        else:
            tokenizer = Tokenizer(Unigram())
Tokens: ['▁class', '▁Sentence', 'P', 'iece', 'Un', 'ig', 'ram', 'Token', 'izer', '(', 'Base', 'Token', 'izer', '):\n', '▁', '▁', '▁', '▁"""', 'Sentence', 'P', 'iece', '▁Un', 'ig', 'ram', '▁Token', 'izer', '\n\n', '▁', '▁', '▁', '▁Rep', 'resents', '▁the', '▁Un', 'ig', 'ram', '▁algorithm', ',', '▁with', '▁the', '▁pre', 'token', 'ization', '▁used', '▁by', '▁Sentence', 'P', 'iece', '\n', '▁', '▁', '▁', '▁"""\n\n', '▁', '▁', '▁', '▁def', '▁__', 'init', '__', '(\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁self', ',\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁voc', 'ab', ':', '▁Optional', '[', 'List', '[', 'Tuple', '[', 'str', ',', '▁float', ']]', ']', '▁=', '▁None', ',\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁replacement', ':', '▁str', '▁=', '▁"', '▁"', ',\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁add', '_', 'prefix', '_', 'space', ':', '▁bool', '▁=', '▁True', ',\n', '▁', '▁', '▁', '▁):\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁if', '▁voc', 'ab', '▁is', '▁not', '▁None', ':\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁#', '▁Let', '▁Un', 'ig', 'ram', '(', '..', ')', '▁fail', '▁if', '▁only', '▁one', '▁of', '▁them', '▁is', '▁None', '\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁token', 'izer', '▁=', '▁Token', 'izer', '(', 'Un', 'ig', 'ram', '(', 'voc', 'ab', '))\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁else', ':\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁token', 'izer', '▁=', '▁Token', 'izer', '(', 'Un', 'ig', 'ram', '())']
Encoded: [1, 946, 40517, 61803, 15343, 4952, 367, 1163, 12331, 7159, 61776, 9859, 12331, 7159, 2454, 61715, 61715, 61715, 7606, 59192, 61803, 15343, 2426, 367, 1163, 35304, 7159, 962, 61715, 61715, 61715, 4784, 13312, 273, 2426, 367, 1163, 10857, 61750, 437, 273, 1184, 13584, 2854, 1815, 597, 40517, 61803, 15343, 61755, 61715, 61715, 61715, 23853, 61715, 61715, 61715, 1178, 3771, 2999, 1390, 3488, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 1349, 622, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 26775, 403, 61799, 21329, 61846, 3412, 61846, 39253, 61846, 2572, 61750, 10030, 17241, 61847, 440, 4673, 622, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 13564, 61799, 1944, 440, 591, 591, 622, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 1173, 61768, 17279, 61768, 5529, 61799, 7165, 440, 8620, 622, 61715, 61715, 61715, 39708, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 686, 26775, 403, 366, 618, 4673, 1025, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 1478, 4593, 2426, 367, 1163, 61776, 786, 61775, 6272, 686, 1323, 882, 319, 1136, 366, 4673, 61755, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 14205, 7159, 440, 35304, 7159, 61776, 4952, 367, 1163, 61776, 44969, 403, 3630, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 2335, 1025, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 14205, 7159, 440, 35304, 7159, 61776, 4952, 367, 1163, 8170]
Len Tokens 226
Decoded: <s> class SentencePieceUnigramTokenizer(BaseTokenizer):
    """SentencePiece Unigram Tokenizer

    Represents the Unigram algorithm, with the pretokenization used by SentencePiece
    """

    def __init__(
        self,
        vocab: Optional[List[Tuple[str, float]]] = None,
        replacement: str = " ",
        add_prefix_space: bool = True,
    ):
        if vocab is not None:
            # Let Unigram(..) fail if only one of them is None
            tokenizer = Tokenizer(Unigram(vocab))
        else:
            tokenizer = Tokenizer(Unigram())
  • Emoji
Input  : 😜🫤☹️😖🤢🤮😇🐻‍❄️🦄🐾🐽🐍🦞🦐🦿🤴🧑‍🦲👨‍🚒👨‍🚀
Tokens: ['▁', '😜', '<0xF0>', '<0x9F>', '<0xAB>', '<0xA4>', '☹', '️', '😖', '🤢', '🤮', '😇', '<0xF0>', '<0x9F>', '<0x90>', '<0xBB>', '\u200d', '❄', '️', '🦄', '🐾', '<0xF0>', '<0x9F>', '<0x90>', '<0xBD>', '🐍', '<0xF0>', '<0x9F>', '<0xA6>', '<0x9E>', '<0xF0>', '<0x9F>', '<0xA6>', '<0x90>', '<0xF0>', '<0x9F>', '<0xA6>', '<0xBF>', '<0xF0>', '<0x9F>', '<0xA4>', '<0xB4>', '<0xF0>', '<0x9F>', '<0xA7>', '<0x91>', '\u200d', '<0xF0>', '<0x9F>', '<0xA6>', '<0xB2>', '👨', '\u200d', '<0xF0>', '<0x9F>', '<0x9A>', '<0x92>', '👨', '\u200d', '🚀']
Encoded: [1, 61715, 64694, 243, 162, 174, 167, 66250, 62096, 68719, 68725, 70665, 68209, 243, 162, 147, 190, 62658, 66107, 62096, 70672, 69452, 243, 162, 147, 192, 66921, 243, 162, 169, 161, 243, 162, 169, 147, 243, 162, 169, 194, 243, 162, 167, 183, 243, 162, 170, 148, 62658, 243, 162, 169, 181, 66362, 62658, 243, 162, 157, 149, 66362, 62658, 62748]
Len Tokens 61
Decoded: <s> 😜🫤☹️😖🤢🤮😇🐻‍❄️🦄🐾🐽🐍🦞🦐🦿🤴🧑‍🦲👨‍🚒👨‍🚀
  • Sanskrit
Input  : ॐ त्र्यम्बकं यजामहे सुगन्धिं पुष्टिवर्धनम् उर्वारुकमिव बन्धनान्मृत्योर्मुक्षीय मामृतात् ॐ. 
Tokens: ['▁ॐ', '▁त्र', '्यम', '्ब', 'क', 'ं', '▁य', 'जाम', 'हे', '▁सुग', 'न्', 'धि', 'ं', '▁पुष्ट', 'िव', 'र्', 'धन', 'म्', '▁उर्', 'वार', 'ुक', 'म', 'िव', '▁ब', 'न्', 'धन', 'ान', '्म', 'ृत', '्यो', 'र्म', 'ु', 'क्ष', 'ीय', '▁माम', 'ृता', 'त्', '▁ॐ', '.', '▁']
Encoded: [1, 29916, 5202, 1347, 9931, 61725, 61738, 411, 8036, 11065, 38930, 1063, 3255, 61738, 61196, 816, 326, 4408, 3315, 10236, 527, 1908, 61742, 816, 289, 1063, 4408, 325, 640, 1803, 4364, 911, 61763, 536, 2770, 940, 27338, 592, 29916, 61751, 61715]
Len Tokens 41
Decoded: <s> ॐ त्र्यम्बकं यजामहे सुगन्धिं पुष्टिवर्धनम् उर्वारुकमिव बन्धनान्मृत्योर्मुक्षीय मामृतात् ॐ.
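The emoji example above shows byte fallback at work: 🫤 is not in the 72k vocabulary, so it is emitted as its four UTF-8 bytes `<0xF0> <0x9F> <0xAB> <0xA4>` and reassembled losslessly on decode. A pure-Python sketch of that mechanism (illustrative, not the SentencePiece implementation itself):

```python
def byte_fallback_pieces(ch):
    """Render a character as SentencePiece-style byte-fallback pieces."""
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]

def decode_byte_pieces(pieces):
    """Reassemble byte-fallback pieces back into the original text."""
    return bytes(int(p[3:5], 16) for p in pieces).decode("utf-8")
```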