Fast Tokenizer

#17
by JaronTHU - opened

Can you provide a fast tokenizer like internlm2_5-7b-chat?

InternLM org

Thanks for your attention.
The current HF tokenizer does not fully support the features of sentencepiece, and we have adopted a new vocabulary construction method, involving multi-step construction and merging, which relies on features that the HF fast tokenizer does not support. Thus, we cannot provide an exact fast tokenizer based on the current HF tokenizers library.
In our testing, the speed difference between the normal (slow) tokenizer and the fast tokenizer is not that large. If you need to tokenize a large corpus, I suggest parallelizing the tokenization across the data.
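A minimal sketch of data-parallel tokenization with Python's `multiprocessing`. The `tokenize` function below is a placeholder (it just splits on whitespace); in real use you would load the model's slow tokenizer once per worker, e.g. with `AutoTokenizer.from_pretrained(...)` from `transformers`, and call its `encode` method instead:

```python
from multiprocessing import Pool

def tokenize(text):
    # Placeholder for a real tokenizer call, e.g. tokenizer.encode(text).
    # In practice, load the tokenizer once per worker process (a module-level
    # global initialized lazily, or via Pool's initializer argument).
    return text.split()

def tokenize_corpus(texts, workers=4, chunksize=64):
    # Split the corpus across worker processes; each worker tokenizes
    # its own chunk, so the slow tokenizer runs in parallel on the data.
    with Pool(processes=workers) as pool:
        return pool.map(tokenize, texts, chunksize=chunksize)

if __name__ == "__main__":
    corpus = ["hello world", "data parallel tokenization"] * 1000
    results = tokenize_corpus(corpus, workers=2)
    print(len(results))  # one token list per input text
```

With a chunked `pool.map`, each worker receives batches of texts rather than one at a time, which keeps inter-process overhead low for large corpora.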
