Fast Tokenizer
#17
by
JaronTHU
- opened
Can you provide a fast tokenizer, like internlm2_5-7b-chat has?
Thanks for your attention.
The current HF tokenizer library doesn't fully support the features in sentencepiece, and we have adopted a new vocabulary construction method, involving multi-step construction and merging, that relies on features the HF fast tokenizer does not support. Thus, we cannot provide a precise fast tokenizer based on the current HF tokenizer library.
In our testing, the speed difference between the normal tokenizer and the fast tokenizer is not that large. If you need to tokenize a large corpus, I suggest parallelizing tokenization over the data.
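The data-parallel approach could be sketched roughly like this, splitting the corpus into chunks and tokenizing each chunk in a separate process. A whitespace split stands in for the real tokenizer so the sketch stays self-contained; `tokenize_chunk` and `parallel_tokenize` are illustrative names, not part of any library.

```python
from multiprocessing import Pool


def tokenize_chunk(texts):
    # Stand-in tokenizer: a plain whitespace split. In practice you would
    # load the model's (slow) tokenizer once per worker process, e.g. via
    # a Pool initializer, and call it here instead.
    return [t.split() for t in texts]


def parallel_tokenize(corpus, num_workers=4):
    # Split the corpus into roughly equal chunks, one per worker.
    chunk_size = (len(corpus) + num_workers - 1) // num_workers
    chunks = [corpus[i:i + chunk_size]
              for i in range(0, len(corpus), chunk_size)]
    # Each worker tokenizes its chunk independently; results come back
    # in order, so flattening preserves the original corpus order.
    with Pool(num_workers) as pool:
        results = pool.map(tokenize_chunk, chunks)
    return [tokens for chunk in results for tokens in chunk]


if __name__ == "__main__":
    corpus = [f"example sentence number {i}" for i in range(100)]
    tokenized = parallel_tokenize(corpus)
    print(len(tokenized))  # one token list per input text
```

Since each text is tokenized independently, this scales close to linearly with the number of workers for large corpora, which usually matters more than the slow-vs-fast tokenizer gap.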