Loading SPM tokenizer shows 32000 vocab size instead of 32064
The Phi-3 paper claims that the Phi-3 Mini SentencePiece tokenizer has a vocab size of 32064. However, when I load the tokenizer with the following code, the vocab size of the saved model is only 32000.
>>> from sentencepiece import SentencePieceProcessor
>>> tokenizer = SentencePieceProcessor()
>>> tokenizer.load(PATH_TO_TOKENIZER_MODEL)
>>> tokenizer.vocab_size()
32000
What am I doing wrong here? And how does this work correctly in Hugging Face (HF)?
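For reference, this is the HF-side comparison I have in mind (assuming the microsoft/Phi-3-mini-4k-instruct checkpoint and a recent transformers version; the outputs are what I would expect, not pasted verbatim):

>>> from transformers import AutoConfig, AutoTokenizer
>>> hf_tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
>>> hf_tok.vocab_size  # base SentencePiece vocab only
32000
>>> len(hf_tok)  # base vocab plus added special tokens
32011
>>> AutoConfig.from_pretrained("microsoft/Phi-3-mini-4k-instruct").vocab_size
32064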
I also manually examined the tokenizer.json file, which only includes piece IDs up to 31999.
I see token IDs going up to 32010 in the added_tokens key, though.
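For example, a quick way to confirm that (assuming tokenizer.json follows the usual HF tokenizers layout, where added_tokens is a list of {"id": ..., "content": ...} entries):

>>> import json
>>> with open("tokenizer.json") as f:
...     tok_json = json.load(f)
...
>>> max(entry["id"] for entry in tok_json["added_tokens"])
32010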
True, but that is still not 32064, so I would expect the embedding size to mismatch and loading to fail.
The base tokenizer has 32000 tokens, plus the added special tokens at IDs 32000 through 32010, for 32011 IDs in total.
The embedding size is then padded up to the next multiple of 64, which is 32064. Keeping these dimensions a multiple of 64 gives noticeably better matrix-multiplication throughput on Ampere or Hopper hardware: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
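In other words, the model's embedding table is deliberately larger than the tokenizer. A minimal sketch of the round-up (the helper name is mine, not something from the Phi-3 code):

>>> def pad_to_multiple(n, multiple=64):
...     # round n up to the next multiple of `multiple`
...     return ((n + multiple - 1) // multiple) * multiple
...
>>> pad_to_multiple(32011)  # 32000 base pieces + added IDs 32000-32010
32064

The extra 53 rows of the embedding and LM-head matrices are never produced by the tokenizer; they only exist to keep the shapes hardware-friendly. (If I recall correctly, transformers exposes the same trick via the pad_to_multiple_of argument of model.resize_token_embeddings.)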