Add <|im_start|> as a special token to tokenizer_config.json

#4
by bartowski - opened

This fixes tokenization of the <|im_start|> token
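For reference, a minimal sketch of the kind of entry this adds to tokenizer_config.json, assuming the standard added_tokens_decoder format (the token ID 6 comes from the discussion below; the exact fields in the actual diff may differ):

```json
{
  "added_tokens_decoder": {
    "6": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": ["<|im_start|>"]
}
```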

Hi 👋 @bartowski, thank you very much! Could you show us how you are specifically using it (for example, the inference code)? That would help us accurately reproduce your issue. Thanks again!

Ah interesting, I actually downloaded the model, and using AutoTokenizer from transformers it does tokenize correctly.
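A quick way to double-check that, as a sketch assuming a local copy of this repo at a hypothetical path:

```python
from transformers import AutoTokenizer

# "path/to/model" is a placeholder; point it at the downloaded repo.
tokenizer = AutoTokenizer.from_pretrained("path/to/model")

# With <|im_start|> registered as a special token, this should come
# back as a single ID rather than six pieces.
print(tokenizer.encode("<|im_start|>", add_special_tokens=False))  # e.g. [6]
```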

However, with GGUF, this missing entry causes <|im_start|> to tokenize as:

59666 -> '<'
59705 -> '|'
  622 -> 'im'
59593 -> '_'
 5858 -> 'start'
46826 -> '|>'

which causes degraded generation. After this change, GGUF happily tokenizes <|im_start|> as token 6. Not sure why it breaks llama.cpp and not transformers, but there you have it! Up to you whether you want to include it :)
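For anyone wanting to reproduce the GGUF side, a sketch using llama-cpp-python (a tool of my choosing, not something from this thread; the model path is a placeholder, and special=True is needed so the tokenizer is allowed to match special tokens):

```python
from llama_cpp import Llama

# "model.gguf" is a placeholder for the converted file;
# vocab_only=True loads just the tokenizer, not the weights.
llm = Llama(model_path="model.gguf", vocab_only=True)

# Before the fix this returns six token IDs; after, a single ID (6).
print(llm.tokenize(b"<|im_start|>", add_bos=False, special=True))
```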

Thank you very much for your contribution, @bartowski .

haijian06 changed pull request status to merged
