Tokenizer is incorrectly tokenizing '<|im_start|>' and '<|im_end|>' as strings

#5
by Light4Bear - opened

'<|im_start|>' and '<|im_end|>' are not marked as special tokens, so they are tokenized as ordinary strings instead of as single special tokens.

>>> import transformers
>>> tokenizer = transformers.AutoTokenizer.from_pretrained("models/jondurbin_bagel-34b-v0.4")
>>> tokenizer.decode([6])
'<|im_start|>'
>>> tokenizer.encode(tokenizer.decode([6]))
[59666, 59705, 622, 59593, 5858, 46826]

So was 1/4 of the training done on the wrong tokenization?
I do find that responses using the ChatML format are worse than with Alpaca.
