Tokenizer config is wrong

#10

by stoshniwal - opened Jan 21

Jan 21

https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/blob/2d78713b01ecefe27a89fafec248a5dfd731396f/tokenizer_config.json#L33

LlamaTokenizerFast -> Qwen2Tokenizer

JaheimLee

Jan 22

Qwen always uses Qwen2Tokenizer.

stoshniwal

Jan 22

edited Jan 22

Sorry, I've updated the tokenizer class in the first comment. The current tokenizer config gives the tokenizer class as LlamaTokenizerFast.
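For anyone who wants to check a local copy, here is a minimal sketch that reads the `tokenizer_class` field from a downloaded `tokenizer_config.json` and optionally rewrites it. The helper name `check_tokenizer_class` and the file path are hypothetical, and `Qwen2Tokenizer` is only the value proposed in this thread, not a confirmed fix.

```python
import json
from pathlib import Path

# Assumption: the class proposed in this thread, not an official recommendation.
EXPECTED_CLASS = "Qwen2Tokenizer"


def check_tokenizer_class(config_path, expected=EXPECTED_CLASS, fix=False):
    """Return (declared_class, matches) for a local tokenizer_config.json.

    If fix=True and the declared class differs, the field is rewritten in place.
    The returned values always describe the file as it was before any rewrite.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    declared = config.get("tokenizer_class")  # None if the key is missing
    matches = declared == expected
    if fix and not matches:
        config["tokenizer_class"] = expected
        path.write_text(
            json.dumps(config, indent=2, ensure_ascii=False) + "\n",
            encoding="utf-8",
        )
    return declared, matches
```

Calling `check_tokenizer_class("tokenizer_config.json")` on an unmodified copy of the file linked above would be expected to report `LlamaTokenizerFast`; passing `fix=True` only edits the local file.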

jsalix

Jan 22

@bartowski sorry if this is something you were already aware of, but could this be causing some of the issues with local usage? I checked, and it seems all the Qwen-based distills have the same Llama tokenizer class instead of the Qwen one used by their respective base models.
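One way to test whether the class mismatch actually changes tokenization is to encode the same strings with both tokenizers and compare the IDs. The sketch below is only a rough illustration: `diff_encodings` is a hypothetical helper that takes any two encode functions, and loading the real tokenizers is left out because it needs network access.

```python
def diff_encodings(encode_a, encode_b, samples):
    """Return (text, ids_a, ids_b) for every sample the two encoders disagree on."""
    mismatches = []
    for text in samples:
        ids_a = list(encode_a(text))
        ids_b = list(encode_b(text))
        if ids_a != ids_b:
            mismatches.append((text, ids_a, ids_b))
    return mismatches


# Hypothetical usage with the transformers library (not run here):
#   from transformers import AutoTokenizer
#   distill = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")
#   base = AutoTokenizer.from_pretrained(<the corresponding Qwen base model repo>)
#   diff_encodings(distill.encode, base.encode, ["Hello world", "1 + 1 = 2"])
```

An empty result on a representative set of prompts would suggest the declared class makes no practical difference for those inputs; it would not rule out differences in special tokens or chat templates.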

bartowski

Jan 23

It seeeeems unlikely, since llama.cpp uses its own tokenizer. However, it is possible that the existing conversion code was based on an incorrect tokenizer.

But that still shouldn't be a problem for the final result, I think.

I've seen people have better results with lower temperature and proper prompting

@ngxson any thoughts?

ngxson

Jan 23

For GGUF, the tokenizer is defined by the Model class, not the Tokenizer class, so the value in tokenizer_config.json doesn't matter.

bartowski

Jan 23

That's what I thought, thanks for confirming!

Fizzarolli

Jan 24

https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/discussions/4
I think all the Qwen ones may or may not be completely busted and have the wrong tokenizer config and special tokens (both in llama.cpp and transformers) :/

jamesbraza

26 days ago

To share, here's a separate reason the tokenizer config is dangerous: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/discussions/21
