Vocab size in config.json mismatches the actual tokenizer size

#4
by Fizzarolli - opened

The tokenizer.json has 151665 tokens, but config.json gives a vocab size of 152064 (i.e. the same size as the original Qwen tokenizer, not the new one).
Is the config.json broken?



Wait, after checking more, I'm even more confused... The vocab size in the model weights themselves (i.e. the LM head shape) matches the one in config.json, but the uploaded tokenizer has a different size.
Was it trained with the Qwen tokenizer rather than the DeepSeek one used for most of the other models?



Oh god, all the Qwen ones are broken.

Hi @Fizzarolli, the vocab size in config.json is the size of the input and output embeddings, which can be larger than the number of tokens in the tokenizer. The additional embeddings are not trained and do not affect model performance.
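For anyone who wants to check this on their own copy, here is a minimal sketch of the comparison. It assumes a BPE-style tokenizer.json layout (a `model.vocab` mapping of token to id, plus an `added_tokens` list) and a `vocab_size` key in config.json; the function name `vocab_mismatch` is made up for illustration. With the numbers from this thread, 152064 - 151665 leaves 399 embedding rows that no token id points at.

```python
import json


def vocab_mismatch(config: dict, tokenizer: dict) -> dict:
    """Compare the embedding size in config.json with the ids in tokenizer.json.

    Assumes a BPE-style tokenizer.json: model.vocab maps token -> id and
    added_tokens is a list of {"id": ..., "content": ...} entries.
    """
    ids = set(tokenizer["model"]["vocab"].values())
    ids.update(tok["id"] for tok in tokenizer.get("added_tokens", []))

    embedding_rows = config["vocab_size"]
    return {
        "embedding_rows": embedding_rows,
        "tokenizer_tokens": len(ids),
        # Rows of the embedding matrix that no tokenizer id refers to.
        "unused_rows": len(set(range(embedding_rows)) - ids),
        # True when every token id has a row in the embedding matrix.
        "ids_fit": max(ids) < embedding_rows,
    }


# Toy data standing in for the real files. For a real checkout, load them with
# json.load(open("config.json")) and json.load(open("tokenizer.json")).
toy_config = {"vocab_size": 8}
toy_tokenizer = {
    "model": {"vocab": {"a": 0, "b": 1, "c": 2}},
    "added_tokens": [{"id": 3, "content": "<eos>"}, {"id": 4, "content": "<pad>"}],
}

report = vocab_mismatch(toy_config, toy_tokenizer)
print(json.dumps(report, indent=2))
```

A larger `embedding_rows` with `ids_fit` true matches the situation described above: every token has an embedding and the extra rows are simply never indexed. The opposite case, where `ids_fit` is false, would be a genuine bug.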

That's good! But the mismatch still causes issues with some training/inference backends (e.g. I tried to tune the model in Axolotl; it created a weird LM head layer, and the model didn't work properly after merging because of how the head layers were built).
