Vocab size in config.json mismatches the actual tokenizer size
The `tokenizer.json` has 151665 tokens, but `config.json` says 152064 (i.e. the same size as the original Qwen tokenizer, not the new one). Broken `config.json`?
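For anyone who wants to reproduce the comparison, a minimal check along these lines works (the repo id is a placeholder, not the exact model under discussion):

```python
from transformers import AutoConfig, AutoTokenizer

repo_id = "org/model-name"  # placeholder for the repo being discussed

tokenizer = AutoTokenizer.from_pretrained(repo_id)
config = AutoConfig.from_pretrained(repo_id)

print("tokenizer size:", len(tokenizer))        # tokens in tokenizer.json, incl. added/special tokens
print("config vocab_size:", config.vocab_size)  # embedding / LM head size declared in config.json
```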
Wait, checking more, now I'm even more confused... the vocab size in the model weights themselves (i.e. the LM head shape) and the one in `config.json` are the same... but the uploaded tokenizer is different...
Was it trained with the Qwen tokenizer, rather than the DeepSeek one used for most of the other models?
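For reference, this is roughly how I compared the shapes (a sketch that loads the full weights; the repo id is again a placeholder):

```python
from transformers import AutoModelForCausalLM

repo_id = "org/model-name"  # placeholder

model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto")

# Both tensors should be [vocab_size, hidden_size]; the first dimension is what
# config.json reports, independent of how many tokens tokenizer.json defines.
print("input embeddings:", model.get_input_embeddings().weight.shape)
print("LM head:", model.get_output_embeddings().weight.shape)
```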
oh god, all the Qwen ones are broken
Hi @Fizzarolli, the vocab size in `config.json` indicates the size of the input and output embeddings, which can be larger than the number of tokens in the tokenizer. The additional embeddings are not trained and do not affect model performance.
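To make that concrete, a rough sanity check (with a placeholder repo id) shows that the tokenizer never emits IDs that index into those extra, untrained rows:

```python
from transformers import AutoConfig, AutoTokenizer

repo_id = "org/model-name"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(repo_id)
config = AutoConfig.from_pretrained(repo_id)

ids = tokenizer("a quick test sentence").input_ids
# Every ID the tokenizer can produce is below len(tokenizer), so the rows
# between len(tokenizer) and config.vocab_size are never looked up.
assert max(ids) < len(tokenizer) <= config.vocab_size
```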
That's good to know! The mismatch still causes issues with some training/inference backends, though (e.g. I tried to tune the model in Axolotl; it created a weird LM head layer and the merged model didn't actually work properly because of how it built the head layers).
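One possible workaround for others hitting this (a sketch, not verified on this particular model; the repo id is a placeholder) is to shrink the embeddings to the tokenizer's size before training, so downstream tools see consistent shapes:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "org/model-name"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto")

# Truncate the input embeddings and LM head to the tokenizer's size;
# this also updates config.vocab_size so backends see matching shapes.
model.resize_token_embeddings(len(tokenizer))
assert model.config.vocab_size == len(tokenizer)

model.save_pretrained("model-resized")
tokenizer.save_pretrained("model-resized")
```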