Why is "bos_token": null, in tokenizer_config.json?
I don't understand the reason for the line `"bos_token": null,` in tokenizer_config.json.
Please help us understand your reasons. I would simply replace it with the expected BOS token, but I do not want to assume this would be the correct choice.
@hzhwcmhf
@yangapku
I also saw a discrepancy between the `bos_token` and `bos_token_id` settings in the `tokenizer_config.json` and `config.json` files during continued pre-training, which led to an error. I resolved the issue by setting the `bos_token` in `tokenizer_config.json` like you described. I've also reported the issue in the discussion here.
I'm also interested in hearing the official response from the Qwen team. If this is a bug, it would be better to have it addressed.
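A mismatch like the one described above can be spotted before training by reading the BOS-related fields from each file side by side. The following is only a minimal sketch: the helper names `bos_settings` and `has_bos_mismatch` are hypothetical, and it assumes the three JSON files sit directly in the model directory.

```python
import json
from pathlib import Path


def load_json(path):
    """Read a JSON file, returning an empty dict when the file is missing."""
    path = Path(path)
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}


def bos_settings(model_dir):
    """Collect the BOS-related fields from each config file in model_dir."""
    model_dir = Path(model_dir)
    tok = load_json(model_dir / "tokenizer_config.json")
    cfg = load_json(model_dir / "config.json")
    gen = load_json(model_dir / "generation_config.json")
    return {
        "tokenizer_config.bos_token": tok.get("bos_token"),
        "config.bos_token_id": cfg.get("bos_token_id"),
        "generation_config.bos_token_id": gen.get("bos_token_id"),
    }


def has_bos_mismatch(settings):
    """True when the tokenizer declares no BOS token but a config file still has a bos_token_id."""
    no_token = settings["tokenizer_config.bos_token"] is None
    has_id = any(
        settings[key] is not None
        for key in ("config.bos_token_id", "generation_config.bos_token_id")
    )
    return no_token and has_id
```

Calling `has_bos_mismatch(bos_settings("path/to/model"))` on a local checkout (the path is a placeholder) reports whether the files disagree; it does not decide which value is the right one.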
I did a lot of research and tests today, and `null` is valid, depending on what you want to do with the model. For instance, if you quantize the model with llama.cpp and it encounters `null` as `eos_token` or `bos_token` in tokenizer_config.json, it will automatically fall back to `bos_token_id` and `eos_token_id` in the config.json file.
The fallback order seems to be: tokenizer_config.json > config.json > generation_config.json, at least for quantizing with llama.cpp.
Note that, based on https://github.com/huggingface/transformers/issues/25395#issuecomment-1671075332, it is known that "in the past, the model config held both model parameters (like number of layers) and generate parameterization (like forcing tokens at generate time). That is suboptimal, as you may e.g. wish to have several generation configurations for the same model." This info may already be out of date, though.
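The fallback order described above can be written down as a small lookup. This is only a sketch of the order as observed in this thread, not llama.cpp's actual conversion code; the function name `resolve_special_token` is hypothetical, and each argument is assumed to be the already-parsed JSON content of the corresponding file.

```python
def resolve_special_token(name, tokenizer_config, config, generation_config):
    """Return (source, value) for the first file that defines the token.

    `name` is "bos" or "eos". The lookup order mirrors the one observed in
    this thread: tokenizer_config.json, then config.json, then
    generation_config.json. A null (None) entry counts as "not defined".
    """
    candidates = [
        ("tokenizer_config.json", tokenizer_config.get(f"{name}_token")),
        ("config.json", config.get(f"{name}_token_id")),
        ("generation_config.json", generation_config.get(f"{name}_token_id")),
    ]
    for source, value in candidates:
        if value is not None:
            return source, value
    return None, None


# Placeholder values for illustration only, not taken from a real model.
source, value = resolve_special_token(
    "bos",
    tokenizer_config={"bos_token": None},
    config={"bos_token_id": 1},
    generation_config={"bos_token_id": 1},
)
print(source, value)
```

With `bos_token` set to `null` in the tokenizer config, the lookup falls through to the `bos_token_id` from config.json, which matches the behaviour reported above.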
There is no `bos_token` for Qwen models. It is not necessary to prepend a control token to every input sequence. However, many frameworks assume that there is a `bos_token`, and they do prepend a control token to every input sequence. If that is the case, we recommend setting it to `<|endoftext|>`, because most of the time it has no effect, so it does less harm. However, if one is willing to investigate, it is better to check the data processing procedure to make sure no other assumptions are made, and to modify it so that a `bos_token` is not needed.
That is to say:
- as a standalone model, as well as a standalone tokenizer (tokenizer_config.json), the `bos_token` should be `null` or `None`, which is the original setting
- as part of a framework, as in `transformers` (generation_config.json or the legacy config.json), which requires the `bos_token` to function, the `bos_token` is recommended to be set to `<|endoftext|>`; this is purely for compatibility
@tanliboy `trl` has supported models without a `bos_token` in this PR.
@ThiloteE there is a metadata field called `tokenizer.ggml.add_bos_token` in the GGUF files, and when converting Qwen models, you should set it to `false`.
@jklj077 Thank you for the clarification. I suspected this might be the case.
For the record: adding `"add_bos_token": false` to the tokenizer_config.json sets `tokenizer.ggml.add_bos_token` to `false` during quantization to a GGUF file with the `convert_hf_to_gguf.py` script as provided by llama.cpp.
I have provided a GGUF with the corrected config at https://huggingface.co/GPT4All-Community/Qwen2-7B-Instruct-GGUF
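For anyone who wants to apply the `add_bos_token` change described above to a local copy of the model before converting, here is a minimal sketch. The helper name `set_add_bos_token` is hypothetical, and it assumes tokenizer_config.json sits directly in the model directory; it only edits the JSON file and does not run the conversion itself.

```python
import json
from pathlib import Path


def set_add_bos_token(model_dir, value=False):
    """Write "add_bos_token": <value> into tokenizer_config.json and return the updated config."""
    path = Path(model_dir) / "tokenizer_config.json"
    config = json.loads(path.read_text(encoding="utf-8"))
    config["add_bos_token"] = value
    # ensure_ascii=False keeps any non-ASCII token strings readable in the file.
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return config
```

After calling `set_add_bos_token("path/to/model")` (the path is a placeholder), the existing `"bos_token": null` entry is left untouched and only the `add_bos_token` key is added or overwritten.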