FIM tokens not marked special
#4 · by ruediste · opened
Hi
I debugged the tokenizer stack for a few hours before discovering that the FIM tokens (<|fim_prefix|>, <|fim_middle|>, <|fim_suffix|>, etc.) are not marked as special. Is there a reason for this? Below is an excerpt from tokenizer.json:
{
  "id": 151660,
  "content": "<|fim_middle|>",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": false,
  "special": false
},
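For anyone who wants to see the consequence directly, here is a minimal sketch (assuming the tokenizer loads straight from the Hub under the id Qwen/Qwen2.5-Coder-7B): because "special" is false, skip_special_tokens=True does not strip the FIM markers during decoding.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B")

# The FIM markers are registered as added tokens, so they still
# tokenize to single ids:
ids = tok.encode("<|fim_prefix|>a = <|fim_suffix|>\n<|fim_middle|>", add_special_tokens=False)

# ...but with "special": false they survive decoding even when
# special tokens are asked to be skipped:
print(tok.decode(ids, skip_special_tokens=True))
# -> '<|fim_prefix|>a = <|fim_suffix|>\n<|fim_middle|>'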
To reproduce, download tokenizer.json, tokenizer_config.json and vocab.json into a directory (e.g. path\to\your\Qwen\Qwen2.5-Coder-7B) and run the code below:
from transformers import AutoTokenizer

model_dir = r'path\to\your\Qwen\Qwen2.5-Coder-7B'
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path=model_dir,
    local_files_only=True,
)
# Workaround: re-register the FIM tokens with special_tokens=True so the
# tokenizer treats them like the other control tokens:
tokenizer.add_tokens(
    ["<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>", "<|fim_pad|>"],
    special_tokens=True,
)
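After that the markers should behave like the other control tokens; a quick sanity check (a sketch of the expected behavior, not verified output):

fim_id = tokenizer.convert_tokens_to_ids("<|fim_middle|>")
# With the token now flagged special, skip_special_tokens drops it:
print(tokenizer.decode([fim_id], skip_special_tokens=True))  # -> ''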