FIM-Tokens not marked special

#4 · opened by ruediste

Hi,
I debugged the tokenizer stack for a few hours until I discovered that the FIM tokens (<|fim_prefix|>, <|fim_middle|>, <|fim_suffix|>, etc.) are not marked as special. Is there a reason for this? Below is an excerpt from tokenizer.json:

{
  "id": 151660,
  "content": "<|fim_middle|>",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": false,
  "special": false
},
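
For reference, this is easy to reproduce from Python (a minimal check; I am assuming the public Hub id Qwen/Qwen2.5-Coder-7B here). Because the flag is false, the tokens never appear in the special-token list, and decode(..., skip_special_tokens=True) will not strip them:

from transformers import AutoTokenizer

# Assumes the public Hub id; point this at a local path if preferred.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B")

# The FIM tokens are in the vocabulary, but with special=False they
# are not reported as special tokens.
print(tok.convert_tokens_to_ids("<|fim_middle|>"))  # 151660
print("<|fim_middle|>" in tok.all_special_tokens)   # False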

Download tokenizer.json, tokenizer_config.json, and vocab.json into a directory, e.g. path\to\your\Qwen\Qwen2.5-Coder-7B, and run the code below:

from transformers import AutoTokenizer

model_dir = r'path\to\your\Qwen\Qwen2.5-Coder-7B'
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path=model_dir,
    local_files_only=True,
)

# Re-add the FIM tokens with special_tokens=True; on recent transformers
# versions this flips the special flag even for tokens already in the vocab.
fim_tokens = ["<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>"]
tokenizer.add_tokens(fim_tokens, special_tokens=True)
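
To verify the fix took effect (a quick sanity check; the expected outputs assume the token ids from the excerpt above):

# With the special flag set, skip_special_tokens now strips the FIM
# markers during decoding.
ids = tokenizer.encode("<|fim_prefix|>hello", add_special_tokens=False)
print(tokenizer.decode(ids, skip_special_tokens=True))   # hello
print(tokenizer.decode(ids, skip_special_tokens=False))  # <|fim_prefix|>hello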
