The correctness of the result using transformers apply_chat_template

#92
by Annorita - opened

The latest transformers release (4.34.0) adds a tokenizer method, "apply_chat_template", that builds the prompt string from a list of chat messages. For example:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b-instruct")
chat = [
  {"role": "user", "content": "USER_INSTRUCTION_1"},
  {"role": "assistant", "content": "RESPONSE_1"},
  {"role": "user", "content": "USER_INSTRUCTION_2"},
  {"role": "assistant", "content": "RESPONSE_2"},
]
res = tokenizer.apply_chat_template(chat, tokenize=False)

Falcon does not have its own tokenizer class, so transformers uses PreTrainedTokenizerFast directly and applies that class's default template:
"{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}"

(check here: default_chat_template)
So the result of the example will be:

'<|im_start|>user\nUSER_INSTRUCTION_1<|im_end|>\n<|im_start|>assistant\nRESPONSE_1<|im_end|>\n<|im_start|>user\nUSER_INSTRUCTION_2<|im_end|>\n<|im_start|>assistant\nRESPONSE_2<|im_end|>\n'
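To make it easier to see what the template does without loading the tokenizer, here is a minimal pure-Python sketch of the same loop. The helper name render_default_template is mine (it is not part of transformers); it only mirrors the Jinja template quoted above.

```python
def render_default_template(messages):
    """Pure-Python mirror of the default Jinja template quoted above.

    Illustrative helper only; it is not a transformers function.
    """
    return "".join(
        "<|im_start|>" + m["role"] + "\n" + m["content"] + "<|im_end|>" + "\n"
        for m in messages
    )


chat = [
    {"role": "user", "content": "USER_INSTRUCTION_1"},
    {"role": "assistant", "content": "RESPONSE_1"},
    {"role": "user", "content": "USER_INSTRUCTION_2"},
    {"role": "assistant", "content": "RESPONSE_2"},
]

rendered = render_default_template(chat)
print(repr(rendered))
```

Each message becomes one "<|im_start|>role\ncontent<|im_end|>\n" chunk, and the chunks are concatenated in order.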

However, if we encode this string and then decode the token ids one by one, we find that the tokenizer does not recognize special tokens such as <|im_start|> and splits them into ordinary sub-tokens:

marker = '<|im_start|>'
ids = tokenizer.encode(marker)
# Decode each id separately to see how the marker was split.
pieces = [tokenizer.decode(token_id) for token_id in ids]
# pieces == ['<', '|', 'im', '_', 'start', '|>']
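A quicker way to run the same check for several markers at once is to look them up in the vocabulary. This is only a sketch: missing_markers is a hypothetical helper and toy_vocab is a made-up vocabulary for illustration. With the real tokenizer you would pass tokenizer.get_vocab() instead.

```python
def missing_markers(vocab, markers):
    """Return the markers that have no single-token entry in `vocab`."""
    return [m for m in markers if m not in vocab]


# Toy vocabulary for illustration only (NOT Falcon's real vocabulary).
# With a real tokenizer: vocab = tokenizer.get_vocab()
toy_vocab = {"<": 0, "|": 1, "im": 2, "_": 3, "start": 4, "|>": 5, "<|endoftext|>": 6}

markers = ["<|im_start|>", "<|im_end|>", "<|endoftext|>"]
missing = missing_markers(toy_vocab, markers)
print(missing)
```

Any marker that shows up in the returned list will be split into ordinary sub-tokens when the prompt is encoded, which is the behavior observed above.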

Is this the right template to use with the Falcon model?

No, that's just the default template that transformers falls back to.
Proper format (...well, at least somewhat official): https://huggingface.co/tiiuae/falcon-7b-instruct/discussions/1#64708b0a3df93fddece002a4
Apparently the model wasn't trained on any single fixed format, so it seems to be "whatever works". The format is the whole point of instruct training, and I really do not know why so many model trainers do not properly share the format they used...
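If you settle on a format of your own, you can override the default by assigning a Jinja string to tokenizer.chat_template. The sketch below uses a made-up plain-text "role: content" layout purely as a placeholder; it is an assumption for illustration, not the format from the linked discussion and not a confirmed Falcon training format. The Python function render_custom is a hypothetical stand-in that mirrors the template so the example runs without transformers.

```python
# Hypothetical plain-text layout; NOT confirmed as Falcon's training format.
# With a real tokenizer this string could be assigned to tokenizer.chat_template
# before calling tokenizer.apply_chat_template(chat, tokenize=False).
CUSTOM_TEMPLATE = (
    "{% for message in messages %}"
    "{{ message['role'] + ': ' + message['content'] + '\n' }}"
    "{% endfor %}"
)


def render_custom(messages):
    """Pure-Python mirror of CUSTOM_TEMPLATE (illustrative stand-in)."""
    return "".join(f"{m['role']}: {m['content']}\n" for m in messages)


chat = [
    {"role": "user", "content": "USER_INSTRUCTION_1"},
    {"role": "assistant", "content": "RESPONSE_1"},
]

prompt = render_custom(chat)
print(prompt)
```

Because this layout uses only plain text, it avoids markers like <|im_start|> that the tokenizer would split into pieces.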
