Different Behavior of Mistral Tokenizer and Huggingface Tokenizer

#58
by magic282 - opened

test case:

messages = [
    {"role": "system", "content": "You are helpful assistant."},
    {"role": "user", "content": "Hello."},
    {"role": "assistant", "content": "Hello there!"},
    {"role": "user", "content": "Who is Trump?"},
]

Mistral Tokenizer:

# Imports from mistral-common; load the tokenizer version matching your model.
from mistral_common.protocol.instruct.messages import (
    AssistantMessage,
    SystemMessage,
    UserMessage,
)
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

tokenizer = MistralTokenizer.v3()  # assumed; use the version for your model

# Convert the plain dict messages into mistral-common message objects.
m_messages = []
for m in messages:
    if m['role'] == 'user':
        m_messages.append(UserMessage(content=m['content']))
    elif m['role'] == 'assistant':
        m_messages.append(AssistantMessage(content=m['content']))
    elif m['role'] == 'system':
        m_messages.append(SystemMessage(content=m['content']))
completion_request = ChatCompletionRequest(messages=m_messages)

tokens = tokenizer.encode_chat_completion(completion_request).tokens

output:

[1, 3, 22177, 1046, 4, 22177, 2156, 1033, 2, 3, 4568, 1584, 20351, 27089, 1338, 31500, 1395, 22279, 1063, 4]

Huggingface Tokenizer:

model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

output:

tensor([[    1,     3, 45383,  1046,     4, 45383,  2156,  1033,     2,     3,
          3213,  1584, 20351, 27089,  1338, 31500,  1395, 22279,  1063,     4]],
       device='cuda:0')

The difference is that the HF tokenizer's output appears to insert a space after token id 3, so it produces 45383 instead of 22177 (and 3213 instead of 4568) for the tokens that follow the instruction-start token.
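A quick way to locate where the two outputs diverge is to compare the id sequences position by position. This is a minimal sketch using the two sequences above (the helper `first_divergence` is not part of either library, just an illustration):

```python
def first_divergence(a, b):
    """Return (index, a_value, b_value) at the first mismatch, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i, x, y
    return None

mistral_ids = [1, 3, 22177, 1046, 4, 22177, 2156, 1033, 2, 3,
               4568, 1584, 20351, 27089, 1338, 31500, 1395, 22279, 1063, 4]
hf_ids = [1, 3, 45383, 1046, 4, 45383, 2156, 1033, 2, 3,
          3213, 1584, 20351, 27089, 1338, 31500, 1395, 22279, 1063, 4]

print(first_divergence(mistral_ids, hf_ids))  # (2, 22177, 45383)
```

The first mismatch sits right after the `[INST]` token (id 3), which is consistent with an extra leading space being prepended to the message content by the chat template.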

Is this expected?

Mistral AI_ org

Hi @magic282 , I just fixed this yesterday. Do you still get the same error with the current, updated version?

@pandora-s
I was using an older version. I just pulled the latest and the outputs now match the Mistral tokenizer:

tensor([[    1,     3, 22177,  1046,     4, 22177,  2156,  1033,     2,     3,
          4568,  1584, 20351, 27089,  1338, 31500,  1395, 22279,  1063,     4]],
       device='cuda:0')
magic282 changed discussion status to closed
