When using apply_chat_template function, an extra space is added after "[INST]", which original Mistral code does not. After removing extra space from Jinja template, the issue seems to have been solved.

Mistral AI_ org

Whats important is the tokenizer from the HF transformers implementation to match one on one mistral-common as its the ground truth, does the current one not match from your experiments? 🤔

To keep it simple, both tokenizers encoded tokens match given both handle the same text, eg: <s>[INST]Hello[/INST], so this isn't a tokenizer issue. However, you do get an extra space when the HF apply_chat_template function is used, so instead of getting the same text as the previous example, the HF apply_chat_template function returns <s>[INST] Hello[/INST], adding space after the [INST] token. This is fixed by editing the Ninja template used.

Mistral AI_ org

Mistral common provides both a Debug string and the Encoded tokens, so Im curious to know if they both match or if this new implementation will make the debug string not match. Usually we run a test script where we compare the tokenizer with both the encoded tokens and debug strings to match one-on-one with mistral_common.

If you look at the HF apply_chat_template function, it first applies the Ninja template so text is converted from Hello to <s>[INST] Hello[/INST], and then sent to the tokenizer, so I think it should pass the test, given no change is done to the actual HF tokenizer.

Mistral AI_ org

After some manual confirmation you are 100% right, the current chat template should not have a space, seems like there was some mix in with the old v3 vs v3 tekken !
image.png

I will do some validations and will most likely merge this once everything seems fine, thank you!

Mistral AI_ org

image.png
looks perfect now! Thank you again, will be merging this

pandora-s changed pull request status to merged

Sign up or log in to comment