Fix chat template

#56

by wonderboy - opened Jul 30

base: refs/heads/main

←

from: refs/pr/56


wonderboy

Jul 30

When using the apply_chat_template function, an extra space is added after "[INST]", which the original Mistral code does not add. Removing the extra space from the Jinja template appears to resolve the issue.

Fix chat template 2e7ef1b6

pandora-s

Mistral AI_ org Jul 31

What's important is that the tokenizer from the HF transformers implementation matches mistral-common one-to-one, as it's the ground truth. Does the current one not match in your experiments? 🤔

wonderboy

Jul 31

•

edited Jul 31

To keep it simple: both tokenizers produce the same encoded tokens when given the same text, e.g. <s>[INST]Hello[/INST], so this isn't a tokenizer issue. However, the HF apply_chat_template function adds an extra space: instead of the text in the previous example, it returns <s>[INST] Hello[/INST], with a space after the [INST] token. This is fixed by editing the Jinja template used.
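To make the difference concrete, here is a minimal sketch in plain Python. The two functions are hypothetical stand-ins that only mimic what the template renders for a single user turn before and after the fix; they are not the actual Jinja template from tokenizer_config.json.

```python
# Hypothetical stand-ins for the two template variants discussed above.
# They only mimic the rendered string for one user message; the real
# chat template is a Jinja template stored in tokenizer_config.json.

BOS = "<s>"


def render_with_space(user_text: str) -> str:
    """Mimics the template before the fix: a space follows [INST]."""
    return f"{BOS}[INST] {user_text}[/INST]"


def render_without_space(user_text: str) -> str:
    """Mimics the template after the fix: no space follows [INST]."""
    return f"{BOS}[INST]{user_text}[/INST]"


before = render_with_space("Hello")
after = render_without_space("Hello")

# repr() makes the single-character difference easy to spot.
print(repr(before))
print(repr(after))
print("identical:", before == after)
```

Because the rendered strings differ by one character, whatever the tokenizer does afterwards starts from different input, even though the tokenizer itself is unchanged.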

pandora-s

Mistral AI_ org Jul 31

Mistral common provides both a debug string and the encoded tokens, so I'm curious whether both still match, or whether this new implementation will make the debug string not match. Usually we run a test script that compares both the encoded tokens and the debug strings, checking that they match one-to-one with mistral_common.
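A check of that kind could look roughly like the sketch below. This is not the actual test script: the Encoding type and the compare helper are hypothetical, and the token ids are placeholders rather than real vocabulary ids. In practice the reference values would come from mistral_common and the candidate values from the HF tokenizer.

```python
from typing import List, NamedTuple


class Encoding(NamedTuple):
    """Hypothetical container for one tokenizer's output."""
    tokens: List[int]  # encoded token ids
    text: str          # debug string


def compare(reference: Encoding, candidate: Encoding) -> List[str]:
    """Return a list of mismatches; an empty list means a one-to-one match."""
    problems = []
    if reference.tokens != candidate.tokens:
        problems.append("token ids differ")
    if reference.text != candidate.text:
        problems.append(
            f"debug strings differ: {reference.text!r} vs {candidate.text!r}"
        )
    return problems


# Placeholder values for illustration only; these are not real token ids.
reference = Encoding(tokens=[1, 3, 100, 4], text="<s>[INST]Hello[/INST]")
candidate = Encoding(tokens=[1, 3, 101, 4], text="<s>[INST] Hello[/INST]")

for problem in compare(reference, candidate):
    print(problem)
```

Checking both fields separately makes it clear whether a mismatch comes from the rendered text, the token ids, or both.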

wonderboy

Jul 31

•

edited Jul 31

If you look at the HF apply_chat_template function, it first applies the Jinja template, so with the current template the text is converted from Hello to <s>[INST] Hello[/INST], and only then is it sent to the tokenizer. I think it should pass the test, since no change is made to the actual HF tokenizer.

pandora-s

Mistral AI_ org Jul 31

After some manual confirmation, you are 100% right: the current chat template should not have a space. It seems there was a mix-up between the old v3 and v3 tekken!

I will do some validations and will most likely merge this once everything seems fine, thank you!

Update tokenizer_config.json 666e097f

pandora-s

Mistral AI_ org Jul 31

Looks perfect now! Thank you again, will be merging this.

pandora-s changed pull request status to merged Jul 31
