Incomplete tokenizer conversion?
#3 opened by anisse
This GGUF conversion does not seem to have the same properties as other llama-based BPE tokenizers: in particular, many ASCII and otherwise valid Unicode characters are impossible to tokenize. I wrote a small program to illustrate the issue:
https://github.com/ggerganov/llama.cpp/pull/6988
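For reference, here is a rough sketch of the same kind of check using llama-cpp-python (the linked program is the authoritative reproduction; the model path, the `vocab_only` flag, and the exception handling here are my assumptions):

```python
# Hypothetical sketch: sweep single printable characters through the GGUF
# tokenizer and record which ones fail. Note that a hard assert inside the
# llama.cpp C code may abort the whole process instead of raising a Python
# exception, so this try/except is best-effort only.
import string

from llama_cpp import Llama

# vocab_only loads just the tokenizer, not the weights (assumed path)
llm = Llama(model_path="croissantllmchat-v0.1.Q8_0.gguf", vocab_only=True)

failed = []
for ch in string.printable:
    try:
        llm.tokenize(ch.encode("utf-8"), add_bos=False)
    except Exception:
        failed.append(ch)

print(f"{len(failed)} printable ASCII characters could not be tokenized:")
print(failed)
```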
This also exposes a limitation of llama.cpp itself: when it cannot tokenize something, it does not fall back to the <unk> token; it crashes.
I haven't verified whether this croissantllm-chat tokenizer limitation is specific to this GGUF conversion or whether it is also present in the original model.
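If someone wants to check the original, a minimal sketch of the comparison via transformers (the Hub model id "croissantllm/CroissantLLMChat-v0.1" is my guess; characters that encode to nothing but <unk> would suggest the limitation predates the conversion):

```python
# Hypothetical sketch: run the same character sweep against the original
# Hugging Face tokenizer and see which characters collapse to <unk>.
import string

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("croissantllm/CroissantLLMChat-v0.1")

unk_only = [
    ch for ch in string.printable
    if tok.encode(ch, add_special_tokens=False) == [tok.unk_token_id]
]
print(f"{len(unk_only)} characters map to nothing but {tok.unk_token}:")
print(unk_only)
```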
Any updates on this? I can't even load the Croissant GGUF models with llama.cpp right now. I'm trying to load croissantllmchat-v0.1.Q8_0.gguf, but without success.