special tokens in prompt with ggml/examples/starcoder
#3 by mljxy
Using the starcoder example in ggml, the special tokens in the prompt do not get tokenized correctly. For example:
main: token[0] = 46, <
main: token[1] = 110, |
main: token[2] = 2946, system
main: token[3] = 28318, |>
The correct tokenization should map <|system|> to 49152 instead. The same incorrect tokenization happens with <|user|>, <|assistant|>, and <|end|>.
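To illustrate the behavior I would expect, here is a minimal Python sketch of special-token-aware tokenization. It is not the ggml code: `tokenize_with_specials` and `toy_tokenize` are hypothetical names, and `toy_tokenize` is just a byte-level stand-in for the real BPE tokenizer. The only ID taken from this thread is <|system|> = 49152. The idea is to split the prompt on the special tokens first, so each one maps to a single ID and only the text in between goes through the ordinary tokenizer.

```python
import re


def tokenize_with_specials(prompt, special_tokens, tokenize_plain):
    """Tokenize `prompt`, mapping each special token to its single ID.

    special_tokens: dict of token string -> ID.
    tokenize_plain: callable used for the text between special tokens.
    """
    if not special_tokens:
        return tokenize_plain(prompt)
    # Longest first, so a token that is a prefix of another cannot shadow it.
    ordered = sorted(special_tokens, key=len, reverse=True)
    # The capturing group makes re.split keep the special tokens in the result.
    pattern = re.compile("(" + "|".join(re.escape(t) for t in ordered) + ")")
    ids = []
    for piece in pattern.split(prompt):
        if not piece:
            continue  # re.split yields empty strings around adjacent matches
        if piece in special_tokens:
            ids.append(special_tokens[piece])
        else:
            ids.extend(tokenize_plain(piece))
    return ids


def toy_tokenize(text):
    # Hypothetical stand-in for the real tokenizer: one ID per UTF-8 byte.
    return list(text.encode("utf-8"))


special = {"<|system|>": 49152}  # ID taken from this thread
print(tokenize_with_specials("<|system|>hi", special, toy_tokenize))
```

With this approach <|system|> comes out as one token rather than being split into <, |, system, and |>.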
This was fixed last week: https://github.com/ggerganov/ggml/commit/e456108433017d5586b35fd36ce781b4c3aed631
But only kinda-sorta fixed, I think. There's still something up here: I can't get SantaCoder to spit out token 49152 (<|end|>), and the GGML inference diverges from what the HF model does.