special tokens in prompt with ggml/examples/starcoder

#3
by mljxy - opened

Using the starcoder example in ggml, the special tokens in the prompt are not tokenized correctly. For example:

main: token[0] =     46, <
main: token[1] =    110, |
main: token[2] =   2946, system
main: token[3] =  28318, |>

The correct tokenization should map <|system|> to the single token 49152 instead. The same incorrect tokenization happens for <|user|>, <|assistant|>, and <|end|>.
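
To illustrate the expected behaviour, here is a minimal Python sketch of the usual approach: split the prompt on the special-token strings first, map each of them directly to its id, and only run the ordinary tokenizer on the text in between. This is not the ggml code. The function names and the `encode_ordinary` callback are hypothetical, and the only id taken from this thread is 49152 for <|system|>.

```python
import re


def split_on_special_tokens(prompt, special_tokens):
    """Split prompt into (text, is_special) pieces, keeping special tokens whole."""
    if not special_tokens:
        return [(prompt, False)] if prompt else []
    # Try longer tokens first so a token that is a prefix of another cannot shadow it.
    alternatives = sorted(special_tokens, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(t) for t in alternatives) + ")")
    pieces = []
    # With a capturing group, re.split also returns the matched separators.
    for part in pattern.split(prompt):
        if part:
            pieces.append((part, part in special_tokens))
    return pieces


def encode(prompt, special_tokens, encode_ordinary):
    """Encode prompt; special tokens bypass the ordinary tokenizer entirely."""
    ids = []
    for text, is_special in split_on_special_tokens(prompt, special_tokens):
        if is_special:
            ids.append(special_tokens[text])
        else:
            ids.extend(encode_ordinary(text))
    return ids


# Id for <|system|> as reported above; everything else here is a toy stand-in.
SPECIAL_TOKENS = {"<|system|>": 49152}


def toy_encode_ordinary(text):
    # Placeholder for the real BPE step: one id per character.
    return [ord(c) for c in text]


ids = encode("<|system|>hi", SPECIAL_TOKENS, toy_encode_ordinary)
```

With this split in place, <|system|> comes out as one id rather than the four pieces shown in the log, and the surrounding text is tokenized as before.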

This was fixed last week: https://github.com/ggerganov/ggml/commit/e456108433017d5586b35fd36ce781b4c3aed631

But I think it was only partially fixed. Something is still off here: I can't get SantaCoder to emit token 49152 (<|end|>), and the GGML inference diverges from what the HF model does.
