Missing tokenizer.model

#4
by Jeronymous - opened

Hello,
Congrats and thanks for sharing CroissantLLM :)

We noticed that the "tokenizer.model" file is missing, which can cause issues in some workflows.
See for instance https://github.com/huggingface/transformers/issues/29137

CroissantLLM org

Hello!
I think we only ever kept the fast version of the tokenizer (use_fast=True) and never had to rely on the original sentencepiece tokenizer.model file...

This is similar to what is done in https://huggingface.co/meta-llama/Meta-Llama-3-8B/.

I don't have any more files than you sadly...

https://github.com/huggingface/transformers/issues/21289

Did you manage to solve this on your end?

manu changed discussion status to closed

Testing CroissantLLM with the slow tokenizer was a one-off need.
I just mentioned the possible error here, in case it was unintentional.
But I think it's fine like this. There's no real practical use for the slow tokenizer (instead of the fast one).
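For anyone else who hits this: the difference comes down to which files a checkpoint ships. A minimal sketch of the distinction (the helper name `available_tokenizers` is hypothetical, not part of transformers) — "tokenizer.json" backs the fast Rust tokenizer, while "tokenizer.model" is the sentencepiece file the slow tokenizer needs:

```python
import os

def available_tokenizers(model_dir):
    """Hypothetical helper: report which tokenizer variants a local
    checkpoint directory can support, based on the files it contains."""
    variants = []
    # "tokenizer.json" serialises the fast (Rust-backed) tokenizer.
    if os.path.isfile(os.path.join(model_dir, "tokenizer.json")):
        variants.append("fast")
    # "tokenizer.model" is the original sentencepiece file, needed only
    # by the slow Python tokenizer.
    if os.path.isfile(os.path.join(model_dir, "tokenizer.model")):
        variants.append("slow")
    return variants
```

For a checkpoint that only ships "tokenizer.json", passing use_fast=True to AutoTokenizer.from_pretrained avoids any fallback to the slow path.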

Thank you!
