Missing tokenizer.model

#4
by Jeronymous - opened

Hello,
Congrats and thanks for sharing CroissantLLM :)

We noticed that the "tokenizer.model" file is missing, which can cause issues in some workflows.
See for instance https://github.com/huggingface/transformers/issues/29137

CroissantLLM org

Hello!
I think we only ever kept the fast version of the tokenizer (use_fast=True) and never had to rely on the original sentencepiece tokenizer.model file...

This is similar to what is done in https://huggingface.co/meta-llama/Meta-Llama-3-8B/.

I don't have any more files than you sadly...

https://github.com/huggingface/transformers/issues/21289

Did you manage to solve this on your end?

manu changed discussion status to closed

Testing CroissantLLM with the slow tokenizer was a one-off need.
I just mentioned the possible error here, in case it was unintentional.
But I think it's fine like this. There's no real practical use for the slow tokenizer (instead of the fast one).
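For anyone else who hits this: the difference comes down to which files a checkpoint ships. A minimal sketch of the distinction (the helper name `available_tokenizers` is hypothetical, not part of transformers) — "tokenizer.json" backs the fast Rust tokenizer, while "tokenizer.model" is the sentencepiece file the slow tokenizer needs:

```python
import os

def available_tokenizers(model_dir):
    """Hypothetical helper: report which tokenizer variants a local
    checkpoint directory can support, based on the files it contains."""
    variants = []
    # "tokenizer.json" serialises the fast (Rust-backed) tokenizer.
    if os.path.isfile(os.path.join(model_dir, "tokenizer.json")):
        variants.append("fast")
    # "tokenizer.model" is the original sentencepiece file, needed only
    # by the slow Python tokenizer.
    if os.path.isfile(os.path.join(model_dir, "tokenizer.model")):
        variants.append("slow")
    return variants
```

For a checkpoint that only ships "tokenizer.json", passing use_fast=True to AutoTokenizer.from_pretrained avoids any fallback to the slow path.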

Thank you!
