Tokenizer model in SentencePiece format

#8
by vubiqus - opened

Hello,

I would like to use the tokenizer, which is supposedly a SentencePiece tokenizer. I think the tokenizer is an important piece of the CroissantLLM project. Unfortunately, the repo contains no file in the SentencePiece format, so we cannot use the tokenizer outside of the Hugging Face libraries.

In Llama 2, the tokenizer is saved in the SentencePiece format as tokenizer.model.
There is another open discussion about the availability of tokenizer.model in CroissantLLM: https://huggingface.co/croissantllm/CroissantLLMBase/discussions/4

Would it be possible to release the tokenizer in the SentencePiece save format?

CroissantLLM org

Hello!
I think we only ever kept the fast version of the tokenizer (use_fast = True) and never had to rely on the original SentencePiece tokenizer.model format...

This is similar to what is done in https://huggingface.co/meta-llama/Meta-Llama-3-8B/.

I don't have any more files than you sadly...

https://github.com/huggingface/transformers/issues/21289
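
For anyone who only has the fast tokenizer: it is normally serialized as a single JSON file, so its vocabulary can at least be inspected without transformers. Below is a minimal sketch, not an official export path. It assumes the repo ships a file named tokenizer.json with the usual "model" / "vocab" layout; the function name summarize_fast_tokenizer is hypothetical and only used for illustration. It does not produce a tokenizer.model file.

```python
import json
from pathlib import Path


def summarize_fast_tokenizer(path):
    """Read a fast-tokenizer JSON file and summarize its model section.

    Assumption: the file follows the usual layout, with a top-level
    "model" object that contains a "vocab" entry.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    model = data["model"]
    vocab = model["vocab"]

    if isinstance(vocab, dict):
        # BPE / WordPiece style: {token: id}; order tokens by their id.
        tokens = sorted(vocab, key=vocab.get)
    else:
        # Unigram style: [[token, score], ...]; the list index is the id.
        tokens = [entry[0] for entry in vocab]

    return {
        "type": model.get("type"),  # e.g. "BPE" or "Unigram", if present
        "vocab_size": len(tokens),
        "first_tokens": tokens[:5],
    }


# Example call (assumes tokenizer.json was downloaded next to this script):
# print(summarize_fast_tokenizer("tokenizer.json"))
```

The same file can also be loaded outside of transformers with the standalone tokenizers package (Tokenizer.from_file), which may be enough if the goal is simply to tokenize text without the full transformers stack.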

manu changed discussion status to closed
