LlamaTokenizerFast.from_pretrained gives an incorrect number of tokens for Llama 3

#156
by farzadab - opened

I noticed that when I load a Llama 3 tokenizer directly with the LlamaTokenizerFast class, I get an extra <unk> token that is not present in the original tokenizer.
I usually prefer to use AutoTokenizer, but I had always assumed the two would be equivalent, and I'm surprised by the inconsistency. It caused a bug that was quite hard to track down, since I did not expect it at all.

I wasn't sure if I was supposed to create an issue for this or not.

Repro steps:

>>> import transformers
>>> tok1 = transformers.LlamaTokenizerFast.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> print(len(tok1))
128257
>>> tok1.unk_token_id
128256
>>> tok2 = transformers.AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> print(len(tok2))
128256
>>> tok2.unk_token_id   # returns None instead

Meta Llama org

Hey! This is expected: you should not use LlamaTokenizer[Fast] for Llama 3, because its tokenizer is different. Use either PreTrainedTokenizerFast or AutoTokenizer.
When you load it with LlamaTokenizerFast, the class adds the unk token (depending on the version of tokenizers you are using).
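For example, a quick check with the recommended classes (the exact counts below assume the current meta-llama/Meta-Llama-3-8B checkpoint and recent transformers/tokenizers versions, so treat them as illustrative):

>>> import transformers
>>> tok = transformers.AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> type(tok).__name__   # AutoTokenizer resolves to the generic fast tokenizer here
'PreTrainedTokenizerFast'
>>> len(tok)
128256
>>> tok_fast = transformers.PreTrainedTokenizerFast.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> len(tok_fast)   # same vocabulary size, no extra <unk>
128256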

Thanks for the response!
I understand that; my point was that this is an easy mistake to make, especially if you're upgrading from Llama 2, and it's going to be hard to catch.

My proposal was to either add a warning in LlamaTokenizer[Fast], or at least add a note to the Llama 3 documentation page pointing out how significant the tokenizer change from Llama 2 to 3 is.
I can create the PR if you think that's the right approach.
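To make the idea concrete, here is roughly the kind of guard I have in mind; the repo-name check, helper name, and message are purely illustrative, not how transformers actually detects the model:

# Illustrative sketch only; a real patch would live inside LlamaTokenizerFast.from_pretrained
# and would need a more robust check than the repository name.
import warnings

def _warn_if_llama3_checkpoint(pretrained_model_name_or_path: str) -> None:
    if "llama-3" in pretrained_model_name_or_path.lower():
        warnings.warn(
            "Llama 3 uses a different tokenizer than Llama 2; loading it with "
            "LlamaTokenizerFast may add an extra <unk> token. Consider "
            "AutoTokenizer or PreTrainedTokenizerFast instead."
        )

_warn_if_llama3_checkpoint("meta-llama/Meta-Llama-3-8B")  # would emit the warning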

So when I use AutoTokenizer to load the tokenizer, how do I set the unk_token_id (it is initialized to None)?
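Not an official answer, but a minimal sketch using the standard special-token APIs. Registering a brand-new <unk> grows the vocabulary, so any model used with the tokenizer would also need model.resize_token_embeddings(len(tok)); reusing <|end_of_text|> below is just one arbitrary choice of an existing special token.

>>> import transformers
>>> tok = transformers.AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> # Option 1: register a new <unk> token (grows the vocab from 128256 to 128257)
>>> tok.add_special_tokens({"unk_token": "<unk>"})
1
>>> tok.unk_token_id
128256
>>> # Option 2: reuse a token that already exists in the vocab, leaving its size unchanged
>>> tok.unk_token = "<|end_of_text|>"
>>> tok.unk_token_id == tok.convert_tokens_to_ids("<|end_of_text|>")
True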
