LlamaTokenizerFast.from_pretrained gives incorrect number of tokens for Llama3
I noticed that when I load a Llama3 tokenizer by directly using the LlamaTokenizerFast class, I get an extra <unk> token that is not present in the original tokenizer.
I usually prefer to use AutoTokenizer, but I had always assumed the two would be equivalent, so I was surprised by the inconsistency. It caused a bug that was quite hard to track down because I did not expect it at all.
I wasn't sure if I was supposed to create an issue for this or not.
Repro steps:
>>> import transformers
>>> tok1 = transformers.LlamaTokenizerFast.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> print(len(tok1))
128257
>>> tok1.unk_token_id
128256
>>> tok2 = transformers.AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> print(len(tok2))
128256
>>> tok2.unk_token_id # returns None instead
Hey! This is expected: you should not be using LlamaTokenizer[Fast] for Llama3, as the tokenizer is different. Use either PreTrainedTokenizerFast or AutoTokenizer.
When you load with LlamaTokenizerFast, it adds the unk token (depending on the version of tokenizers you are using).
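For comparison, here is what the recommended path looks like; this is just an illustrative sketch (using the same model id as the repro above), showing that the vocabulary size and unk token then match the original tokenizer:

>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> len(tok)  # matches the original vocabulary size, no extra token injected
128256
>>> tok.unk_token_id is None  # Llama 3 defines no unk token
True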
Thanks for the response!
I understand that; my point is that this is an easy mistake to make, especially if you are upgrading from Llama 2, and it is going to be hard to catch.
My proposal was to either add a warning to LlamaTokenizer[Fast] or at least a note on the Llama 3 documentation page highlighting the tokenizer change from Llama 2 to 3.
I can create the PR if you think that's the right approach.
So, when using AutoTokenizer to load the tokenizer, how do I set the unk_token_id (it is None after init)?
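Not an official answer, just a minimal sketch of one way to do it: if you genuinely need an unk token, you can register one yourself with add_special_tokens, which appends a new token to the vocabulary (so the model's embedding matrix has to be resized to match):

>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tok.add_special_tokens({"unk_token": "<unk>"})  # appends "<unk>" since it is not in the vocab
1
>>> tok.unk_token_id  # the newly added id at the end of the vocabulary (here 128256, the next free id)
128256
>>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> _ = model.resize_token_embeddings(len(tok))  # grow the embeddings to cover the new id

Whether you actually want an unk token for Llama 3 is a separate question, since its byte-level BPE tokenizer can encode arbitrary input and normally never produces unknown tokens.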