LlamaTokenizerFast.from_pretrained gives incorrect number of tokens for Llama3
I noticed that when I load a Llama3 tokenizer by directly using the LlamaTokenizerFast class, I get an extra <unk> token that is not present in the original tokenizer.
I usually prefer to use AutoTokenizer, but I had always assumed the two would be equivalent, so I was surprised by the inconsistency. It caused a bug that was quite hard to track down because I did not expect it at all.
I wasn't sure if I was supposed to create an issue for this or not.
Repro steps:
>>> import transformers
>>> tok1 = transformers.LlamaTokenizerFast.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> print(len(tok1))
128257
>>> tok1.unk_token_id
128256
>>> tok2 = transformers.AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> print(len(tok2))
128256
>>> tok2.unk_token_id # returns None instead
Hey! This is expected: you should not be using LlamaTokenizer[Fast] for Llama3, as the tokenizer is different. Use either PreTrainedTokenizerFast or AutoTokenizer.
When you load with LlamaTokenizerFast, it adds the unk token (depending on the version of tokenizers you are using).
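For comparison, here is what the recommended path looks like; this is just an illustrative sketch (using the same model id as the repro above), showing that the vocabulary size and unk token then match the original tokenizer:

>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> len(tok)  # matches the original vocabulary size, no extra token injected
128256
>>> tok.unk_token_id is None  # Llama 3 defines no unk token
True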
Thanks for the response!
I understand that; my point is that this is an easy mistake to make, especially if you are upgrading from Llama 2, and it is going to be hard to catch.
My proposal was to either add a warning to LlamaTokenizer[Fast] or at least a note on the Llama 3 documentation page highlighting the tokenizer change from Llama 2 to 3.
I can create the PR if you think that's the right approach.
So, when using AutoTokenizer to load the tokenizer, how do I set the unk_token_id (it is None after init)?
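Not an official answer, just a minimal sketch of one way to do it: if you genuinely need an unk token, you can register one yourself with add_special_tokens, which appends a new token to the vocabulary (so the model's embedding matrix has to be resized to match):

>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tok.add_special_tokens({"unk_token": "<unk>"})  # appends "<unk>" since it is not in the vocab
1
>>> tok.unk_token_id  # the newly added id at the end of the vocabulary (here 128256, the next free id)
128256
>>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> _ = model.resize_token_embeddings(len(tok))  # grow the embeddings to cover the new id

Whether you actually want an unk token for Llama 3 is a separate question, since its byte-level BPE tokenizer can encode arbitrary input and normally never produces unknown tokens.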