Why are "add_bos_token" and "add_eos_token" missing in tokenizer_config.json ?
Without these two keys in the tokenizer_config.json, I find it impossible to initialize the Llama-3 tokenizer with BOS-token addition disabled.
This behaves as expected:
from transformers import AutoTokenizer
llama2_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama3_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> llama2_tok("hello")
{'input_ids': [1, 22172], 'attention_mask': [1, 1]}
>>> llama3_tok("hello")
{'input_ids': [128000, 15339], 'attention_mask': [1, 1]}
As we can see, the BOS token is added correctly by both tokenizers.
Let's now try to disable adding the BOS token and enable adding the EOS token:
from transformers import AutoTokenizer
llama2_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", add_bos_token=False, add_eos_token=True)
llama3_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", add_bos_token=False, add_eos_token=True)
>>> llama2_tok("hello")
{'input_ids': [22172, 2], 'attention_mask': [1, 1]} <----- Good. BOS token not added, EOS token added.
>>> llama3_tok("hello")
{'input_ids': [128000, 15339], 'attention_mask': [1, 1]} <----- Not good. BOS token added, EOS token not added.
As can be seen, Llama-3 completely ignored the given add_bos_token and add_eos_token.
From what I have been able to trace, this might be due to the missing add_bos_token and add_eos_token keys in the tokenizer_config.json of the Llama-3 model.
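To verify this, here is a minimal sketch that fetches each repo's tokenizer_config.json and checks for the two keys (it assumes you have access to the gated meta-llama repositories; the expected output reflects the configs at the time of writing):

import json
from huggingface_hub import hf_hub_download

for repo in ("meta-llama/Llama-2-7b-hf", "meta-llama/Meta-Llama-3-8B"):
    path = hf_hub_download(repo, "tokenizer_config.json")
    with open(path) as f:
        cfg = json.load(f)
    # Missing keys show up as None.
    print(repo, {k: cfg.get(k) for k in ("add_bos_token", "add_eos_token")})

For Llama-2 this prints both keys with their values; for Llama-3 both come back as None.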
Hey! This is unfortunately expected for now, and the template processor should be updated. If you check, the tokenizer class used for the two models is not the same.
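For reference, checking the classes from the snippet above should show something like this (the class names are what I'd expect with a current transformers install, not verified output):

>>> type(llama2_tok).__name__
'LlamaTokenizerFast'
>>> type(llama3_tok).__name__
'PreTrainedTokenizerFast'

LlamaTokenizerFast exposes add_bos_token/add_eos_token and rebuilds its post-processor when they change, while plain PreTrainedTokenizerFast ignores those kwargs and keeps whatever post-processor is serialized in tokenizer.json.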
More details here: https://github.com/huggingface/transformers/issues/30947#issuecomment-2128057992
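Until that is fixed, one workaround sketch (not an official API: _tokenizer is a private attribute of fast tokenizers, and the template below assumes Llama-3's current special tokens) is to override the serialized post-processor yourself, or simply skip automatic special tokens and append EOS manually:

from tokenizers.processors import TemplateProcessing
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Option 1: replace the post-processor so only EOS is appended (no BOS).
eos = tok.eos_token
tok._tokenizer.post_processor = TemplateProcessing(
    single=f"$A {eos}",
    pair=f"$A {eos} $B {eos}",
    special_tokens=[(eos, tok.eos_token_id)],
)
print(tok("hello"))  # expected: {'input_ids': [15339, 128001], ...}

# Option 2: disable automatic special tokens and append EOS by hand.
ids = tok("hello", add_special_tokens=False)["input_ids"]
ids.append(tok.eos_token_id)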