Tokenizer not adding BOS

#4
by andreasgrv - opened

Hi,

According to the generation config of this model:

{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "transformers_version": "4.39.3"
}

There is a BOS token with ID 1. However, with transformers==4.49.0:

from transformers import AutoTokenizer
model = "HuggingFaceFW/ablation-model-fineweb-edu"
tokenizer = AutoTokenizer.from_pretrained(model)
tokenizer.encode('', return_tensors='pt', add_special_tokens=True)

Out[63]: tensor([], size=(1, 0))

Is this expected? I would expect BOS to be added. I also checked whether setting use_fast=False changes this, but the output is the same.
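
As a possible stopgap, the BOS id could be prepended by hand after encoding. The sketch below assumes that ID 1 from the generation config really is the BOS id the model expects at the start of the input, which this thread does not confirm. `encode_with_bos` is a hypothetical helper name, not part of transformers; it relies only on the tokenizer's `encode` method and `bos_token_id` attribute.

```python
from typing import List, Optional


def encode_with_bos(tokenizer, text: str, bos_token_id: Optional[int] = None) -> List[int]:
    """Encode `text` and make sure the result starts with a BOS id.

    `tokenizer` is any object exposing `encode(text, add_special_tokens=...)`
    and, optionally, a `bos_token_id` attribute, as Hugging Face tokenizers do.
    `encode_with_bos` itself is a hypothetical helper, not a transformers API.
    """
    ids = list(tokenizer.encode(text, add_special_tokens=True))

    # Fall back to the tokenizer's own BOS id when none is passed explicitly.
    if bos_token_id is None:
        bos_token_id = getattr(tokenizer, "bos_token_id", None)
    if bos_token_id is None:
        raise ValueError("no BOS token id available; pass bos_token_id explicitly")

    # Prepend only if the tokenizer did not already add BOS itself.
    if not ids or ids[0] != bos_token_id:
        ids = [bos_token_id] + ids
    return ids
```

With the tokenizer loaded as in the snippet above, `encode_with_bos(tokenizer, '', bos_token_id=1)` would return a list beginning with 1. Whether the model was actually trained with a leading BOS is a separate question that only the model authors can answer.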
