Broken tokenizer
#77, opened by anferico
@patrickvonplaten I think you broke the tokenizer by deleting "tokenizer.model". Now this throws an error:
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
File ".../lib/python3.11/site-packages/transformers/models/llama/tokenization_llama.py", line 201, in get_spm_processor
with open(self.vocab_file, "rb") as f:
^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.PathLike object, not NoneType
Glad to hear I'm not the only one bugged by this error lol. As a workaround, here is a more complete code snippet that's easy to paste and run:
import torch  # needed for torch.bfloat16 below
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load with a specific commit hash (the one before deleting the `tokenizer.model`)
tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    revision="f67d0f47df7707eddf3fb61000e3e8713074f45c",
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    revision="f67d0f47df7707eddf3fb61000e3e8713074f45c",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
Sorry about that - I confused an HF tokenizer file with the mistral-common one. Reverted it, so it should work again :-)
patrickvonplaten changed discussion status to closed
@patrickvonplaten seems that I am still facing the same error
UPD: it seems this was a caching issue on my side: tokenizer.model had not been downloaded because of the default allow rules for hf_snapshot_download. Solved now.
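For anyone hitting the same thing, here is a minimal sketch of how an allow list can silently skip tokenizer.model. It only approximates the pattern matching with Python's fnmatch. The file listing, the patterns, and the helper name `filtered_out` are all made up for illustration and are not the actual rules used by hf_snapshot_download:

```python
from fnmatch import fnmatch


def filtered_out(repo_files, allow_patterns, required=("tokenizer.model",)):
    """Return required files that are in the repo but match no allow pattern."""
    allowed = {
        name
        for name in repo_files
        if any(fnmatch(name, pattern) for pattern in allow_patterns)
    }
    return [name for name in required if name in repo_files and name not in allowed]


# Hypothetical file listing and allow patterns, for illustration only.
repo_files = [
    "config.json",
    "tokenizer.json",
    "tokenizer.model",
    "model-00001-of-00003.safetensors",
]
allow_patterns = ["*.json", "*.safetensors"]

print(filtered_out(repo_files, allow_patterns))  # ['tokenizer.model']
```

If the list is non-empty, adding the missing file name (or a matching pattern) to the allow list and re-downloading should bring the file into the local snapshot.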