Broken tokenizer

#77
by anferico - opened

@patrickvonplaten I think you broke the tokenizer by deleting "tokenizer.model". Now this throws an error:

 tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
File ".../lib/python3.11/site-packages/transformers/models/llama/tokenization_llama.py", line 201, in get_spm_processor
  with open(self.vocab_file, "rb") as f:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.PathLike object, not NoneType

As a temporary fix, I pinned the tokenizer revision to f67d0f47df7707eddf3fb61000e3e8713074f45c
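For anyone puzzled by the message itself: the traceback above ends in a plain open() call on self.vocab_file, and that TypeError is what Python raises when open() is handed None instead of a path. A minimal sketch of just that failure mode, with no transformers involved (open_vocab is a hypothetical stand-in, not the library's function):

```python
def open_vocab(vocab_file):
    # Hypothetical stand-in for the failing line in the traceback:
    # it opens whatever path it is given, with no None check.
    with open(vocab_file, "rb") as f:
        return f.read()


try:
    # Simulates the case where no "tokenizer.model" path was resolved.
    open_vocab(None)
except TypeError as exc:
    message = str(exc)
    print(message)
```

So the error most likely means the tokenizer never found a tokenizer.model file to point vocab_file at, rather than that the file was corrupt.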

Glad to hear I'm not the only one bugged by this error lol. Here is a more complete code snippet that is easier to paste and run:

import torch  # needed for torch.bfloat16 below
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load with a specific commit hash (the one before deleting the `tokenizer.model`)
tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    revision="f67d0f47df7707eddf3fb61000e3e8713074f45c"
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    revision="f67d0f47df7707eddf3fb61000e3e8713074f45c",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
Mistral AI_ org

Sorry about that - I confused a HF tokenizer file with the mistral-common one. Reverted it - should work again :-)

patrickvonplaten changed discussion status to closed

@patrickvonplaten seems that I am still facing the same error

UPD: it seems this was a caching issue on my side - tokenizer.model wasn't downloaded because of the default allow rules for hf_snapshot_download. Solved now.
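In case it helps someone hitting the same thing, here is a rough sketch of how an allow list can silently skip tokenizer.model. It assumes the allow rules are glob-style patterns (as allow_patterns in huggingface_hub's snapshot_download are, as far as I know) and imitates the filtering with the standard library only. The file list and patterns below are made up for illustration, not the actual defaults from my setup:

```python
from fnmatch import fnmatch


def files_kept(repo_files, allow_patterns):
    # Keep a file only if it matches at least one glob-style allow pattern.
    return [
        name
        for name in repo_files
        if any(fnmatch(name, pattern) for pattern in allow_patterns)
    ]


# Illustrative file listing (assumption, not read from the real repo).
repo_files = [
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "tokenizer.model",
    "model-00001-of-00003.safetensors",
]

# Hypothetical allow rules that cover JSON and weight files only.
allow_patterns = ["*.json", "*.safetensors"]

kept = files_kept(repo_files, allow_patterns)
missing = [name for name in repo_files if name not in kept]
print(missing)
```

With patterns like these, tokenizer.model never reaches the local cache, so adding a pattern that matches it (for example "tokenizer.model" or "*.model") and re-downloading should be enough.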
