Fix generation with latest transformers

#1

Purpose

  • Fix model generation

Related Issues

Changes

  • The latest transformers release removed support for past_key_values.get_max_length() in favor of past_key_values.get_max_cache_shape() (see the compatibility sketch after this list)
  • Add support for decoding tensors of token ids, which is the typical output of generation (see the second sketch after this list)
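
The cache API change can be bridged with a small compatibility helper. This is only a minimal sketch, and max_cache_length is a hypothetical name rather than code from this PR:

def max_cache_length(past_key_values):
    # Recent transformers releases expose get_max_cache_shape() on cache objects
    # such as DynamicCache and no longer provide get_max_length().
    if hasattr(past_key_values, "get_max_cache_shape"):
        return past_key_values.get_max_cache_shape()
    # Fall back to the old API on older transformers versions.
    # Both methods return None for caches without a fixed maximum length.
    return past_key_values.get_max_length()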
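
For the decoding change, a rough sketch is shown below, assuming the previous decode path only accepted plain Python lists of ids; decode_token_ids is a hypothetical helper, not the exact code in this PR:

import torch

def decode_token_ids(tokenizer, ids):
    # model.generate() typically returns a 2D tensor of shape (batch, seq_len);
    # convert tensors to plain lists before handing them to the tokenizer.
    if isinstance(ids, torch.Tensor):
        ids = ids.tolist()
    if ids and isinstance(ids[0], list):
        return tokenizer.batch_decode(ids)
    return tokenizer.decode(ids)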

Testing

from transformers import AutoModelForCausalLM, AutoTokenizer

# Select model and load it.
MODEL_ID = "moonshotai/Moonlight-16B-A3B"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Confirm generations of the model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")