Fix generation with latest transformers
#1
by kylesayrs - opened
Purpose
- Fix model generation
Related Issues
Changes
- The latest transformers release removed support for past_key_values.get_max_length() in favor of past_key_values.get_max_cache_shape()
- Add support for decoding tensors of ids, which is the typical output format of generation (see the sketch after this list)
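A minimal sketch of how both changes might look. The helper names below are hypothetical and not this PR's actual implementation; only the two transformers cache methods (get_max_length() and get_max_cache_shape()) come from the change description above.

import torch

def get_max_cache_len(past_key_values):
    # Newer transformers releases expose get_max_cache_shape();
    # older releases only provide get_max_length().
    if hasattr(past_key_values, "get_max_cache_shape"):
        return past_key_values.get_max_cache_shape()
    return past_key_values.get_max_length()

def decode_ids(tokenizer, ids):
    # generate() typically returns a torch.Tensor of token ids;
    # accept tensors as well as plain lists before decoding.
    if isinstance(ids, torch.Tensor):
        ids = ids.tolist()
    return tokenizer.decode(ids)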
Testing
from transformers import AutoModelForCausalLM, AutoTokenizer

# Select model and load it.
MODEL_ID = "moonshotai/Moonlight-16B-A3B"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Confirm generations of the model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")
kylesayrs changed pull request title from "Fix DynamicCache with latest transformers" to "Fix generation with latest transformers"
kylesayrs changed pull request status to open