Error when using this model for inference in Google Colab
Load model
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

model_id = 'mobiuslabsgmbh/Llama-2-7b-hf-4bit_g64-HQQ'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)
Generate
prompt = "Capital of India"
inputs = tokenizer(prompt, return_tensors="pt")
generate_ids = model.generate(inputs.input_ids, max_length=30)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
This is the error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
You forgot to move the tokenized input to the GPU. from_quantized loads the model weights onto cuda:0, while the tokenizer returns CPU tensors, so generate ends up mixing devices.
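You can confirm the mismatch by printing both devices first (a minimal check, assuming the HQQ wrapper exposes its parameters like a standard torch module):

# Compare where the model weights and the inputs live
print(next(model.parameters()).device)  # typically cuda:0 after from_quantized
print(inputs.input_ids.device)          # cpu, since the tokenizer returns CPU tensors

The fix is to move the inputs to cuda before calling generate: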
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

model_id = 'mobiuslabsgmbh/Llama-2-7b-hf-4bit_g64-HQQ'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)
prompt = "Capital of India"
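# Move the tokenized inputs to the GPU so they match the model's device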
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=30)[0]
print(tokenizer.decode(generate_ids))
Output:
<s> Capital of India, Delhi is a city of contrasts. surely, the city is a blend of the old and the new. The
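Side note: if you want the snippet to work regardless of where the model was loaded, you can move the inputs to the model's device instead of hard-coding 'cuda' (a small sketch, assuming the loaded model exposes the usual transformers device attribute):

# Send the inputs to whatever device the model weights are on
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
generate_ids = model.generate(inputs.input_ids, max_length=30)[0]
print(tokenizer.decode(generate_ids, skip_special_tokens=True))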
Thank you so much, it works now!!