Error in using this model for inference in Google Colab

#1
by sudhir2016 - opened

Load model
model_id = 'mobiuslabsgmbh/Llama-2-7b-hf-4bit_g64-HQQ'

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)

Generate
prompt = "Capital of India"
inputs = tokenizer(prompt, return_tensors="pt")
generate_ids = model.generate(inputs.input_ids, max_length=30)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

This is the error.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
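One way to confirm which tensor is still on the CPU is to print the devices before calling generate. A minimal debug sketch reusing the variables from the snippet above (not part of the original post):

print(next(model.parameters()).device)  # typically cuda:0 for the quantized model
print(inputs.input_ids.device)          # cpu here, which triggers the RuntimeError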

sudhir2016 changed discussion title from Error in using this model for inference on in Google Colab to Error in using this model for inference in Google Colab
Mobius Labs GmbH org

You forgot to put the tokenized input on the GPU:

model_id = 'mobiuslabsgmbh/Llama-2-7b-hf-4bit_g64-HQQ'

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)

prompt = "Capital of India"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=30)[0]
print(tokenizer.decode(generate_ids))

Output:

<s> Capital of India, Delhi is a city of contrasts. surely, the city is a blend of the old and the new. The

Thank you so much, it works now!!

sudhir2016 changed discussion status to closed

It runs on a Colab T4:

model_id = 'mobiuslabsgmbh/Llama-2-7b-hf-4bit_g64-HQQ'

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)

Move the model and its rotary embedding to the GPU

model.to('cuda')
model.model.rotary_emb.to('cuda') # Explicitly move rotary_emb to GPU

prompt = "Capital of India"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=30)[0]
print(tokenizer.decode(generate_ids))

/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret HF_TOKEN does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
Fetching 8 files: 100% 8/8 [00:00<00:00, 141.48it/s]
/usr/local/lib/python3.11/dist-packages/hqq/models/base.py:237: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
return torch.load(cls.get_weight_file(save_dir), map_location=map_location)
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 32/32 [00:00<00:00, 577.45it/s]
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 32/32 [00:00<00:00, 1108.60it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Capital of India, Delhi is a city of contrasts. surely, the city is a blend of the old and the new. The
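The attention-mask and pad-token warnings above can be silenced by passing both explicitly to generate. A minimal sketch reusing the prompt, model, and tokenizer from the snippet above (not from the model card):

inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,  # tell generate which tokens are real
    pad_token_id=tokenizer.eos_token_id,   # Llama-2 has no pad token, reuse EOS
    max_length=30,
)[0]
print(tokenizer.decode(generate_ids))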

This one did not run:

import torch, os

cache_path = '.'
compute_dtype = torch.float16
device = 'cuda:0'
###################################################################################################
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "meta-llama/Llama-2-7b-hf"

#Basic
#Linear layers will use the same quantization config
quant_config = HqqConfig(nbits=8, group_size=64, quant_zero=False, quant_scale=False, axis=0) #axis=0 is used by default

#Each type of linear layer (referred to as linear tag) will use different quantization parameters
q4_config = {'nbits':4, 'group_size':64, 'quant_zero':False, 'quant_scale':False}
q3_config = {'nbits':3, 'group_size':32, 'quant_zero':False, 'quant_scale':False}

quant_config = HqqConfig(dynamic_config={
    'self_attn.q_proj': q4_config,
    'self_attn.k_proj': q4_config,
    'self_attn.v_proj': q4_config,
    'self_attn.o_proj': q4_config,
    'mlp.gate_proj': q3_config,
    'mlp.up_proj':   q3_config,
    'mlp.down_proj': q3_config,
})

#####################################################################################################

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    cache_dir=cache_path,
    torch_dtype=compute_dtype,
    device_map="auto", #device
    low_cpu_mem_usage=True,
    quantization_config=quant_config,
)

#Set backend
from hqq.core.quantize import *
from hqq.core.utils import cleanup
HQQLinear.set_backend(HQQBackend.ATEN)

#Forward
with torch.no_grad():
    out = model(torch.zeros([1, 1024], device=device, dtype=torch.int32)).logits

print(out)
del out
cleanup()
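Note that this snippet never creates a tokenizer; the generate calls further down reuse the one loaded earlier in the thread. To run it standalone, something along these lines would also be needed (an assumption, not part of the original post):

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_path)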

Without do_sample:

prompt = "Who is Einstein?"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=300)[0]
print(tokenizer.decode(generate_ids))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:None for open-end generation.
Who is Einstein?
surely you know Einstein, the famous scientist who discovered the theory of relativity.
Einstein is a scientist who discovered the theory of relativity.
Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scien
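Greedy decoding often loops like this. Besides turning on sampling (tried next), a repetition penalty or an n-gram ban can curb it; a minimal sketch with illustrative, untuned values:

generate_ids = model.generate(
    inputs.input_ids,
    max_length=300,
    repetition_penalty=1.2,   # illustrative value, not tuned
    no_repeat_ngram_size=3,   # forbid repeating any 3-gram
)[0]
print(tokenizer.decode(generate_ids))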

With do_sample=True:

prompt = "Who is Einstein?"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=300, do_sample=True)[0] # do_sample=True enables sampling
print(tokenizer.decode(generate_ids))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:None for open-end generation.
Who is Einstein?
β˜‰ ☿ Lunar eclipse ♃♑ Uranus-Ceres conjunction β˜‰ ☿
Birthday: 1879, March 14th
Birthplace: Ulm, Germany
Father : Hermann Einstein (1847–1902), engineer
Mother : Pauline Koch (1858–1920), homemaker
The most famous of all physicists, he was a revolutionary thinker who has profoundly influenced our understanding of nature, of ourselves, and of the world in which we live. He devised the general theory of relativity, one of the two pillars of modern physics (alongside quantum theory). He is widely recognized as one of the greatest and most influential scientists of all times. Einstein was a theoretical physicist and one of the key pioneers of our current understanding of both general relativity and quantum mechanics. He received the 1921 Nobel Prize for Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect". As a child, Einstein was considered slow in schoolβ€”not because he was slow-witted, but because his own individual style of learning meant that he did not make the same progress at school as did his friends. By the age
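Sampled outputs like this change on every run; for reproducible comparisons a seed can be set first (a sketch, the seed value is arbitrary):

from transformers import set_seed
set_seed(42)  # fixes the sampling RNG so repeated runs give the same text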

mobiuslabsgmbh/Llama-2-7b-chat-hf_2bitgs8_hqq

model_id = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_2bitgs8_hqq'

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)

Move the model and its rotary embedding to the GPU

model.to('cuda')
model.model.rotary_emb.to('cuda') # Explicitly move rotary_emb to GPU

prompt = "Capital of India"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=30)[0]
print(tokenizer.decode(generate_ids))

/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret HF_TOKEN does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
tokenizer_config.json: 100% 1.76k/1.76k [00:00<00:00, 48.7kB/s]
tokenizer.model: 100% 500k/500k [00:00<00:00, 6.05MB/s]
tokenizer.json: 100% 1.84M/1.84M [00:00<00:00, 5.52MB/s]
special_tokens_map.json: 100% 414/414 [00:00<00:00, 15.5kB/s]
Fetching 9 files: 100% 9/9 [01:30<00:00, 22.95s/it]
tokenizer.model: 100% 500k/500k [00:00<00:00, 4.27MB/s]
.gitattributes: 100% 1.57k/1.57k [00:00<00:00, 22.0kB/s]
config.json: 100% 694/694 [00:00<00:00, 6.03kB/s]
special_tokens_map.json: 100% 414/414 [00:00<00:00, 4.88kB/s]
tokenizer.json: 100% 1.84M/1.84M [00:00<00:00, 7.42MB/s]
README.md: 100% 4.94k/4.94k [00:00<00:00, 56.3kB/s]
qmodel.pt: 100% 3.81G/3.81G [01:30<00:00, 42.6MB/s]
adapter_v0.1.lora: 100% 93.3M/93.3M [00:03<00:00, 38.0MB/s]
tokenizer_config.json: 100% 1.76k/1.76k [00:00<00:00, 35.8kB/s]
/usr/local/lib/python3.11/dist-packages/hqq/models/base.py:237: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
return torch.load(cls.get_weight_file(save_dir), map_location=map_location)
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 32/32 [00:00<00:00, 903.22it/s]
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 32/32 [00:03<00:00, 8.76it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Capital of India, New Delhi, India

The President of India,

Sub: Request for Pardon of Death Sentence
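This checkpoint is the chat-tuned variant, so it tends to behave better when the prompt is wrapped in the Llama-2 chat format instead of being passed raw. A minimal sketch using the tokenizer's chat template (assuming this tokenizer ships one; otherwise the [INST] ... [/INST] wrapper can be written by hand):

messages = [{"role": "user", "content": "What is the capital of India?"}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to('cuda')
generate_ids = model.generate(input_ids, max_length=60)[0]
print(tokenizer.decode(generate_ids, skip_special_tokens=True))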

prompt = "Who is Napoleon Bonaparte?"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=300, do_sample=True)[0]
print(tokenizer.decode(generate_ids))

prompt = "Who is Napoleon Bonaparte?"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=300, do_sample=True)[0]
print(tokenizer.decode(generate_ids))
