Error when using this model for inference in Google Colab
Load model
model_id = 'mobiuslabsgmbh/Llama-2-7b-hf-4bit_g64-HQQ'
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)
Generate
prompt = "Capital of India"
inputs = tokenizer(prompt, return_tensors="pt")
generate_ids = model.generate(inputs.input_ids, max_length=30)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
This is the error.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
You forgot to move the tokenized inputs to the GPU:
model_id = 'mobiuslabsgmbh/Llama-2-7b-hf-4bit_g64-HQQ'
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)
prompt = "Capital of India"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=30)[0]
print(tokenizer.decode(generate_ids))
Output:
<s> Capital of India, Delhi is a city of contrasts. surely, the city is a blend of the old and the new. The
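By the way, a slightly more robust pattern is to read the device from the model's parameters instead of hard-coding 'cuda'. A small sketch of that variant:

# Device-agnostic variant: send the inputs to whichever device the model weights live on
device = next(model.parameters()).device
inputs = tokenizer(prompt, return_tensors="pt").to(device)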
Thank you so much, it works now!
It runs on a Colab T4:
model_id = 'mobiuslabsgmbh/Llama-2-7b-hf-4bit_g64-HQQ'
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)
# Move the model and its rotary embedding to the GPU
model.to('cuda')
model.model.rotary_emb.to('cuda') # Explicitly move rotary_emb to GPU
prompt = "Capital of India"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=30)[0]
print(tokenizer.decode(generate_ids))
/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
Fetching 8 files: 100% 8/8 [00:00<00:00, 141.48it/s]
/usr/local/lib/python3.11/dist-packages/hqq/models/base.py:237: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  return torch.load(cls.get_weight_file(save_dir), map_location=map_location)
100%|██████████| 32/32 [00:00<00:00, 577.45it/s]
100%|██████████| 32/32 [00:00<00:00, 1108.60it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.

Capital of India, Delhi is a city of contrasts. surely, the city is a blend of the old and the new. The
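Side note: the attention-mask warnings above are harmless for a single prompt, but you can silence them by passing the mask and an explicit pad token yourself. A minimal sketch, assuming the same model and tokenizer as above:

# Pass the attention mask and a pad token explicitly to avoid the generate() warnings
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids,
                              attention_mask=inputs.attention_mask,
                              pad_token_id=tokenizer.eos_token_id,
                              max_length=30)[0]
print(tokenizer.decode(generate_ids))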
This one does not run:
import torch, os
cache_path = '.'
compute_dtype = torch.float16
device = 'cuda:0'
###################################################################################################
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig
model_id = "meta-llama/Llama-2-7b-hf"
#Basic
#Linear layers will use the same quantization config
quant_config = HqqConfig(nbits=8, group_size=64, quant_zero=False, quant_scale=False, axis=0) #axis=0 is used by default
#Each type of linear layer (referred to as linear tag) will use different quantization parameters
q4_config = {'nbits':4, 'group_size':64, 'quant_zero':False, 'quant_scale':False}
q3_config = {'nbits':3, 'group_size':32, 'quant_zero':False, 'quant_scale':False}
#Note: this second assignment overwrites the basic quant_config above
quant_config = HqqConfig(dynamic_config={
    'self_attn.q_proj': q4_config,
    'self_attn.k_proj': q4_config,
    'self_attn.v_proj': q4_config,
    'self_attn.o_proj': q4_config,
    'mlp.gate_proj': q3_config,
    'mlp.up_proj':   q3_config,
    'mlp.down_proj': q3_config,
})
#####################################################################################################
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    cache_dir=cache_path,
    torch_dtype=compute_dtype,
    device_map="auto", #device
    low_cpu_mem_usage=True,
    quantization_config=quant_config,
)
#Set backend
from hqq.core.quantize import *
from hqq.core.utils import cleanup
HQQLinear.set_backend(HQQBackend.ATEN)
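# Note (assumption): if the ATEN backend is not compiled in your hqq install,
# the pure-PyTorch backend should work as a slower fallback:
# HQQLinear.set_backend(HQQBackend.PYTORCH)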
#Forward
with torch.no_grad():
    out = model(torch.zeros([1, 1024], device=device, dtype=torch.int32)).logits
print(out)
del out
cleanup()
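One thing worth noting about this snippet: it imports AutoTokenizer but never instantiates it, while the generation calls below use a tokenizer object. Assuming the standard tokenizer for this checkpoint, something like the following is missing:

# Missing from the snippet above: the generation cells below assume this exists
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_path)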
Without do_sample (greedy decoding):
prompt = "Who is Einstein?"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=300)[0]
print(tokenizer.decode(generate_ids))
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:None for open-end generation.

Who is Einstein?
surely you know Einstein, the famous scientist who discovered the theory of relativity.
Einstein is a scientist who discovered the theory of relativity.
Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scien
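The loop above is classic greedy-decoding repetition. Besides do_sample=True (tried next), a repetition penalty often helps; here is a sketch with illustrative, untuned values:

# Discourage the degenerate repetition seen above (values are illustrative)
generate_ids = model.generate(inputs.input_ids,
                              max_length=300,
                              do_sample=True,
                              temperature=0.7,
                              top_p=0.9,
                              repetition_penalty=1.3)[0]
print(tokenizer.decode(generate_ids))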
With do_sample=True:
prompt = "Who is Einstein?"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=300, do_sample=True)[0] # Add do_sample=True here
print(tokenizer.decode(generate_ids))
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:None for open-end generation.

Who is Einstein?
Lunar eclipse – Uranus-Ceres conjunction
Birthday: 1879, March 14th
Birthplace: Ulm, Germany
Father: Hermann Einstein (1847–1902), engineer
Mother: Pauline Koch (1858–1920), homemaker
The most famous of all physicists, he was a revolutionary thinker who has profoundly influenced our understanding of nature, of ourselves, and of the world in which we live. He devised the general theory of relativity, one of the two pillars of modern physics (alongside quantum theory). He is widely recognized as one of the greatest and most influential scientists of all times. Einstein was a theoretical physicist and one of the key pioneers of our current understanding of both general relativity and quantum mechanics. He received the 1921 Nobel Prize for Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect". As a child, Einstein was considered slow in school – not because he was slow-witted, but because his own individual style of learning meant that he did not make the same progress at school as did his friends. By the age
mobiuslabsgmbh/Llama-2-7b-chat-hf_2bitgs8_hqq
model_id = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_2bitgs8_hqq'
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)
# Move the model and its rotary embedding to the GPU
model.to('cuda')
model.model.rotary_emb.to('cuda') # Explicitly move rotary_emb to GPU
prompt = "Capital of India"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=30)[0]
print(tokenizer.decode(generate_ids))
/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
tokenizer_config.json: 100% 1.76k/1.76k [00:00<00:00, 48.7kB/s]
tokenizer.model: 100% 500k/500k [00:00<00:00, 6.05MB/s]
tokenizer.json: 100% 1.84M/1.84M [00:00<00:00, 5.52MB/s]
special_tokens_map.json: 100% 414/414 [00:00<00:00, 15.5kB/s]
Fetching 9 files: 100% 9/9 [01:30<00:00, 22.95s/it]
tokenizer.model: 100% 500k/500k [00:00<00:00, 4.27MB/s]
.gitattributes: 100% 1.57k/1.57k [00:00<00:00, 22.0kB/s]
config.json: 100% 694/694 [00:00<00:00, 6.03kB/s]
special_tokens_map.json: 100% 414/414 [00:00<00:00, 4.88kB/s]
tokenizer.json: 100% 1.84M/1.84M [00:00<00:00, 7.42MB/s]
README.md: 100% 4.94k/4.94k [00:00<00:00, 56.3kB/s]
qmodel.pt: 100% 3.81G/3.81G [01:30<00:00, 42.6MB/s]
adapter_v0.1.lora: 100% 93.3M/93.3M [00:03<00:00, 38.0MB/s]
tokenizer_config.json: 100% 1.76k/1.76k [00:00<00:00, 35.8kB/s]
/usr/local/lib/python3.11/dist-packages/hqq/models/base.py:237: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  return torch.load(cls.get_weight_file(save_dir), map_location=map_location)
100%|██████████| 32/32 [00:00<00:00, 903.22it/s]
100%|██████████| 32/32 [00:03<00:00, 8.76it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.

Capital of India, New Delhi, India
The President of India,
Sub: Request for Pardon of Death Sentence
prompt = "Who is Napoleon Bonaparte?"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=300, do_sample=True)[0]
print(tokenizer.decode(generate_ids))
prompt = "Who is Napoleon Bonaparte?"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=300, do_sample=True)[0]
print(tokenizer.decode(generate_ids))
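One last suggestion for the chat checkpoint: Llama-2 chat models were fine-tuned on the [INST] ... [/INST] prompt format, so wrapping the prompt usually gives more coherent answers than a bare string (the letter-style output above is typical of a raw prompt). A sketch:

# Llama-2 chat models expect the [INST] ... [/INST] prompt format
prompt = "[INST] Who is Napoleon Bonaparte? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=300, do_sample=True)[0]
print(tokenizer.decode(generate_ids))

Recent transformers versions can also build this format for you via tokenizer.apply_chat_template.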