Error in using this model for inference in Google Colab

#1
by sudhir2016 - opened

Load model
model_id = 'mobiuslabsgmbh/Llama-2-7b-hf-4bit_g64-HQQ'

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)

Generate
prompt = "Capital of India"
inputs = tokenizer(prompt, return_tensors="pt")
generate_ids = model.generate(inputs.input_ids, max_length=30)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

This is the error.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
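One way to confirm which tensor is still on the CPU is to print the devices before calling generate. A minimal debug sketch reusing the variables from the snippet above (not part of the original post):

print(next(model.parameters()).device)  # typically cuda:0 for the quantized model
print(inputs.input_ids.device)          # cpu here, which triggers the RuntimeError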

sudhir2016 changed discussion title from Error in using this model for inference on in Google Colab to Error in using this model for inference in Google Colab
Mobius Labs GmbH org

You forgot to put the tokenized input on the GPU:

model_id = 'mobiuslabsgmbh/Llama-2-7b-hf-4bit_g64-HQQ'

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)

prompt = "Capital of India"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=30)[0]
print(tokenizer.decode(generate_ids))

Output:

<s> Capital of India, Delhi is a city of contrasts. surely, the city is a blend of the old and the new. The

Thank you so much, it works now!!

sudhir2016 changed discussion status to closed

It runs on a Colab T4:

model_id = 'mobiuslabsgmbh/Llama-2-7b-hf-4bit_g64-HQQ'

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)

Move the model and its rotary embedding to the GPU

model.to('cuda')
model.model.rotary_emb.to('cuda') # Explicitly move rotary_emb to GPU

prompt = "Capital of India"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=30)[0]
print(tokenizer.decode(generate_ids))

/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret HF_TOKEN does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
Fetching 8 files: 100% 8/8 [00:00<00:00, 141.48it/s]
/usr/local/lib/python3.11/dist-packages/hqq/models/base.py:237: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
return torch.load(cls.get_weight_file(save_dir), map_location=map_location)
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 32/32 [00:00<00:00, 577.45it/s]
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 32/32 [00:00<00:00, 1108.60it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Capital of India, Delhi is a city of contrasts. surely, the city is a blend of the old and the new. The
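The attention-mask and pad-token warnings above can be silenced by passing both explicitly to generate. A minimal sketch reusing the prompt, model, and tokenizer from the snippet above (not from the model card):

inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,  # tell generate which tokens are real
    pad_token_id=tokenizer.eos_token_id,   # Llama-2 has no pad token, reuse EOS
    max_length=30,
)[0]
print(tokenizer.decode(generate_ids))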

This one did not run:

import torch, os

cache_path = '.'
compute_dtype = torch.float16
device = 'cuda:0'
###################################################################################################
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "meta-llama/Llama-2-7b-hf"

#Basic
#Linear layers will use the same quantization config
quant_config = HqqConfig(nbits=8, group_size=64, quant_zero=False, quant_scale=False, axis=0) #axis=0 is used by default

#Each type of linear layer (referred to as linear tag) will use different quantization parameters
q4_config = {'nbits':4, 'group_size':64, 'quant_zero':False, 'quant_scale':False}
q3_config = {'nbits':3, 'group_size':32, 'quant_zero':False, 'quant_scale':False}

quant_config = HqqConfig(dynamic_config={
    'self_attn.q_proj': q4_config,
    'self_attn.k_proj': q4_config,
    'self_attn.v_proj': q4_config,
    'self_attn.o_proj': q4_config,
    'mlp.gate_proj': q3_config,
    'mlp.up_proj':   q3_config,
    'mlp.down_proj': q3_config,
})

#####################################################################################################

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    cache_dir=cache_path,
    torch_dtype=compute_dtype,
    device_map="auto", #device
    low_cpu_mem_usage=True,
    quantization_config=quant_config,
)

#Set backend
from hqq.core.quantize import *
from hqq.core.utils import cleanup
HQQLinear.set_backend(HQQBackend.ATEN)

#Forward
with torch.no_grad():
    out = model(torch.zeros([1, 1024], device=device, dtype=torch.int32)).logits

print(out)
del out
cleanup()
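Note that this snippet never creates a tokenizer; the generate calls further down reuse the one loaded earlier in the thread. To run it standalone, something along these lines would also be needed (an assumption, not part of the original post):

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_path)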

Without do_sample:

prompt = "Who is Einstein?"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=300)[0]
print(tokenizer.decode(generate_ids))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:None for open-end generation.
Who is Einstein?
surely you know Einstein, the famous scientist who discovered the theory of relativity.
Einstein is a scientist who discovered the theory of relativity.
Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scientist who discovered the theory of relativity. Einstein is a scien
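Greedy decoding often loops like this. Besides turning on sampling (tried next), a repetition penalty or an n-gram ban can curb it; a minimal sketch with illustrative, untuned values:

generate_ids = model.generate(
    inputs.input_ids,
    max_length=300,
    repetition_penalty=1.2,   # illustrative value, not tuned
    no_repeat_ngram_size=3,   # forbid repeating any 3-gram
)[0]
print(tokenizer.decode(generate_ids))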

With do_sample=True:

prompt = "Who is Einstein?"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=300, do_sample=True)[0] # do_sample=True enables sampling
print(tokenizer.decode(generate_ids))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:None for open-end generation.
Who is Einstein?
β˜‰ ☿ Lunar eclipse ♃♑ Uranus-Ceres conjunction β˜‰ ☿
Birthday: 1879, March 14th
Birthplace: Ulm, Germany
Father : Hermann Einstein (1847–1902), engineer
Mother : Pauline Koch (1858–1920), homemaker
The most famous of all physicists, he was a revolutionary thinker who has profoundly influenced our understanding of nature, of ourselves, and of the world in which we live. He devised the general theory of relativity, one of the two pillars of modern physics (alongside quantum theory). He is widely recognized as one of the greatest and most influential scientists of all times. Einstein was a theoretical physicist and one of the key pioneers of our current understanding of both general relativity and quantum mechanics. He received the 1921 Nobel Prize for Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect". As a child, Einstein was considered slow in schoolβ€”not because he was slow-witted, but because his own individual style of learning meant that he did not make the same progress at school as did his friends. By the age
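Sampled outputs like this change on every run; for reproducible comparisons a seed can be set first (a sketch, the seed value is arbitrary):

from transformers import set_seed
set_seed(42)  # fixes the sampling RNG so repeated runs give the same text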

mobiuslabsgmbh/Llama-2-7b-chat-hf_2bitgs8_hqq

model_id = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_2bitgs8_hqq'

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)

Move the model and its rotary embedding to the GPU

model.to('cuda')
model.model.rotary_emb.to('cuda') # Explicitly move rotary_emb to GPU

prompt = "Capital of India"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=30)[0]
print(tokenizer.decode(generate_ids))

/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret HF_TOKEN does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
tokenizer_config.json: 100% 1.76k/1.76k [00:00<00:00, 48.7kB/s]
tokenizer.model: 100% 500k/500k [00:00<00:00, 6.05MB/s]
tokenizer.json: 100% 1.84M/1.84M [00:00<00:00, 5.52MB/s]
special_tokens_map.json: 100% 414/414 [00:00<00:00, 15.5kB/s]
Fetching 9 files: 100% 9/9 [01:30<00:00, 22.95s/it]
tokenizer.model: 100% 500k/500k [00:00<00:00, 4.27MB/s]
.gitattributes: 100% 1.57k/1.57k [00:00<00:00, 22.0kB/s]
config.json: 100% 694/694 [00:00<00:00, 6.03kB/s]
special_tokens_map.json: 100% 414/414 [00:00<00:00, 4.88kB/s]
tokenizer.json: 100% 1.84M/1.84M [00:00<00:00, 7.42MB/s]
README.md: 100% 4.94k/4.94k [00:00<00:00, 56.3kB/s]
qmodel.pt: 100% 3.81G/3.81G [01:30<00:00, 42.6MB/s]
adapter_v0.1.lora: 100% 93.3M/93.3M [00:03<00:00, 38.0MB/s]
tokenizer_config.json: 100% 1.76k/1.76k [00:00<00:00, 35.8kB/s]
/usr/local/lib/python3.11/dist-packages/hqq/models/base.py:237: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
return torch.load(cls.get_weight_file(save_dir), map_location=map_location)
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 32/32 [00:00<00:00, 903.22it/s]
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 32/32 [00:03<00:00, 8.76it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Capital of India, New Delhi, India

The President of India,

Sub: Request for Pardon of Death Sentence
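This checkpoint is the chat-tuned variant, so it tends to behave better when the prompt is wrapped in the Llama-2 chat format instead of being passed raw. A minimal sketch using the tokenizer's chat template (assuming this tokenizer ships one; otherwise the [INST] ... [/INST] wrapper can be written by hand):

messages = [{"role": "user", "content": "What is the capital of India?"}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to('cuda')
generate_ids = model.generate(input_ids, max_length=60)[0]
print(tokenizer.decode(generate_ids, skip_special_tokens=True))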

prompt = "Who is Napoleon Bonaparte?"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=300, do_sample=True)[0]
print(tokenizer.decode(generate_ids))

prompt = "Who is Napoleon Bonaparte?"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
generate_ids = model.generate(inputs.input_ids, max_length=300, do_sample=True)[0]
print(tokenizer.decode(generate_ids))
