How to run CPU mode in AWQ?
I tried to test the AWQ model in CPU mode by following the Quickstart guide, but the run failed.
How can I run AWQ in CPU mode?
I get the following error: NameError: name 'flash_attn_func' is not defined
Could you help me with this?
Thank you.
######################################
env
autoawq 0.2.7.post3
transformers 4.46.3
intel_extension_for_pytorch 2.5.0
######################################
test code
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from awq import AutoAWQForCausalLM
from awq.utils.utils import get_best_device

device = get_best_device()
model_name = "LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct-AWQ"
model = AutoAWQForCausalLM.from_quantized(
    model_name,
    use_ipex=True,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Choose your prompt
prompt = "Explain how wonderful you are"  # English example
prompt = "스스로를 자랑해 봐"  # Korean example
messages = [
    {"role": "system", "content": "You are EXAONE model from LG AI Research, a helpful assistant."},
    {"role": "user", "content": prompt}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
)
output = model.generate(
    input_ids.to("cpu"),
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=128,
    do_sample=False,
)
print(tokenizer.decode(output[0]))
Hello @joyUniverse, we apologize for the delayed response.
Let me guide you through a few points:
- When using CPU, you need to remove device_map="auto" from your code.
- There's no need to move input_ids to cpu, as it's already in CPU memory (RAM).
After implementing these changes, the original error might be resolved, though other unexpected issues may arise.
If you encounter any new errors, please share them with us so we can help resolve the issues more efficiently.
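Putting the two changes together, the load-and-generate step might look like the sketch below. We have not tested it: it simply drops device_map="auto" and passes input_ids without the .to("cpu") call, keeping every other argument from your script. The names MODEL_NAME, build_messages, and run_on_cpu are illustrative helpers only, and the heavy imports sit inside the function so the file can be imported without loading the model.

```python
MODEL_NAME = "LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct-AWQ"


def build_messages(prompt):
    """Build the chat messages used in the original script."""
    return [
        {"role": "system",
         "content": "You are EXAONE model from LG AI Research, a helpful assistant."},
        {"role": "user", "content": prompt},
    ]


def run_on_cpu(prompt, max_new_tokens=128):
    """Load the AWQ model for CPU and generate a reply (untested sketch)."""
    # Imported lazily so that importing this file does not require the libraries.
    import torch
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    # Same arguments as the original script, minus device_map="auto".
    model = AutoAWQForCausalLM.from_quantized(
        MODEL_NAME,
        use_ipex=True,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    input_ids = tokenizer.apply_chat_template(
        build_messages(prompt),
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    )
    # input_ids is already a CPU tensor, so no .to("cpu") is needed.
    output = model.generate(
        input_ids,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )
    return tokenizer.decode(output[0])
```

Calling print(run_on_cpu("Explain how wonderful you are")) would then reproduce the original test. If the NameError still appears, the full traceback will help us narrow it down.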
For CPU inference, the AutoAWQ documentation suggests installing the required dependencies using:
pip install autoawq[cpu]
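After installing, it may help to confirm which versions are actually present before retrying. The snippet below is a small sketch using only the Python standard library. The package list mirrors the environment you posted, and installed_version is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names taken from the environment listed in the question.
PACKAGES = ["autoawq", "transformers", "intel_extension_for_pytorch"]


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in PACKAGES:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Including this output with any new error report would make the environment easier to reproduce.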
Please note that AWQ was primarily designed for GPU inference, and we haven't thoroughly tested it in CPU environments yet.
We recommend trying the code modifications suggested above and referring to the AutoAWQ documentation. We'll update you once our testing is complete.
Thank you for your patience and understanding.