How to run CPU mode in AWQ?

#1
by joyUniverse - opened

I tried to test the AWQ model in CPU mode by following the Quickstart guide, but generation failed.
How can I run AWQ in CPU mode?
I get the following error: NameError: name 'flash_attn_func' is not defined

Could you help me with this?

Thank you.

######################################

env

autoawq 0.2.7.post3
transformers 4.46.3
intel_extension_for_pytorch 2.5.0
######################################

test code

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from awq import AutoAWQForCausalLM
from awq.utils.utils import get_best_device

device = get_best_device()
model_name = "LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct-AWQ"

model = AutoAWQForCausalLM.from_quantized(
    model_name,
    use_ipex=True,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Choose your prompt
prompt = "Explain how wonderful you are"  # English example
prompt = "스스로를 자랑해 봐"  # Korean example

messages = [
    {"role": "system", "content": "You are EXAONE model from LG AI Research, a helpful assistant."},
    {"role": "user", "content": prompt}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
)

output = model.generate(
    input_ids.to("cpu"),
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=128,
    do_sample=False,
)
print(tokenizer.decode(output[0]))

LG AI Research org

Hello @joyUniverse, we apologize for the delayed response.

Let me guide you through a few points:

  1. When running on the CPU, you need to remove device_map="auto" from your code.
  2. There's no need to move input_ids to the CPU, as it is already in CPU memory (RAM).

After implementing these changes, the original error might be resolved, though other unexpected issues may arise.
If you encounter any new errors, please share them with us so we can help resolve the issues more efficiently.
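
For reference, the two changes above applied to the original test code might look like the following sketch. It reuses only the calls and arguments from the posted code; the names MODEL_NAME, LOAD_KWARGS, and main are introduced here for illustration. It has not been verified on a CPU-only machine, so it may still fail for other reasons.

```python
# Sketch only: the original test code with the two suggested changes applied.
# MODEL_NAME, LOAD_KWARGS, and main are illustrative names, not part of any library.

MODEL_NAME = "LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct-AWQ"

# Change 1: same loader arguments as the original code, without device_map="auto".
# torch_dtype is passed inside main() so this dict needs no torch import.
LOAD_KWARGS = {"use_ipex": True, "trust_remote_code": True}


def main():
    # Heavy imports are kept inside main() so the module can be inspected
    # without torch, transformers, or autoawq installed.
    import torch
    from transformers import AutoTokenizer
    from awq import AutoAWQForCausalLM

    model = AutoAWQForCausalLM.from_quantized(
        MODEL_NAME,
        torch_dtype=torch.bfloat16,
        **LOAD_KWARGS,
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    prompt = "Explain how wonderful you are"
    messages = [
        {"role": "system", "content": "You are EXAONE model from LG AI Research, a helpful assistant."},
        {"role": "user", "content": prompt},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    )

    # Change 2: input_ids is already in CPU memory, so there is no .to("cpu") call.
    output = model.generate(
        input_ids,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=128,
        do_sample=False,
    )
    print(tokenizer.decode(output[0]))

# Call main() to run generation; the model is downloaded on first use.
```

If this still raises an error, posting the full traceback together with the package versions makes it easier to narrow down.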

For CPU inference, the AutoAWQ documentation suggests installing the required dependencies using:

pip install autoawq[cpu]
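
After installing, it can help to confirm which versions are actually present before retrying. The sketch below uses only the Python standard library; the package names are the ones listed in the env section of the question, and installed_versions is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names taken from the "env" section of the original question.
PACKAGES = ("autoawq", "transformers", "intel_extension_for_pytorch")


def installed_versions(names=PACKAGES):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```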

Please note that AWQ was primarily designed for GPU inference, and we haven't thoroughly tested it in CPU environments yet.
We recommend trying the code modifications suggested above and referring to the AutoAWQ documentation. We'll update you once our testing is complete.

Thank you for your patience and understanding.
