error: unexpected keyword argument 'inject_fused_attention'
Hi everyone, I really appreciate @TheBloke for these wonderful models.
I'm trying to set up TheBloke/Llama-2-70B-chat-GPTQ for basic inference from Python code. The steps I followed were:
Environment:
- RTX A6000 GPU
- 62 GB RAM
Process:
- install auto-gptq (GITHUB_ACTIONS=true pip3 install auto-gptq)
- install the latest transformers lib (pip3 install git+https://github.com/huggingface/transformers)
Code:
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
model_name_or_path = "TheBloke/Llama-2-70B-chat-GPTQ"
model_basename = "gptq_model-4bit--1g"
use_triton = False
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        inject_fused_attention=False,  # Required for Llama 2 70B model at this time.
        use_safetensors=True,
        trust_remote_code=False,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)
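For context, this is roughly how I intended to run inference once the model loads (just a sketch: the prompt template and generation settings below are my own placeholders, not part of the failing code):

prompt = "Tell me about AI"
prompt_template = f"[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n{prompt} [/INST]"

# Tokenize the chat-formatted prompt and generate on the same device the model was loaded to
input_ids = tokenizer(prompt_template, return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))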
Error:
TypeError Traceback (most recent call last)
Cell In[1], line 11
7 use_triton = False
9 tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
---> 11 model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
12 model_basename=model_basename,
13 inject_fused_attention=False, # Required for Llama 2 70B model at this time.
14 use_safetensors=True,
15 trust_remote_code=False,
16 device="cuda:0",
17 use_triton=use_triton,
18 quantize_config=None)
TypeError: AutoGPTQForCausalLM.from_quantized() got an unexpected keyword argument 'inject_fused_attention'
Any help would be appreciated. Thank you in advance.
It's weird, because inject_fused_attention is clearly declared in AutoGPTQ's from_quantized signature:
def from_quantized(
    cls,
    model_name_or_path: Optional[str] = None,
    save_dir: Optional[str] = None,
    device_map: Optional[Union[str, Dict[str, Union[str, int]]]] = None,
    max_memory: Optional[dict] = None,
    device: Optional[Union[str, int]] = None,
    low_cpu_mem_usage: bool = False,
    use_triton: bool = False,
    inject_fused_attention: bool = True,
    inject_fused_mlp: bool = True,
    use_cuda_fp16: bool = True,
    quantize_config: Optional[BaseQuantizeConfig] = None,
    model_basename: Optional[str] = None,
    use_safetensors: bool = False,
    trust_remote_code: bool = False,
    warmup_triton: bool = False,
    trainable: bool = False,
    **kwargs
) -> BaseGPTQForCausalLM:
    model_type = check_and_get_model_type(
        save_dir or model_name_or_path, trust_remote_code
    )
I do not have this error; I have a different one (https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/18#64be578976a6e2efccc31cd0), though it seems to occur later than yours. Which Python version are you using? (I use 3.8.) Which version of auto-gptq? (I have 0.3.0.)
@lasalH this error suggests AutoGPTQ is on an earlier version. I am not sure why that's happened, but can you try:
pip3 uninstall -y auto-gptq
GITHUB_ACTIONS=true pip3 install auto-gptq==0.2.2
Report if there are any errors shown by that command, and if not, test again.
I've specified 0.2.2 because there's currently a bug in 0.3.0 which affects inference with some of my GPTQ uploads (the ones that have act_order + group_size together). The bug has been fixed and another release, 0.3.1, should be out soon, but for now use 0.2.2.
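After reinstalling, you can confirm the pinned version is the one Python picks up with something like this (a minimal check using the standard library; it assumes the PyPI distribution name auto-gptq):

import importlib.metadata

# Should report 0.2.2 after the reinstall; 0.3.0 is the release with the act_order + group_size bug
print(importlib.metadata.version("auto-gptq"))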