quantization gptq_marlin (not found gptq_marlin) not work. , remove it. work.

by linpan - opened Jul 27

Jul 27

env:: vllm0.5.3.post

docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000
-v hf_cache:/root/.cache/huggingface
vllm/vllm-openai:latest
--model hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4
--quantization gptq_marlin
--tensor-parallel-size 8
--max-model-len 4096

alvarobartt

Hugging Quants org Jul 28

Hi here @linpan could you please add the error or elaborate more on why it fails? Thanks!

linpan

Jul 29

--quantization gptq_marlin not found quantization method.

linpan

Jul 29

remove "--quantization gptq_marlin" is working. vllm0.5.3 support gptq_marlin

alvarobartt

Hugging Quants org Jul 29

Well, that's odd, since it should support gptq_marlin as per https://docs.vllm.ai/en/v0.5.3/models/engine_args.html

--quantization, -q
Possible choices: aqlm, awq, deepspeedfp, fp8, fbgemm_fp8, marlin, gptq_marlin_24, gptq_marlin, awq_marlin, gptq, squeezellm, compressed-tensors, bitsandbytes, None

Method used to quantize the weights. If None, we first check the quantization_config attribute in the model config file. If that is None, we assume the model weights are not quantized and use dtype to determine the data type of the weights.

I guess that those will be used by default anyway, as that's more optimal, but still weird that gptq_marlin doesn't work, could you please fill an issue at https://github.com/vllm-project/vllm/issues? They will be able to address that better 🤗

linpan

Jul 31

case1: remove quantization para. ask "who are you? "
has problem: input not stop until model max_tokens.