How to use INT4 with vLLM?
It is my understanding that INT4 isn't properly supported natively by most hardware. That's why, if one doesn't actively set a dtype in the vLLM engine arguments (the example code in the README doesn't), it defaults to the torch_dtype in its config.json, which is float16.
If I understand this correctly, this means that the model gets de-quantized from INT4 (= 0.5 bytes per weight) to half-precision floating point (= 2 bytes per weight), which renders the whole point of quantization futile.
I must be understanding something wrong, can you help me out here? I'd greatly appreciate it!
I use this one with vLLM. The responses are mostly coherent, though not great, and sometimes it generates complete garbage. I also tried the GPTQ version, but that only generated nonsense.
It should be fixed now: https://github.com/vllm-project/vllm/commit/75acdaa4b616c2e95c55a47d3158ceec9c72c503
I had the same experience: when you don't set -q to awq or gptq, vLLM automatically selects awq_marlin or gptq_marlin.
Marlin was bugged.
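For anyone who wants to pin the quantization method instead of relying on auto-selection, here is a minimal sketch that builds the launch command for vLLM's OpenAI-compatible server with the long form of -q. The model name is a placeholder and the helper function is made up for illustration; check which flags your vLLM version accepts before relying on it.

```python
import shlex


def build_server_command(model, quantization="awq", dtype="float16"):
    """Return the argv list for launching vLLM with an explicit quantization method."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--quantization", quantization,  # long form of -q; avoids the automatic *_marlin choice
        "--dtype", dtype,                # compute dtype, not the storage dtype of the weights
    ]


# Placeholder model name (assumption): substitute the INT4 checkpoint you actually use.
cmd = build_server_command("your-org/your-int4-awq-model")
print(shlex.join(cmd))
```

Passing quantization="gptq" gives the equivalent command for a GPTQ checkpoint.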
Hi @hfmon, if I get your question correctly: when loading AWQ, GPTQ, or other quantized models, quantization or dequantization does not happen on the fly at load time. The dtype is set to float16 because it is the compute data type, not the loading data type, and float16 is chosen mainly because most hardware supports it. So the model is loaded in its quantized type, int4 in this case, not in float16 or bfloat16; the same goes for the GPTQ quants.
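One way to picture the difference between storage type and compute type is the pure-Python sketch below: 4-bit values stay packed two per byte in memory and are expanded to floating point only for the arithmetic. The nibble order, the single scale and zero point, and all function names are assumptions for illustration; the real AWQ and GPTQ kernels use their own layouts, per-group parameters, and run on the GPU.

```python
def pack_int4(values):
    """Pack unsigned 4-bit integers (0..15) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("values must fit in 4 bits")
    return bytes(lo | (hi << 4) for lo, hi in zip(values[0::2], values[1::2]))


def unpack_int4(packed):
    """Recover the 4-bit integers from the packed bytes."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)  # low nibble
        out.append(byte >> 4)    # high nibble
    return out


def dequantize(packed, scale, zero_point):
    """Expand packed 4-bit weights to floats for compute; storage stays packed."""
    return [(q - zero_point) * scale for q in unpack_int4(packed)]


def weight_bytes(num_params, bits_per_weight):
    """Bytes needed for the raw weights, ignoring scales and zero points."""
    return num_params * bits_per_weight // 8


packed = pack_int4([3, 15, 0, 8])
print(len(packed))                                    # 4 weights stored in 2 bytes
print(dequantize(packed, scale=0.5, zero_point=8))    # [-2.5, 3.5, -4.0, 0.0]
print(weight_bytes(1_000_000_000, 4))                 # hypothetical 1B-parameter model at int4
print(weight_bytes(1_000_000_000, 16))                # the same model at float16
```

For the hypothetical 1B-parameter model, the raw weights take 500,000,000 bytes at 4 bits versus 2,000,000,000 bytes at 16 bits, which matches the 0.5 versus 2 bytes per weight from the question.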
Hi @bakbeest, I'm afraid that may be due to the precision loss that comes with quantizing the original weights. If you're open to an alternative with greater precision, I'd recommend https://huggingface.co/chat, where we are serving https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 for free!
Hi @Derpyhuwe, thanks for the heads up. The commands here have been tried on different hardware (specifically 4 x A10 and 4 x L4), and we found no issues, though. Do you mean the performance loss may be related to the kernels used within vLLM? Could you elaborate on that? Thanks in advance!
I just tested this model with vLLM compiled from source, as the current version v0.5.3.post1 doesn't have the Marlin fix for AWQ and GPTQ. With default settings, the compiled version detected that they were Marlin kernel quants without any problem.
To compile vLLM from source:
git clone https://github.com/vllm-project/vllm.git
cd vllm
python3 -m venv .venv
source .venv/bin/activate
or source .venv/bin/activate.fish for fish shell
python -m pip install -e .
It works just fine, on par with other 4-bit quants of this model (GGUF at Q4_K_S, exl2 at 4.0bpw...) in terms of quality on a quick test, but I haven't tested thoroughly. I want to do a 4-bit comparison between awq-marlin, gptq-marlin, GGUF Q4 and exl2 4.0bpw to test quality and performance before deploying to production.
It responds coherently and can do simple math.
GPTQ:
Note: GPTQ didn't load with 0.9 GPU memory utilization on vLLM. I got OOM errors and had to lower it to 0.85, so GPTQ might consume more VRAM?
AWQ seems faster too
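For reference, the lowered memory setting from the note above could be expressed through the Python API roughly as follows. This is only a sketch: the model name is a placeholder, engine_kwargs is a made-up helper, and the final LLM(...) call is left as a comment because it needs vLLM and a GPU.

```python
def engine_kwargs(model, quantization, gpu_memory_utilization=0.9):
    """Collect keyword arguments for vllm.LLM, validating the memory fraction."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return {
        "model": model,
        "quantization": quantization,
        "gpu_memory_utilization": gpu_memory_utilization,
    }


# Placeholder model name (assumption); 0.85 is the value that avoided OOM in the note above.
gptq_kwargs = engine_kwargs("your-org/your-int4-gptq-model", "gptq", gpu_memory_utilization=0.85)
print(gptq_kwargs)

# With vLLM installed, the engine would then be created like this:
# from vllm import LLM
# llm = LLM(**gptq_kwargs)
```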
@bullerwins hello! Please tell me, have you done a 4-bit comparison between awq-marlin, gptq-marlin, gguf Q4 and exl2 4.0bpw? It would also be interesting to know how many tokens per second it generates for comparison.