Basic question about 4-bit quantization

#18
by abhinavkulkarni - opened

Hi,

Pardon me for asking this, but I have a very basic question about 4-bit quantization. How are these 4-bit quantized weights loaded in PyTorch (through the HF AutoModelForCausalLM API) when PyTorch doesn't natively support int4?

For example, I understand how 4-bit quantized vectors (or matrices) and the corresponding fp32 scaling factor and zero points can be stored contiguously, as is explained here. However, I am not clear on how the computations are actually done in PyTorch when it doesn't support a native int4 data type.
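To make my mental model concrete, here is roughly how I picture the packing and dequantization working - just an illustrative sketch in plain PyTorch, not the actual GPTQ kernel code, and all the names here are made up:

```python
import torch

# Illustrative sketch only: pack two int4 values per uint8 byte, then
# dequantize with an fp32 scale and zero point before the matmul.
# Real GPTQ kernels use their own packing layout and fused CUDA code.

def pack_int4(q):
    """Pack a 1-D tensor of int4 values (0..15, even length) into uint8, two per byte."""
    q = q.to(torch.uint8)
    return q[0::2] | (q[1::2] << 4)

def unpack_int4(packed):
    """Unpack uint8 bytes back into the original sequence of int4 values."""
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    return torch.stack([low, high], dim=-1).flatten()

def dequantize(q, scale, zero_point):
    """Recover approximate float weights: w ~= scale * (q - zero_point)."""
    return scale * (q.to(torch.float32) - zero_point)

# Example: quantize a small weight vector, pack it, then dequantize it again.
w = torch.randn(8)
scale = (w.max() - w.min()) / 15            # one scale for the whole group
zero_point = torch.round(-w.min() / scale)
q = torch.clamp(torch.round(w / scale + zero_point), 0, 15)

packed = pack_int4(q)                       # stored as uint8, two weights per byte
w_hat = dequantize(unpack_int4(packed), scale, zero_point)
print(torch.allclose(w, w_hat, atol=scale.item()))  # reconstruction within ~one quantization step
```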

Thanks!

You can't load 4-bit models in native transformers at the moment. You may be able to do so soon, when bitsandbytes releases its new 4-bit mode. However, then you would use the base float16 model with something like load_in_4bit=True (not sure exactly, as it's not released yet) - the same principle as their current 8-bit quantisations.
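For reference, the current 8-bit path looks roughly like this - the model name is just a placeholder, and the eventual 4-bit flag would presumably follow the same pattern:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# bitsandbytes 8-bit loading: the fp16 weights are downloaded and quantized
# to int8 on the fly at load time. The model id below is a placeholder.
model_id = "your-fp16-llama-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    load_in_8bit=True,  # requires bitsandbytes; a future load_in_4bit would work the same way
)
```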

To load GPTQ 4-bit models you need to use compatible code.

There's a relatively new repo called AutoGPTQ which aims to make it as easy as possible to load GPTQ models and then use them with standard transformers code. You still don't use AutoModelForCausalLM - instead you use AutoGPTQForCausalLM - but once the model is loaded, you can use any normal transformers code, as in the sketch below.
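A rough sketch of what that looks like (the repo name and generation settings are placeholders - check the AutoGPTQ README for the exact arguments your version supports):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Placeholder repo name: point this at the GPTQ model you want to load.
model_id = "TheBloke/some-model-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# from_quantized loads the packed 4-bit weights plus the quantization config.
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,
)

# From here on it behaves like a normal transformers causal LM.
prompt = "Tell me about quantization."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```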

Thanks @TheBloke for the reply and all the great work you do in providing quantized 4-bit models to the community.

abhinavkulkarni changed discussion status to closed
