SpQR

The SpQR quantization algorithm uses a 16x16 tiled bi-level group 3-bit quantization structure with sparse outliers, as detailed in SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression.
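
The sketch below illustrates the idea on a single 16x16 tile, assuming a simple min-max 3-bit quantizer and an error-based outlier rule. The actual SpQR method selects outliers by weight sensitivity and packs everything into compact storage, so treat this as a conceptual illustration only; the function name spqr_like_tile and the outlier_threshold parameter are hypothetical.

import torch

def spqr_like_tile(w, outlier_threshold=3.0):
    # First-level statistics: one (scale, zero) pair per group of 16 weights (one row).
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 7.0   # 3 bits -> levels 0..7
    zero = -w_min / scale

    # Second level ("bi-level"): the tile's 16 scales and 16 zeros are
    # themselves quantized to 3 bits, keeping the metadata overhead small.
    def requantize_3bit(stat):
        s_min, s_max = stat.min(), stat.max()
        s_scale = (s_max - s_min).clamp(min=1e-8) / 7.0
        q = torch.clamp(torch.round((stat - s_min) / s_scale), 0, 7)
        return q * s_scale + s_min   # dequantized statistic used below

    scale = requantize_3bit(scale)
    zero = requantize_3bit(zero)

    # Quantize the tile's weights to 3 bits using the requantized statistics.
    q = torch.clamp(torch.round(w / scale + zero), 0, 7)
    dequant = (q - zero) * scale

    # Sparse outliers: weights the 3-bit grid reconstructs poorly stay in
    # full precision (a simplified stand-in for SpQR's sensitivity rule).
    err = (w - dequant).abs()
    outlier_mask = err > outlier_threshold * err.mean()
    return q, scale, zero, torch.where(outlier_mask, w, torch.zeros_like(w))

tile = torch.randn(16, 16)
q, scale, zero, outliers = spqr_like_tile(tile)
print(int(outliers.count_nonzero()), "outliers kept in full precision")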

To SpQR-quantize a model, refer to the Vahe1994/SpQR repository.

Load a SpQR-quantized model with from_pretrained().

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the SpQR-quantized checkpoint. torch_dtype=torch.half keeps the
# non-quantized tensors in fp16, and device_map="auto" spreads the model
# across the available devices.
quantized_model = AutoModelForCausalLM.from_pretrained(
    "elvircrn/Llama-2-7b-SPQR-3Bit-16x16-red_pajama-hf",
    torch_dtype=torch.half,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("elvircrn/Llama-2-7b-SPQR-3Bit-16x16-red_pajama-hf")
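
Once loaded, the model works with the standard Transformers generation API. For example (the prompt text here is arbitrary):

inputs = tokenizer("SpQR compresses LLM weights by", return_tensors="pt").to(quantized_model.device)
outputs = quantized_model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))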