Transformers

You are viewing v4.44.2 version. A newer version v4.49.0 is available.

Join the Hugging Face community

and get access to the augmented documentation experience

Collaborate on models, datasets and Spaces

Faster examples with accelerated inference

Switch between documentation themes

to get started

FBGEMM FP8

With FBGEMM FP8 quantization method, you can quantize your model in FP8 (W8A8):

the weights will be quantized in 8bit (FP8) per channel
the activation will be quantized in 8bit (FP8) per token

It relies on the FBGEMM library which provides efficient low-precision general matrix multiplication for small batch sizes and support for accuracy-loss minimizing techniques such as row-wise quantization and outlier-aware quantization.

You need a GPU with compute capability>=9 (e.g. H100)

Before you begin, make sure the following libraries are installed with their latest version:

pip install --upgrade accelerate fbgemm-gpu torch

If you are having issues with fbgemm-gpu and torch library, you might need to install the nighlty release. You can follow the instruction here

from transformers import FbgemmFp8Config, AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"
quantization_config = FbgemmFp8Config()
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quantization_config)

tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

output = quantized_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))

A quantized model can be saved via “saved_pretrained” and be reused again via the “from_pretrained”.

quant_path = "/path/to/save/quantized/model"
model.save_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")

< > Update on GitHub

←HQQ Optimum→