Source code to quantize the LLaMA 3.1 405B model

#10
by shuyuej - opened

Could you please share the source code for quantizing the LLaMA 3.1 405B model?

Thank you very much in advance!

Hugging Quants org

Hi there @shuyuej, those are already included in the model card 🤗

See https://huggingface.co/hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4#quantization-reproduction

import random

import numpy as np
import torch

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
from transformers import AutoTokenizer

pretrained_model_dir = "meta-llama/Meta-Llama-3.1-405B-Instruct"
quantized_model_dir = "meta-llama/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4"

print("Loading tokenizer, dataset, and tokenizing the dataset...")
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
dataset = load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1", split="train")
encodings = tokenizer("\n\n".join(dataset["text"]), return_tensors="pt")

print("Setting random seeds...")
random.seed(0)
np.random.seed(0)
torch.random.manual_seed(0)

print("Setting calibration samples...")
nsamples = 128
seqlen = 2048
calibration_samples = []
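# draw nsamples random windows of seqlen tokens from the tokenized corpus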
for _ in range(nsamples):
    i = random.randint(0, encodings.input_ids.shape[1] - seqlen - 1)
    j = i + seqlen
    input_ids = encodings.input_ids[:, i:j]
    attention_mask = torch.ones_like(input_ids)
    calibration_samples.append({"input_ids": input_ids, "attention_mask": attention_mask})

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=True,  # setting this to False can significantly speed up inference, but perplexity may be slightly worse
    sym=True,  # use symmetric quantization so the range is symmetric and the value 0 can be represented exactly (can provide speedups)
    damp_percent=0.1,  # see https://github.com/AutoGPTQ/AutoGPTQ/issues/196
)

# load the unquantized model; by default it is loaded into CPU memory
print("Load unquantized model...")
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize the model; the examples must be a list of dicts whose only keys are "input_ids" and "attention_mask"
print("Quantize model with calibration samples...")
model.quantize(calibration_samples)

# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)
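
Once saved, the quantized checkpoint can be loaded back for inference with from_quantized. The snippet below is only a rough sketch, not part of the model card: it assumes the tokenizer is still loaded from the original repo (the script above does not save it next to the quantized weights) and that enough GPU memory is available, so device_map and the generation settings will likely need adjusting for a model of this size.

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

pretrained_model_dir = "meta-llama/Meta-Llama-3.1-405B-Instruct"
quantized_model_dir = "meta-llama/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4"

# load the tokenizer from the original repo and the INT4 weights saved above
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device_map="auto", use_safetensors=True)

# run a short generation as a sanity check of the quantized model
inputs = tokenizer("What is GPTQ quantization?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
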

@alvarobartt Thank you very much for sharing the code.
How can we enable multi-GPU quantization when using model.quantize()?
It seems that only one GPU (cuda:0) is used, and its memory is not enough to quantize such a large model.
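
One possible workaround (not confirmed in this thread) is to pass a max_memory mapping to from_pretrained so that the full-precision weights are sharded across the GPUs and CPU RAM at load time; whether the quantization pass itself then uses more than one GPU is unclear. A rough sketch with placeholder memory budgets:

import torch

from auto_gptq import AutoGPTQForCausalLM

# placeholder per-device budgets; tune these to the actual hardware
max_memory = {i: "75GiB" for i in range(torch.cuda.device_count())}
max_memory["cpu"] = "1500GiB"  # let the rest of the FP16 weights stay in CPU RAM

model = AutoGPTQForCausalLM.from_pretrained(
    pretrained_model_dir,  # same checkpoint as in the script above
    quantize_config,       # same BaseQuantizeConfig as in the script above
    max_memory=max_memory,
)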

Thank you very much again, and have a nice day!

Best regards,

Shuyue
August 3rd, 2024

@alvarobartt I always get this error on my side when running this code:

torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite (the leading minor of order 1 is not positive-definite).
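
A workaround often reported for this kind of failure is to raise damp_percent (GPTQ applies a Cholesky factorization to a Hessian estimated from the calibration data, and a larger dampening term on its diagonal makes it better conditioned) and/or to use more calibration samples. This is only a sketch of that suggestion, not a verified fix for the 405B model:

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True,
    sym=True,
    damp_percent=0.25,  # increased from 0.1 to better condition the Hessian before the Cholesky step
)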
