hugging-quants/gemma-2-9b-it-AWQ-INT4

This repository is a community-driven quantized version of the original model google/gemma-2-9b-it, the official BF16 half-precision version released by Google.

This model was quantized using transformers 4.45.0, so the tokenizer available in this repository is not compatible with earlier versions. The same applies to, e.g., Text Generation Inference (TGI), which only installs transformers 4.45.0 or higher starting in v2.3.1.
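
The version requirement above can be checked before loading the tokenizer. The following is a minimal sketch using only the standard library; the helper names `parse_version` and `transformers_is_recent_enough` are hypothetical, and the parser only looks at the leading numeric components, so a pre-release such as `4.45.0rc1` is treated like `4.45.0`.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_TRANSFORMERS = (4, 45, 0)  # minimum version stated in this model card


def parse_version(text):
    """Turn '4.45.0' or '4.46.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)  # pad '4.45' to (4, 45, 0) so tuple comparison is fair
    return tuple(parts)


def transformers_is_recent_enough(minimum=MIN_TRANSFORMERS):
    """Return True or False, or None when transformers is not installed."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return None
    return parse_version(installed) >= minimum


print(transformers_is_recent_enough())
```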

Model Information

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are text-to-text, decoder-only large language models, available in English, with open weights for both pre-trained variants and instruction-tuned variants. Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop or your own cloud infrastructure, democratizing access to state of the art AI models and helping foster innovation for everyone.

This repository contains google/gemma-2-9b-it quantized with AutoAWQ from FP16 down to INT4, using the GEMM kernels and zero-point quantization with a group size of 128.
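
To make "zero-point quantization with a group size of 128" more concrete, here is a minimal pure-Python sketch of group-wise asymmetric quantization. It is an illustration only and not AutoAWQ's code: it leaves out the activation-aware scaling search that AWQ performs, and it assumes the quantization range of each group is widened to include 0.0 so that the zero point always fits in 4 bits. All function names are hypothetical.

```python
import math


def quantize_group(weights, n_bits=4):
    """Asymmetric (zero-point) quantization of one group of floats."""
    q_max = 2 ** n_bits - 1  # 15 for INT4
    # Assumption of this sketch: include 0.0 in the range so the zero point
    # is guaranteed to land inside [0, q_max].
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / q_max or 1.0  # fall back to 1.0 for an all-zero group
    zero_point = round(-w_min / scale)
    codes = [max(0, min(q_max, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate float weights."""
    return [(q - zero_point) * scale for q in codes]


def quantize_row(row, group_size=128, n_bits=4):
    """Quantize a row of weights group by group, one scale and zero point per group."""
    return [
        quantize_group(row[start:start + group_size], n_bits)
        for start in range(0, len(row), group_size)
    ]


# Toy data: 256 weights, so two groups of 128.
row = [math.sin(i) / 10 for i in range(256)]
groups = quantize_row(row)
print(len(groups), "groups")
```

Each group stores its 4-bit codes plus one scale and one zero point, which is why smaller groups improve accuracy at the cost of more metadata.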

Model Usage

Running inference with Gemma2 9B Instruct AWQ in INT4 requires around 6 GiB of VRAM just to load the model checkpoint. This figure excludes the KV cache and the CUDA graphs, so somewhat more VRAM than that should be available.
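
The checkpoint size can be sanity-checked with a back-of-the-envelope calculation. The sketch below rests on assumptions that are not taken from this repository: roughly 8.3 billion weights in the quantized linear layers, roughly 0.92 billion embedding weights kept in 16-bit, and one 16-bit scale plus one 4-bit zero point per group of 128 weights. The function name is hypothetical.

```python
GIB = 1024 ** 3


def awq_weight_bytes(quantized_params, unquantized_params,
                     w_bit=4, group_size=128, scale_bits=16, other_bits=16):
    """Estimate checkpoint size in bytes for a group-wise quantized model."""
    # Each group carries one scale and one zero point on top of its codes.
    per_group_overhead_bits = scale_bits + w_bit
    bits_per_weight = w_bit + per_group_overhead_bits / group_size
    quantized_bytes = quantized_params * bits_per_weight / 8
    unquantized_bytes = unquantized_params * other_bits / 8
    return quantized_bytes + unquantized_bytes


# Assumed, approximate parameter counts (see the paragraph above).
estimate_gib = awq_weight_bytes(8.3e9, 0.92e9) / GIB
print(f"estimated checkpoint size: {estimate_gib:.2f} GiB")
```

Under these assumptions the estimate lands in the same ballpark as the figure quoted above; the KV cache and CUDA graphs come on top of it.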

The quantized model can be used with several solutions, such as transformers, autoawq, text-generation-inference, or vllm.

🤗 Transformers

In order to run the inference with Gemma2 9B Instruct AWQ in INT4, you need to install the following packages:

pip install -q --upgrade "transformers>=4.45.0" accelerate
INSTALL_KERNELS=1 pip install -q git+https://github.com/casper-hansen/AutoAWQ.git@79547665bdb27768a9b392ef375776b020acbf0c

To run inference with Gemma2 9B Instruct AWQ in INT4 precision, instantiate the AWQ model like any other causal language model via AutoModelForCausalLM and run inference as usual.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

model_id = "hugging-quants/gemma-2-9b-it-AWQ-INT4"

quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512, # Note: Update this as per your use-case
    do_fuse=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
  model_id,
  torch_dtype=torch.float16,
  low_cpu_mem_usage=True,
  device_map="auto",
  quantization_config=quantization_config
)

prompt = [
  {"role": "user", "content": "What's Deep Learning?"},
]
inputs = tokenizer.apply_chat_template(
  prompt,
  tokenize=True,
  add_generation_prompt=True,
  return_tensors="pt",
  return_dict=True,
).to("cuda")

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0])

AutoAWQ

In order to run the inference with Gemma2 9B Instruct AWQ in INT4, you need to install the following packages:

pip install -q --upgrade "transformers>=4.45.0" accelerate
INSTALL_KERNELS=1 pip install -q git+https://github.com/casper-hansen/AutoAWQ.git@79547665bdb27768a9b392ef375776b020acbf0c

Alternatively, the model can be run via AutoAWQ directly. Note that AutoAWQ is built on top of 🤗 transformers, so the transformers approach described above remains the recommended one.

import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "hugging-quants/gemma-2-9b-it-AWQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoAWQForCausalLM.from_pretrained(
  model_id,
  torch_dtype=torch.float16,
  low_cpu_mem_usage=True,
  device_map="auto",
)

prompt = [
  {"role": "user", "content": "What's Deep Learning?"},
]
inputs = tokenizer.apply_chat_template(
  prompt,
  tokenize=True,
  add_generation_prompt=True,
  return_tensors="pt",
  return_dict=True,
).to("cuda")

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0])

The AutoAWQ script has been adapted from AutoAWQ/examples/generate.py.

🤗 Text Generation Inference (TGI)

To run the text-generation-launcher with Gemma2 9B Instruct AWQ in INT4 with Marlin kernels for optimized inference speed, you will need to have Docker installed (see installation notes).

Then you just need to run the TGI v2.3.0 (or higher) Docker container as follows:

docker run --gpus all --shm-size 1g -ti -p 8080:80 \
  -v hf_cache:/data \
  -e MODEL_ID=hugging-quants/gemma-2-9b-it-AWQ-INT4 \
  -e QUANTIZE=awq \
  -e MAX_INPUT_LENGTH=4000 \
  -e MAX_TOTAL_TOKENS=4096 \
  ghcr.io/huggingface/text-generation-inference:2.3.0
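
With the settings above, a prompt may use at most MAX_INPUT_LENGTH tokens, and the prompt plus the generated tokens must fit within MAX_TOTAL_TOKENS. A minimal sketch of how a client could cap its `max_tokens` accordingly is shown below; the helper name `clamp_max_tokens` is hypothetical, and the prompt token count is assumed to be known in advance (for example from the tokenizer).

```python
def clamp_max_tokens(prompt_tokens, requested_new_tokens,
                     max_input_length=4000, max_total_tokens=4096):
    """Cap the number of generated tokens so the request fits the server budget."""
    if prompt_tokens > max_input_length:
        raise ValueError(
            f"prompt has {prompt_tokens} tokens, limit is {max_input_length}"
        )
    available = max_total_tokens - prompt_tokens
    return max(0, min(requested_new_tokens, available))


print(clamp_max_tokens(100, 128))   # short prompt: the full request fits
print(clamp_max_tokens(4000, 128))  # prompt at the input limit: only 96 tokens remain
```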

TGI exposes several endpoints; to see all of them, check the TGI OpenAPI Specification.

To send a request to the deployed TGI endpoint, which is compatible with the OpenAI OpenAPI specification, i.e. /v1/chat/completions:

curl 0.0.0.0:8080/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tgi",
    "messages": [
      {
        "role": "user",
        "content": "What is Deep Learning?"
      }
    ],
    "max_tokens": 128
  }'

Or programmatically via the huggingface_hub Python client as follows:

from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://0.0.0.0:8080", api_key="-")

chat_completion = client.chat.completions.create(
  model="hugging-quants/gemma-2-9b-it-AWQ-INT4",
  messages=[
    {"role": "user", "content": "What is Deep Learning?"},
  ],
  max_tokens=128,
)

Alternatively, the OpenAI Python client can also be used (see installation notes) as follows:

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8080/v1", api_key="-")

chat_completion = client.chat.completions.create(
  model="tgi",
  messages=[
    {"role": "user", "content": "What is Deep Learning?"},
  ],
  max_tokens=128,
)

vLLM

To run vLLM with Gemma2 9B Instruct AWQ in INT4, you will need to have Docker installed (see installation notes) and run the latest vLLM Docker container as follows:

docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
  -v hf_cache:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model hugging-quants/gemma-2-9b-it-AWQ-INT4 \
  --max-model-len 4096

To send a request to the deployed vLLM endpoint, which is compatible with the OpenAI OpenAPI specification, i.e. /v1/chat/completions:

curl 0.0.0.0:8000/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "hugging-quants/gemma-2-9b-it-AWQ-INT4",
    "messages": [
      {
        "role": "user",
        "content": "What is Deep Learning?"
      }
    ],
    "max_tokens": 128
  }'

Or programmatically via the openai Python client (see installation notes) as follows:

import os
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key=os.getenv("VLLM_API_KEY", "-"))

chat_completion = client.chat.completions.create(
  model="hugging-quants/gemma-2-9b-it-AWQ-INT4",
  messages=[
    {"role": "user", "content": "What is Deep Learning?"},
  ],
  max_tokens=128,
)
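
If installing a client library is not an option, the same request can be sent with the Python standard library alone. The following is a minimal sketch that mirrors the curl call above and assumes the server is reachable at http://0.0.0.0:8000; the helper names are hypothetical, and the response is assumed to follow the OpenAI chat completion schema.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message, max_tokens=128):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the text of the first completion choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://0.0.0.0:8000",
    "hugging-quants/gemma-2-9b-it-AWQ-INT4",
    "What is Deep Learning?",
)
print(request.full_url)
# Requires a running server:
# print(send_chat_request(request))
```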

Quantization Reproduction

To quantize Gemma2 9B Instruct using AutoAWQ, you need an instance with enough CPU RAM to fit the whole model, i.e. ~20 GiB, and an NVIDIA GPU with 16 GiB of VRAM to quantize it.

You also need to accept the Gemma2 access conditions first, as it is a gated model.

In order to quantize Gemma2 9B Instruct, first install the following packages:

pip install -q --upgrade "torch==2.3.0" "transformers>=4.45.0" accelerate
INSTALL_KERNELS=1 pip install -q git+https://github.com/casper-hansen/AutoAWQ.git@79547665bdb27768a9b392ef375776b020acbf0c

Then install the huggingface_hub Python SDK and log in to the Hugging Face Hub:

pip install -q --upgrade huggingface_hub
huggingface-cli login

Then run the following script, adapted from AutoAWQ/examples/quantize.py:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "google/gemma-2-9b-it"
quant_path = "hugging-quants/gemma-2-9b-it-AWQ-INT4"
quant_config = {
  "zero_point": True,
  "q_group_size": 128,
  "w_bit": 4,
  "version": "GEMM",
}

# Load model
model = AutoAWQForCausalLM.from_pretrained(
  model_path, low_cpu_mem_usage=True, use_cache=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
