This is a Llama2-7B-chat model quantized to 4 bits with HQQ, without grouping, and paired with a low-rank adapter to improve performance (this combination is referred to as HQQ+).
The model does not use grouping so that it stays compatible with the fast Marlin inference kernel.
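
To make the adapter idea concrete, here is a minimal NumPy sketch of a linear layer whose quantized weights are corrected by a low-rank term. It is an illustration under stated assumptions, not the HQQ+ implementation: the rounding step is a crude stand-in for 4-bit quantization, and the adapter is built from a truncated SVD of the quantization error rather than trained. All names (`W_q`, `A`, `B`, `forward`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 32, 64, 4

W = rng.normal(size=(d_out, d_in)).astype(np.float32)  # original weights
W_q = np.round(W * 2.0) / 2.0                          # crude stand-in for dequantized low-bit weights

# Assumption: build the adapter from a truncated SVD of the quantization error.
# A trained adapter would learn A and B from data instead.
U, S, Vt = np.linalg.svd(W - W_q, full_matrices=False)
A = Vt[:rank].T                  # (d_in, rank)
B = (U[:, :rank] * S[:rank]).T   # (rank, d_out)

def forward(x, W_q, A=None, B=None):
    """Quantized linear layer, optionally corrected by a low-rank adapter."""
    y = x @ W_q.T
    if A is not None and B is not None:
        y = y + (x @ A) @ B      # low-rank correction, costs O(rank) extra per output
    return y

x = rng.normal(size=(16, d_in)).astype(np.float32)
ref = x @ W.T
err_plain = np.linalg.norm(ref - forward(x, W_q))
err_lora = np.linalg.norm(ref - forward(x, W_q, A, B))
print(f"error without adapter: {err_plain:.4f}")
print(f"error with adapter:    {err_lora:.4f}")
```

The adapter adds only `rank * (d_in + d_out)` parameters per layer, which is why it can sit on top of the quantized weights at little memory cost.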

Running quantized models efficiently at inference time requires fused matrix-vector multiplication kernels. The kernels currently available constrain the choice of group size and the axis along which quantization is performed. Because this model does not use grouping, it is compatible with all kernels that operate along axis=1.
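
The difference between grouped and ungrouped quantization along axis=1 can be sketched in a few lines of NumPy. This is a simplified illustration that uses plain min/max rounding as a stand-in. HQQ itself optimizes the quantization parameters, so the function below (`quantize_axis1`, a hypothetical name) is not the library's algorithm.

```python
import numpy as np

def quantize_axis1(W, nbits=4, group_size=None):
    """Asymmetric min/max quantization along axis=1.

    group_size=None means no grouping: one scale/zero pair per row.
    """
    rows, cols = W.shape
    gs = cols if group_size is None else group_size
    assert cols % gs == 0, "group size must divide the number of columns"
    Wg = W.reshape(rows, cols // gs, gs)
    w_min = Wg.min(axis=2, keepdims=True)
    w_max = Wg.max(axis=2, keepdims=True)
    levels = 2**nbits - 1
    scale = np.maximum(w_max - w_min, 1e-8) / levels
    q = np.clip(np.round((Wg - w_min) / scale), 0, levels).astype(np.uint8)
    return q, scale, w_min

def dequantize(q, scale, zero, shape):
    """Map integer codes back to floating point and restore the matrix shape."""
    return (q * scale + zero).reshape(shape)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 256)).astype(np.float32)

q_row, s_row, z_row = quantize_axis1(W)                 # no grouping
q_grp, s_grp, z_grp = quantize_axis1(W, group_size=64)  # grouped

W_row = dequantize(q_row, s_row, z_row, W.shape)
W_grp = dequantize(q_grp, s_grp, z_grp, W.shape)

print("scales per row (no grouping):", s_row.shape[1])
print("scales per row (group_size=64):", s_grp.shape[1])
print("mean abs error (no grouping):", np.abs(W - W_row).mean())
print("mean abs error (group_size=64):", np.abs(W - W_grp).mean())
```

Without grouping, each row carries a single scale and zero-point, which keeps the metadata layout simple for kernels but gives the quantizer a wider range to cover per row. That trade-off is what the low-rank adapter is meant to offset.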

Performance

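| Models              | Llama2-7B-chat (fp16) | Llama2-7B-chat (HQQ+ 4-bit/no-gs) |
|---------------------|-----------------------|-----------------------------------|
| ARC (25-shot)       | 53.67                 | 48.46                             |
| HellaSwag (10-shot) | 78.56                 | 73.33                             |
| MMLU (5-shot)       | 48.16                 | 44.87                             |
| TruthfulQA-MC2      | 45.32                 | 43.27                             |
| Winogrande (5-shot) | 72.53                 | 71.67                             |
| GSM8K (5-shot)      | 23.12                 | 27.82                             |
| Average             | 53.56                 | 51.57                             |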
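
As a quick sanity check, the Average row can be recomputed from the six benchmark scores reported above. The snippet only re-derives the published averages and introduces no new measurements.

```python
# Scores copied from the table: (fp16, HQQ+ 4-bit/no-gs)
scores = {
    "ARC (25-shot)": (53.67, 48.46),
    "HellaSwag (10-shot)": (78.56, 73.33),
    "MMLU (5-shot)": (48.16, 44.87),
    "TruthfulQA-MC2": (45.32, 43.27),
    "Winogrande (5-shot)": (72.53, 71.67),
    "GSM8K (5-shot)": (23.12, 27.82),
}

fp16_avg = sum(fp16 for fp16, _ in scores.values()) / len(scores)
hqq_avg = sum(hqq for _, hqq in scores.values()) / len(scores)

print(f"fp16 average: {fp16_avg:.2f}")
print(f"HQQ+ average: {hqq_avg:.2f}")
print(f"gap:          {fp16_avg - hqq_avg:.2f}")
```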

Usage

First, install the latest version of HQQ:

pip install git+https://github.com/mobiusml/hqq.git

Then you can use the sample code below:

import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import *
from hqq.utils.patching import *
from hqq.utils.generation_hf import HFGenerator

#Settings
###################################################
backend       = "torchao_int4" # "torchao_int4" (4-bit only), "bitblas" (4-bit + 2-bit) or "gemlite" (8-bit, 4-bit, 2-bit, 1-bit)
compute_dtype = torch.bfloat16 if backend=="torchao_int4" else torch.float16
device        = 'cuda:0'
cache_dir     = '.'

#Load the model
###################################################
model_id  = "mobiuslabsgmbh/Llama-2-7b-chat-hf_4bitnogs_hqq"
model     = AutoHQQHFModel.from_quantized(model_id, cache_dir=cache_dir, compute_dtype=compute_dtype, adapter='adapter_v0.1.lora', device=device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)

#Use optimized inference kernels
###################################################
prepare_for_inference(model, backend=backend) #It takes a while...

#Generate
###################################################
#For longer context, make sure to allocate enough cache via the cache_size= parameter 
#gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile=None) #Slower generation but no warm-up 
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup() #Faster generation, but warm-up takes a while

gen.generate("Write an essay about large language models", print_tokens=True)
gen.generate("Tell me a funny joke!", print_tokens=True)
gen.generate("How to make a yummy chocolate cake?", print_tokens=True)