---
license: llama3.1
train: false
inference: false
pipeline_tag: text-generation
---
This is an <a href="https://github.com/mobiusml/hqq/">HQQ</a> all-4-bit (group-size=64) quantized <a href="https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct">Llama3.1-8B-Instruct</a> model.
We provide two versions:
* Calibration-free version: https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq/
* Calibrated version: https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib/
![image/png](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/i0vpy66jdz3IlGQcbKqHe.png)
![image/gif](https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq/resolve/main/llama3.1_4bit.gif)
## Model Size
| Models | fp16| HQQ 4-bit/gs-64 | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"> AWQ 4-bit </a>| <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"> GPTQ 4-bit </a> |
|:-------------------:|:--------:|:----------------:|:----------------:|:----------------:|
| Bits per weight (linear layers) | 16 | 4.5 | 4.25 | 4.25 |
| VRAM (GB) | 15.7 | 6.1 | 6.3 | 5.7 |
## Model Decoding Speed
| Models | fp16| HQQ 4-bit/gs-64| <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"> AWQ 4-bit </a>| <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"> GPTQ 4-bit </a> |
|:-------------------:|:--------:|:----------------:|:----------------:|:----------------:|
| Decoding* - short seq (tokens/sec)| 53 | <b>125</b> | 67 | 3.7 |
| Decoding* - long seq (tokens/sec)| 50 | <b>97</b> | 65 | 21 |
*: measured on an RTX 3090.
## Performance
| Models | fp16 | HQQ 4-bit/gs-64 (no calib) | HQQ 4-bit/gs-64 (calib) | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"> AWQ 4-bit </a> | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"> GPTQ 4-bit </a> |
|:-------------------:|:--------:|:----------------:|:----------------:|:----------------:|:----------------:|
| ARC (25-shot) | 60.49 | 60.32 | 60.92 | 57.85 | 61.18 |
| HellaSwag (10-shot)| 80.16 | 79.21 | 79.52 | 79.28 | 77.82 |
| MMLU (5-shot) | 68.98 | 67.07 | 67.74 | 67.14 | 67.93 |
| TruthfulQA-MC2 | 54.03 | 53.89 | 54.11 | 51.87 | 53.58 |
| Winogrande (5-shot)| 77.98 | 76.24 | 76.48 | 76.4 | 76.64 |
| GSM8K (5-shot) | 75.44 | 71.27 | 75.36 | 73.47 | 72.25 |
| Average | 69.51 | 68.00 | <b>69.02</b> | 67.67 | 68.23 |
| Relative performance | 100% | 97.83% | <b>99.3%</b> | 97.35% | 98.16% |
You can reproduce the results above with <a href="https://github.com/EleutherAI/lm-evaluation-harness">lm-eval</a> (`pip install lm-eval==0.4.3`).
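For example, here is a minimal sketch of scoring the quantized model on ARC (25-shot) through lm-eval's Python API. This is not the exact evaluation setup used for the table above; it assumes the `HFLM` wrapper and `simple_evaluate` signatures of lm-eval 0.4.x and loads the model the same way as in the Usage section below.
``` Python
# Hedged sketch: evaluate the HQQ-quantized model on ARC-Challenge (25-shot) with lm-eval.
# Wrapper/function names follow lm-eval 0.4.x; adjust tasks and batch size as needed.
import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from lm_eval import simple_evaluate
from lm_eval.models.huggingface import HFLM

model_id  = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq'
model     = AutoHQQHFModel.from_quantized(model_id, compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

lm      = HFLM(pretrained=model, tokenizer=tokenizer, batch_size=8)
results = simple_evaluate(model=lm, tasks=["arc_challenge"], num_fewshot=25)
print(results["results"]["arc_challenge"])
```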
## Usage
First, install the dependencies:
```
pip install git+https://github.com/mobiusml/hqq.git #master branch fix
pip install bitblas #if you want to use the bitblas backend
```
Also, make sure you are using torch `2.4.0` or newer (or a nightly build) with CUDA 12.1 or later.
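A quick environment check before running the sample code (just a sanity-check sketch):
``` Python
# Sanity check: expects torch >= 2.4.0 built against CUDA 12.1 or newer.
import torch
print(torch.__version__, torch.version.cuda)
assert torch.cuda.is_available(), "A CUDA GPU is required for the optimized backends."
```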
Then you can use the sample code below:
``` Python
import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.utils.patching import *
from hqq.core.quantize import *
from hqq.utils.generation_hf import HFGenerator
#Load the model
###################################################
model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq' #no calib version
#model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib' #calibrated version
compute_dtype = torch.bfloat16 #bfloat16 for torchao, float16 for bitblas
cache_dir = '.'
model = AutoHQQHFModel.from_quantized(model_id, cache_dir=cache_dir, compute_dtype=compute_dtype)
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, quant_scale=False, quant_zero=False, axis=1)
patch_linearlayers(model, patch_add_quant_config, quant_config)
#Use optimized inference kernels
###################################################
HQQLinear.set_backend(HQQBackend.PYTORCH)
#prepare_for_inference(model) #default backend
prepare_for_inference(model, backend="torchao_int4")
#prepare_for_inference(model, backend="bitblas") #takes a while to init...
#Generate
###################################################
#For longer context, make sure to allocate enough cache via the cache_size= parameter
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup() #Warm-up takes a while
gen.generate("Write an essay about large language models", print_tokens=True)
gen.generate("Tell me a funny joke!", print_tokens=True)
gen.generate("How to make a yummy chocolate cake?", print_tokens=True)
```
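If you prefer plain `transformers` generation over `HFGenerator`, the following sketch applies the Llama 3.1 chat template directly. It assumes `model` and `tokenizer` are loaded as in the snippet above; the generation settings are illustrative rather than recommended values.
``` Python
# Hedged alternative: standard transformers generation with the chat template.
# Assumes `model` and `tokenizer` from the snippet above; settings are illustrative.
import torch

messages = [{"role": "user", "content": "Tell me a funny joke!"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```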