|
--- |
|
license: llama3.1 |
|
train: false |
|
inference: false |
|
pipeline_tag: text-generation |
|
--- |
|
This is a version of <a href="https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct">Llama3.1-8B-Instruct</a> quantized with <a href="https://github.com/mobiusml/hqq/">HQQ</a> to 4-bit (group-size=64) across all linear layers.
|
We provide two versions: |
|
* Calibration-free version: https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq/ |
|
* Calibrated version: https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib/ |
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/i0vpy66jdz3IlGQcbKqHe.png) |
|
|
|
![image/gif](https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq/resolve/main/llama3.1_4bit.gif) |
|
|
|
|
|
## Model Size |
|
| Models | fp16| HQQ 4-bit/gs-64 | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"> AWQ 4-bit </a>| <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"> GPTQ 4-bit </a> | |
|
|:-------------------:|:--------:|:----------------:|:----------------:|:----------------:| |
|
| Bitrate (Linear layers) | 16 | 4.5 | 4.25 | 4.25 | |
|
| VRAM (GB) | 15.7 | 6.1 | 6.3 | 5.7 | |
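
The effective bitrate is above 4 bits because, in addition to the 4-bit weights, each group stores quantization metadata (a scale and a zero-point). Assuming fp16 metadata, a group size of 64 adds (16 + 16) / 64 = 0.5 extra bits per weight, which matches the 4.5 bits reported for HQQ; the 4.25-bit figure for the AWQ/GPTQ checkpoints corresponds to less metadata overhead per weight.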
|
|
|
## Model Decoding Speed |
|
| Models | fp16| HQQ 4-bit/gs-64| <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"> AWQ 4-bit </a>| <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"> GPTQ 4-bit </a> | |
|
|:-------------------:|:--------:|:----------------:|:----------------:|:----------------:| |
|
| Decoding* - short seq (tokens/sec)| 53 | <b>125</b> | 67 | 3.7 | |
|
| Decoding* - long seq (tokens/sec)| 50 | <b>97</b> | 65 | 21 | |
|
|
|
*: measured on an RTX 3090
|
|
|
## Performance |
|
|
|
| Models | fp16 | HQQ 4-bit/gs-64 (no calib) | HQQ 4-bit/gs-64 (calib) | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"> AWQ 4-bit </a> | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"> GPTQ 4-bit </a> | |
|
|:-------------------:|:--------:|:----------------:|:----------------:|:----------------:|:----------------:| |
|
| ARC (25-shot) | 60.49 | 60.32 | 60.92 | 57.85 | 61.18 | |
|
| HellaSwag (10-shot)| 80.16 | 79.21 | 79.52 | 79.28 | 77.82 | |
|
| MMLU (5-shot) | 68.98 | 67.07 | 67.74 | 67.14 | 67.93 | |
|
| TruthfulQA-MC2 | 54.03 | 53.89 | 54.11 | 51.87 | 53.58 | |
|
| Winogrande (5-shot)| 77.98 | 76.24 | 76.48 | 76.4 | 76.64 | |
|
| GSM8K (5-shot) | 75.44 | 71.27 | 75.36 | 73.47 | 72.25 | |
|
| Average | 69.51 | 68.00 | <b>69.02</b> | 67.67 | 68.23 | |
|
| Relative performance | 100% | 97.83% | <b>99.3%</b> | 97.35% | 98.16% | |
|
|
|
You can reproduce the results above with `lm-eval` version `0.4.3` (`pip install lm-eval==0.4.3`).
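
A minimal sketch of one way to run the harness on the quantized model via lm-eval's Python API, re-using the `model` and `tokenizer` objects loaded as in the Usage section below (the task selection, few-shot setting and batch size here are illustrative):

```Python
import lm_eval
from lm_eval.models.huggingface import HFLM

#Wrap the already-loaded HQQ model and tokenizer (see the Usage section below) for lm-eval
lm = HFLM(pretrained=model, tokenizer=tokenizer, batch_size=8)

#Run one of the benchmarks from the table above (example: ARC, 25-shot)
results = lm_eval.simple_evaluate(model=lm, tasks=["arc_challenge"], num_fewshot=25)
print(results["results"])
```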
|
|
|
## Usage |
|
First, install the dependencies:
|
``` |
|
pip install git+https://github.com/mobiusml/hqq.git #master branch fix |
|
pip install bitblas #if you use the bitblas backend |
|
``` |
|
Also, make sure you use torch `2.4.0` or newer (or a nightly build) with CUDA 12.1 or later.
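
A quick way to verify your environment before running the model (a minimal sketch):

```Python
import torch

#Check the PyTorch / CUDA build (expecting torch >= 2.4.0 and CUDA >= 12.1)
print(torch.__version__, torch.version.cuda)
assert torch.cuda.is_available(), "a CUDA device is required for the optimized backends"
```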
|
|
|
Then you can use the sample code below: |
|
```Python
|
import torch |
|
from transformers import AutoTokenizer |
|
from hqq.models.hf.base import AutoHQQHFModel |
|
from hqq.utils.patching import * |
|
from hqq.core.quantize import * |
|
from hqq.utils.generation_hf import HFGenerator |
|
|
|
#Settings |
|
################################################### |
|
backend = "torchao_int4" #"torchao_int4" (4-bit only), "bitblas" (4-bit + 2-bit) or "gemlite" (8-bit, 4-bit, 2-bit, 1-bit)
|
compute_dtype = torch.bfloat16 if backend=="torchao_int4" else torch.float16 |
|
device = 'cuda:0' |
|
cache_dir = '.' |
|
|
|
#Load the model |
|
################################################### |
|
#model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq' #no calib version |
|
model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib' #calibrated version |
|
|
|
model = AutoHQQHFModel.from_quantized(model_id, cache_dir=cache_dir, compute_dtype=compute_dtype, device=device).eval() |
|
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir) |
|
|
|
#Use optimized inference kernels |
|
################################################### |
|
prepare_for_inference(model, backend=backend) |
|
|
|
#Generate |
|
################################################### |
|
#For longer context, make sure to allocate enough cache via the cache_size= parameter |
|
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup() #Warm-up takes a while |
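
#Illustrative variant with a larger cache for long prompts (the cache_size value here is only an example):
#gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial", cache_size=8192).warmup()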
|
|
|
gen.generate("Write an essay about large language models", print_tokens=True) |
|
gen.generate("Tell me a funny joke!", print_tokens=True) |
|
gen.generate("How to make a yummy chocolate cake?", print_tokens=True) |
|
|
|
``` |
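
Since the quantized model behaves like a regular `transformers` model, the standard `generate()` API should also work if you prefer it over `HFGenerator`. The snippet below is a sketch that re-uses the `model`, `tokenizer` and `device` objects from above; the chat-template call and sampling parameters are illustrative:

```Python
#Plain transformers generation (sketch), re-using the model/tokenizer loaded above
messages = [{"role": "user", "content": "Tell me a funny joke!"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```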