---
license: llama3.1
train: false
inference: false
pipeline_tag: text-generation
---
This is an <a href="https://github.com/mobiusml/hqq/">HQQ</a> all-4-bit (group-size=64) quantized <a href="https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct">Llama3.1-8B-Instruct</a> model.
We provide two versions:
* Calibration-free version: https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq/
* Calibrated version: https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib/
![image/png](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/i0vpy66jdz3IlGQcbKqHe.png)
![image/gif](https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq/resolve/main/llama3.1_4bit.gif)
## Model Size
| Models | fp16| HQQ 4-bit/gs-64 | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"> AWQ 4-bit </a>| <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"> GPTQ 4-bit </a> |
|:-------------------:|:--------:|:----------------:|:----------------:|:----------------:|
| Bits per weight (linear layers) | 16 | 4.5 | 4.25 | 4.25 |
| VRAM (GB) | 15.7 | 6.1 | 6.3 | 5.7 |
## Model Decoding Speed
| Models | fp16| HQQ 4-bit/gs-64| <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"> AWQ 4-bit </a>| <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"> GPTQ 4-bit </a> |
|:-------------------:|:--------:|:----------------:|:----------------:|:----------------:|
| Decoding* - short seq (tokens/sec)| 53 | <b>125</b> | 67 | 3.7 |
| Decoding* - long seq (tokens/sec)| 50 | <b>97</b> | 65 | 21 |
*: measured on an RTX 3090.
## Performance
| Models | fp16 | HQQ 4-bit/gs-64 (no calib) | HQQ 4-bit/gs-64 (calib) | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"> AWQ 4-bit </a> | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"> GPTQ 4-bit </a> |
|:-------------------:|:--------:|:----------------:|:----------------:|:----------------:|:----------------:|
| ARC (25-shot) | 60.49 | 60.32 | 60.92 | 57.85 | 61.18 |
| HellaSwag (10-shot)| 80.16 | 79.21 | 79.52 | 79.28 | 77.82 |
| MMLU (5-shot) | 68.98 | 67.07 | 67.74 | 67.14 | 67.93 |
| TruthfulQA-MC2 | 54.03 | 53.89 | 54.11 | 51.87 | 53.58 |
| Winogrande (5-shot)| 77.98 | 76.24 | 76.48 | 76.4 | 76.64 |
| GSM8K (5-shot) | 75.44 | 71.27 | 75.36 | 73.47 | 72.25 |
| Average | 69.51 | 68.00 | <b>69.02</b> | 67.67 | 68.23 |
| Relative performance | 100% | 97.83% | <b>99.3%</b> | 97.35% | 98.16% |
You can reproduce the results above with <a href="https://github.com/EleutherAI/lm-evaluation-harness">lm-eval</a> (`pip install lm-eval==0.4.3`).
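For example, here is a minimal sketch of scoring the quantized model on ARC (25-shot) through lm-eval's Python API. This is not the exact evaluation setup used for the table above; it assumes the `HFLM` wrapper and `simple_evaluate` signatures of lm-eval 0.4.x and loads the model the same way as in the Usage section below.
``` Python
# Hedged sketch: evaluate the HQQ-quantized model on ARC-Challenge (25-shot) with lm-eval.
# Wrapper/function names follow lm-eval 0.4.x; adjust tasks and batch size as needed.
import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from lm_eval import simple_evaluate
from lm_eval.models.huggingface import HFLM

model_id  = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq'
model     = AutoHQQHFModel.from_quantized(model_id, compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

lm      = HFLM(pretrained=model, tokenizer=tokenizer, batch_size=8)
results = simple_evaluate(model=lm, tasks=["arc_challenge"], num_fewshot=25)
print(results["results"]["arc_challenge"])
```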
## Usage
First, install the dependencies:
```
pip install git+https://github.com/mobiusml/hqq.git #master branch fix
pip install bitblas #if you want to use the bitblas backend
```
Also, make sure you are using torch `2.4.0` or newer (or a nightly build) with CUDA 12.1 or later.
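A quick environment check before running the sample code (just a sanity-check sketch):
``` Python
# Sanity check: expects torch >= 2.4.0 built against CUDA 12.1 or newer.
import torch
print(torch.__version__, torch.version.cuda)
assert torch.cuda.is_available(), "A CUDA GPU is required for the optimized backends."
```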
Then you can use the sample code below:
``` Python
import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.utils.patching import *
from hqq.core.quantize import *
from hqq.utils.generation_hf import HFGenerator
#Load the model
###################################################
model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq' #no calib version
#model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib' #calibrated version
compute_dtype = torch.bfloat16 #bfloat16 for torchao, float16 for bitblas
cache_dir = '.'
model = AutoHQQHFModel.from_quantized(model_id, cache_dir=cache_dir, compute_dtype=compute_dtype)
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, quant_scale=False, quant_zero=False, axis=1)
patch_linearlayers(model, patch_add_quant_config, quant_config)
#Use optimized inference kernels
###################################################
HQQLinear.set_backend(HQQBackend.PYTORCH)
#prepare_for_inference(model) #default backend
prepare_for_inference(model, backend="torchao_int4")
#prepare_for_inference(model, backend="bitblas") #takes a while to init...
#Generate
###################################################
#For longer context, make sure to allocate enough cache via the cache_size= parameter
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup() #Warm-up takes a while
gen.generate("Write an essay about large language models", print_tokens=True)
gen.generate("Tell me a funny joke!", print_tokens=True)
gen.generate("How to make a yummy chocolate cake?", print_tokens=True)
```
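If you prefer plain `transformers` generation over `HFGenerator`, the following sketch applies the Llama 3.1 chat template directly. It assumes `model` and `tokenizer` are loaded as in the snippet above; the generation settings are illustrative rather than recommended values.
``` Python
# Hedged alternative: standard transformers generation with the chat template.
# Assumes `model` and `tokenizer` from the snippet above; settings are illustrative.
import torch

messages = [{"role": "user", "content": "Tell me a funny joke!"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```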