|
--- |
|
license: llama3.1 |
|
train: false |
|
inference: false |
|
pipeline_tag: text-generation |
|
--- |
|
This is a version of <a href="https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct">Llama3.1-8B-Instruct</a> quantized with <a href="https://github.com/mobiusml/hqq/">HQQ</a> to 4-bit (group-size=64) across all linear layers.
|
We provide two versions: |
|
* Calibration-free version: https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq/ |
|
* Calibrated version: https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib/ |
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/i0vpy66jdz3IlGQcbKqHe.png) |
|
|
|
![image/gif](https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq/resolve/main/llama3.1_4bit.gif) |
|
|
|
|
|
## Model Size |
|
| Models | fp16| HQQ 4-bit/gs-64 | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"> AWQ 4-bit </a>| <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"> GPTQ 4-bit </a> | |
|
|:-------------------:|:--------:|:----------------:|:----------------:|:----------------:| |
|
| Bitrate (Linear layers) | 16 | 4.5 | 4.25 | 4.25 | |
|
| VRAM (GB) | 15.7 | 6.1 | 6.3 | 5.7 | |
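
The effective bitrate is above 4 bits because, in addition to the 4-bit weights, each group stores quantization metadata (a scale and a zero-point). Assuming fp16 metadata, a group size of 64 adds (16 + 16) / 64 = 0.5 extra bits per weight, which matches the 4.5 bits reported for HQQ; the 4.25-bit figure for the AWQ/GPTQ checkpoints corresponds to less metadata overhead per weight.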
|
|
|
## Model Decoding Speed |
|
| Models | fp16| HQQ 4-bit/gs-64| <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"> AWQ 4-bit </a>| <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"> GPTQ 4-bit </a> | |
|
|:-------------------:|:--------:|:----------------:|:----------------:|:----------------:| |
|
| Decoding* - short seq (tokens/sec)| 53 | <b>125</b> | 67 | 3.7 | |
|
| Decoding* - long seq (tokens/sec)| 50 | <b>97</b> | 65 | 21 | |
|
|
|
*: measured on an RTX 3090
|
|
|
## Performance |
|
|
|
| Models | fp16 | HQQ 4-bit/gs-64 (no calib) | HQQ 4-bit/gs-64 (calib) | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"> AWQ 4-bit </a> | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"> GPTQ 4-bit </a> | |
|
|:-------------------:|:--------:|:----------------:|:----------------:|:----------------:|:----------------:| |
|
| ARC (25-shot) | 60.49 | 60.32 | 60.92 | 57.85 | 61.18 | |
|
| HellaSwag (10-shot)| 80.16 | 79.21 | 79.52 | 79.28 | 77.82 | |
|
| MMLU (5-shot) | 68.98 | 67.07 | 67.74 | 67.14 | 67.93 | |
|
| TruthfulQA-MC2 | 54.03 | 53.89 | 54.11 | 51.87 | 53.58 | |
|
| Winogrande (5-shot)| 77.98 | 76.24 | 76.48 | 76.4 | 76.64 | |
|
| GSM8K (5-shot) | 75.44 | 71.27 | 75.36 | 73.47 | 72.25 | |
|
| Average | 69.51 | 68.00 | <b>69.02</b> | 67.67 | 68.23 | |
|
| Relative performance | 100% | 97.83% | <b>99.3%</b> | 97.35% | 98.16% | |
|
|
|
You can reproduce the results above with `lm-eval` version `0.4.3` (`pip install lm-eval==0.4.3`).
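
A minimal sketch of one way to run the harness on the quantized model via lm-eval's Python API, re-using the `model` and `tokenizer` objects loaded as in the Usage section below (the task selection, few-shot setting and batch size here are illustrative):

```Python
import lm_eval
from lm_eval.models.huggingface import HFLM

#Wrap the already-loaded HQQ model and tokenizer (see the Usage section below) for lm-eval
lm = HFLM(pretrained=model, tokenizer=tokenizer, batch_size=8)

#Run one of the benchmarks from the table above (example: ARC, 25-shot)
results = lm_eval.simple_evaluate(model=lm, tasks=["arc_challenge"], num_fewshot=25)
print(results["results"])
```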
|
|
|
## Usage |
|
First, install the dependencies:
|
``` |
|
pip install git+https://github.com/mobiusml/hqq.git #master branch fix |
|
pip install bitblas #if you use the bitblas backend |
|
``` |
|
Also, make sure you use torch `2.4.0` or newer (or a nightly build) with CUDA 12.1 or later.
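
A quick way to verify your environment before running the model (a minimal sketch):

```Python
import torch

#Check the PyTorch / CUDA build (expecting torch >= 2.4.0 and CUDA >= 12.1)
print(torch.__version__, torch.version.cuda)
assert torch.cuda.is_available(), "a CUDA device is required for the optimized backends"
```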
|
|
|
Then you can use the sample code below: |
|
```Python
|
import torch |
|
from transformers import AutoTokenizer |
|
from hqq.models.hf.base import AutoHQQHFModel |
|
from hqq.utils.patching import * |
|
from hqq.core.quantize import * |
|
from hqq.utils.generation_hf import HFGenerator |
|
|
|
#Settings |
|
################################################### |
|
backend = "torchao_int4" #"torchao_int4" (4-bit only), "bitblas" (4-bit + 2-bit) or "gemlite" (8-bit, 4-bit, 2-bit, 1-bit)
|
compute_dtype = torch.bfloat16 if backend=="torchao_int4" else torch.float16 |
|
device = 'cuda:0' |
|
cache_dir = '.' |
|
|
|
#Load the model |
|
################################################### |
|
#model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq' #no calib version |
|
model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib' #calibrated version |
|
|
|
model = AutoHQQHFModel.from_quantized(model_id, cache_dir=cache_dir, compute_dtype=compute_dtype, device=device).eval() |
|
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir) |
|
|
|
#Use optimized inference kernels |
|
################################################### |
|
prepare_for_inference(model, backend=backend) |
|
|
|
#Generate |
|
################################################### |
|
#For longer context, make sure to allocate enough cache via the cache_size= parameter |
|
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup() #Warm-up takes a while |
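
#Illustrative variant with a larger cache for long prompts (the cache_size value here is only an example):
#gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial", cache_size=8192).warmup()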
|
|
|
gen.generate("Write an essay about large language models", print_tokens=True) |
|
gen.generate("Tell me a funny joke!", print_tokens=True) |
|
gen.generate("How to make a yummy chocolate cake?", print_tokens=True) |
|
|
|
``` |
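
Since the quantized model behaves like a regular `transformers` model, the standard `generate()` API should also work if you prefer it over `HFGenerator`. The snippet below is a sketch that re-uses the `model`, `tokenizer` and `device` objects from above; the chat-template call and sampling parameters are illustrative:

```Python
#Plain transformers generation (sketch), re-using the model/tokenizer loaded above
messages = [{"role": "user", "content": "Tell me a funny joke!"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```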