metadata
license: apache-2.0
tags:
- moe
train: false
inference: false
pipeline_tag: text-generation
Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ
This is a version of the Mixtral-8x7B-Instruct-v0.1 model (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) quantized with a mix of 4-bit and 2-bit via Half-Quadratic Quantization (HQQ).
More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 2-bit.
The difference between this model and https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ is that this one offloads the metadata to the CPU and you only need 13GB Vram to run it instead of 20GB!
Basic Usage
To run the model, install the HQQ library from https://github.com/mobiusml/hqq and use it as follows:
model_id = 'mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ'
#Load the model
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)
#Optional
from hqq.core.quantize import *
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)
#Text Generation
prompt = "<s> [INST] How do I build a car? [/INST] "
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
outputs = model.generate(**(inputs.to('cuda')), max_new_tokens=1000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Quantization
You can reproduce the model using the following quant configs:
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
model = HQQModelForCausalLM.from_pretrained(model_id, use_auth_token=hf_auth, cache_dir=cache_path)
#Quantize params
from hqq.core.quantize import *
attn_prams = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)
experts_params = BaseQuantizeConfig(nbits=2, group_size=16, offload_meta=True)
attn_prams['scale_quant_params']['group_size'] = 256
attn_prams['zero_quant_params']['group_size'] = 256
quant_config = {}
#Attention
quant_config['self_attn.q_proj'] = attn_prams
quant_config['self_attn.k_proj'] = attn_prams
quant_config['self_attn.v_proj'] = attn_prams
quant_config['self_attn.o_proj'] = attn_prams
#Experts
quant_config['block_sparse_moe.experts.w1'] = experts_params
quant_config['block_sparse_moe.experts.w2'] = experts_params
quant_config['block_sparse_moe.experts.w3'] = experts_params
#Quantize
model.quantize_model(quant_config=quant_config, compute_dtype=torch.float16);
model.eval();