---
license: apache-2.0
tags:
- moe
train: false
inference: false
pipeline_tag: text-generation
---
## Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ
This is a version of the Mixtral-8x7B-Instruct-v0.1 model quantized with a mix of 4-bit and 2-bit via Half-Quadratic Quantization (HQQ). More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 2-bit.

The difference between this model and our previous release is that this one offloads the metadata to the CPU, so you only need 13 GB of VRAM to run it instead of 20 GB!

*Note*: this model was updated to use a group-size of 128 instead of 256 for the scale/zero parameters, which slightly improves the overall score with a negligible increase in VRAM.

![image/gif](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/-gwGOZHDb9l5VxLexIhkM.gif)

----------------------------------------------------------------------------------------------------------------------------------
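The 13 GB figure refers to the GPU memory used at runtime. As a quick sanity check on your own setup, you can query PyTorch's allocator after running a generation (see Basic Usage below); this is a minimal sketch assuming a single-GPU setup:
``` Python
import torch

#Peak GPU memory allocated by PyTorch tensors on the current device (in GB).
#Note: this excludes the CUDA context overhead reported by nvidia-smi.
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM allocated: {peak_gb:.1f} GB")
```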

## Performance
| Models | Mixtral Original | HQQ quantized |
|-------------------|------------------|------------------|
| Runtime VRAM | 94 GB | 13.5 GB |
| ARC (25-shot) | 70.22 | 66.55 |
| Hellaswag (10-shot)| 87.63 | 84.83 |
| MMLU (5-shot) | 71.16 | 67.39 |
| TruthfulQA-MC2 | 64.58 | 62.80 |
| Winogrande (5-shot)| 81.37 | 80.03 |
| GSM8K (5-shot)| 60.73 | 45.41 |
| Average| 72.62 | 67.83 |

## Screencast
Here is a short screencast of the model running on an RTX 4090:

![image/gif](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/upGS5kOw_m-ry8WcMO9gJ.gif)

### Basic Usage
To run the model, install the HQQ library from https://github.com/mobiusml/hqq and use it as follows:
``` Python
import transformers
from threading import Thread

model_id = 'mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ'

#Load the model
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model     = HQQModelForCausalLM.from_quantized(model_id)

#Optional: set the backend. This requires installing the CUDA kernels first:
# git clone https://github.com/mobiusml/hqq/
# cd hqq/kernels && python setup_cuda.py install
from hqq.core.quantize import *
HQQLinear.set_backend(HQQBackend.ATEN_BACKPROP)

def chat_processor(chat, max_new_tokens=100, do_sample=True):
    tokenizer.use_default_system_prompt = False
    streamer = transformers.TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)

    generate_params = dict(
        tokenizer(" [INST] " + chat + " [/INST] ", return_tensors="pt").to('cuda'),
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        top_p=0.90,
        top_k=50,
        temperature=0.6,
        num_beams=1,
        repetition_penalty=1.2,
    )

    #Run generation in a background thread and stream the output token-by-token
    t = Thread(target=model.generate, kwargs=generate_params)
    t.start()

    outputs = []
    for text in streamer:
        outputs.append(text)
        print(text, end="", flush=True)

    return outputs

################################################################################################
#Generation
outputs = chat_processor("How do I build a car?", max_new_tokens=1000, do_sample=False)
```

### Quantization
You can reproduce the model using the following quantization configs:
``` Python
import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
model    = HQQModelForCausalLM.from_pretrained(model_id) #optionally pass use_auth_token/cache_dir here

#Quantization params
from hqq.core.quantize import *
attn_params    = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)
experts_params = BaseQuantizeConfig(nbits=2, group_size=16, offload_meta=True)

#Quantize the scale/zero parameters with a group-size of 128
zero_scale_group_size = 128
attn_params['scale_quant_params']['group_size']    = zero_scale_group_size
attn_params['zero_quant_params']['group_size']     = zero_scale_group_size
experts_params['scale_quant_params']['group_size'] = zero_scale_group_size
experts_params['zero_quant_params']['group_size']  = zero_scale_group_size

quant_config = {}
#Attention layers: 4-bit
quant_config['self_attn.q_proj'] = attn_params
quant_config['self_attn.k_proj'] = attn_params
quant_config['self_attn.v_proj'] = attn_params
quant_config['self_attn.o_proj'] = attn_params
#Experts: 2-bit
quant_config['block_sparse_moe.experts.w1'] = experts_params
quant_config['block_sparse_moe.experts.w2'] = experts_params
quant_config['block_sparse_moe.experts.w3'] = experts_params

#Quantize
model.quantize_model(quant_config=quant_config, compute_dtype=torch.float16)
model.eval()
```
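After quantizing, you will typically want to persist the result so it can be reloaded later with `from_quantized` instead of re-running quantization. This is a minimal sketch, assuming the `save_quantized` helper of the HQQ engine and a hypothetical output folder `save_dir`:
``` Python
#Hypothetical folder where the quantized weights/config will be written
save_dir = 'Mixtral-8x7B-attn-4bit-moe-2bit-metaoffload-HQQ'

#Persist the quantized model, then reload it without re-quantizing
model.save_quantized(save_dir)
model = HQQModelForCausalLM.from_quantized(save_dir)
```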
The full example code is available on GitHub: https://github.com/mobiusml/hqq/blob/master/examples/hf/mixtral_13GB_example.py