mobicham committed
Commit 78ace90
1 Parent(s): cbbc082

Update README.md

Files changed (1)
  1. README.md +18 -13
README.md CHANGED
@@ -7,14 +7,15 @@ inference: false
  pipeline_tag: text-generation
  ---
  ## Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ
- This is a version of the Mixtral-8x7B-Instruct-v0.1 model (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) quantized with a mix of 4-bit and 2-bit via Half-Quadratic Quantization (HQQ).
- More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 2-bit.
+ This is a version of the <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct-v0.1 model</a> quantized with a mix of 4-bit and 2-bit via Half-Quadratic Quantization (HQQ). More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 2-bit.
+ The difference between this model and <a href="https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ">our previous release</a> is that this one offloads the metadata to the CPU, so you only need 13GB of VRAM to run it instead of 20GB!
+ *Note*: this model was updated to use a group-size of 128 instead of 256 for the scale/zero parameters, which slightly improves the overall score with a negligible increase in VRAM.

  ![image/gif](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/-gwGOZHDb9l5VxLexIhkM.gif)

- The difference between this model and https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ is that this one offloads the metadata to the CPU and you only need 13GB Vram to run it instead of 20GB!

  ----------------------------------------------------------------------------------------------------------------------------------
  </p>
@@ -23,14 +24,14 @@ The difference between this model and https://huggingface.co/mobiuslabsgmbh/Mixt
  ## Performance
  | Models | Mixtral Original | HQQ quantized |
  |-------------------|------------------|------------------|
- | Runtime VRAM | 90 GB | <b>13 GB</b> |
- | ARC (25-shot) | 70.22 | 66.47 |
- | Hellaswag (10-shot)| 87.63 | 84.78 |
- | MMLU (5-shot) | 71.16 | 67.35 |
- | TruthfulQA-MC2 | 64.58 | 62.85 |
- | Winogrande (5-shot)| 81.37 | 79.40 |
- | GSM8K (5-shot)| 60.73 | 45.86 |
- | Average| 72.62 | 67.79 |
+ | Runtime VRAM | 94 GB | <b>13.5 GB</b> |
+ | ARC (25-shot) | 70.22 | 66.55 |
+ | Hellaswag (10-shot)| 87.63 | 84.83 |
+ | MMLU (5-shot) | 71.16 | 67.39 |
+ | TruthfulQA-MC2 | 64.58 | 62.80 |
+ | Winogrande (5-shot)| 81.37 | 80.03 |
+ | GSM8K (5-shot)| 60.73 | 45.41 |
+ | Average| 72.62 | 67.83 |

  ## Screencast
@@ -104,8 +105,12 @@ model = HQQModelForCausalLM.from_pretrained(model_id, use_auth_token=hf_auth
  from hqq.core.quantize import *
  attn_prams = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)
  experts_params = BaseQuantizeConfig(nbits=2, group_size=16, offload_meta=True)
- attn_prams['scale_quant_params']['group_size'] = 256
- attn_prams['zero_quant_params']['group_size'] = 256
+ zero_scale_group_size = 128
+
+ attn_prams['scale_quant_params']['group_size'] = zero_scale_group_size
+ attn_prams['zero_quant_params']['group_size'] = zero_scale_group_size
+ experts_params['scale_quant_params']['group_size'] = zero_scale_group_size
+ experts_params['zero_quant_params']['group_size'] = zero_scale_group_size

  quant_config = {}
  #Attention
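
The quantization hunk above ends right at `quant_config = {}` / `#Attention`, so the per-layer assignments are not visible in this diff. The sketch below shows how that dictionary is typically filled in for a Mixtral HQQ setup of this kind; the module names and the `quantize_model` call are not taken from this commit and should be read as assumptions.

```Python
# Hedged sketch, not part of this commit: the diff cuts off before the per-layer
# assignments, so the Mixtral module names and the quantize_model() call below are
# assumptions based on the usual HQQ Mixtral pattern. It continues the snippet above,
# reusing model, attn_prams and experts_params.

quant_config = {}

# Attention projections -> 4-bit config
quant_config['self_attn.q_proj'] = attn_prams
quant_config['self_attn.k_proj'] = attn_prams
quant_config['self_attn.v_proj'] = attn_prams
quant_config['self_attn.o_proj'] = attn_prams

# MoE expert projections -> 2-bit config
quant_config['block_sparse_moe.experts.w1'] = experts_params
quant_config['block_sparse_moe.experts.w2'] = experts_params
quant_config['block_sparse_moe.experts.w3'] = experts_params

# Quantize the full-precision model loaded earlier with from_pretrained(...)
model.quantize_model(quant_config=quant_config)
```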
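
Since the headline claim of this update is that the quantized model runs in roughly 13GB of VRAM, here is a minimal usage sketch for loading the pre-quantized weights. It assumes the `hqq.engine.hf` wrapper and its `from_quantized` API used in other HQQ model cards, plus a CUDA GPU; none of this appears in the diff itself.

```Python
# Hedged usage sketch, not part of this commit: load the ready-quantized model
# (metadata offloaded to CPU) and run a short generation on a CUDA GPU.
# Assumes hqq.engine.hf.HQQModelForCausalLM.from_quantized as in other HQQ model cards.
from hqq.engine.hf import HQQModelForCausalLM
from transformers import AutoTokenizer

model_id  = 'mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model     = HQQModelForCausalLM.from_quantized(model_id)

# Build a Mixtral-Instruct style prompt via the tokenizer's chat template
chat   = [{"role": "user", "content": "Explain Half-Quadratic Quantization in one paragraph."}]
inputs = tokenizer.apply_chat_template(chat, add_generation_prompt=True,
                                       return_tensors="pt").to('cuda')

out = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```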