mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ

Text Generation

Mixture of Experts

text-generation-inference

appoose committed on Feb 23

Commit 48d4273 • 1 parent: 956a729

adding vram usage

Files changed (1)

README.md +16 -10

@@ -11,8 +11,24 @@ This is a version of the Mixtral-8x7B-Instruct-v0.1 model (https://huggingface.c
 More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 2-bit.
 The difference between this model and https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ is that this one offloads the metadata to the CPU and you only need 13GB Vram to run it instead of 20GB!
 ### Basic Usage
 To run the model, install the HQQ library from https://github.com/mobiusml/hqq and use it as follows:
@@ -38,16 +54,6 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
-## Performance
-| Models            | Mixtral Original | HQQ quantized    |
-|-------------------|------------------|------------------|
-| ARC (25-shot)     | 70.22            | 66.47            |
-| TruthfulQA-MC2    | 64.57            | 62.85            |
-| Winogrande (5-shot)| 81.36           | 79.40            |
-----------------------------------------------------------------------------------------------------------------------------------
-</p>
 ### Quantization
 You can reproduce the model using the following quant configs:

 More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 2-bit.
+![image/gif](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/-gwGOZHDb9l5VxLexIhkM.gif)
 The difference between this model and https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ is that this one offloads the metadata to the CPU and you only need 13GB Vram to run it instead of 20GB!
+----------------------------------------------------------------------------------------------------------------------------------
+</p>
+## Performance
+| Models            | Mixtral Original | HQQ quantized    |
+|-------------------|------------------|------------------|
+| Runtime VRAM      | 90 GB            | <b>13 GB</b>     |
+| ARC (25-shot)     | 70.22            | 66.47            |
+| TruthfulQA-MC2    | 64.57            | 62.85            |
+| Winogrande (5-shot)| 81.36           | 79.40            |
 ### Basic Usage
 To run the model, install the HQQ library from https://github.com/mobiusml/hqq and use it as follows:
 ```
 ### Quantization
 You can reproduce the model using the following quant configs:
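
Separately from the quant configs, the VRAM figures this commit adds to the Performance table (about 90 GB for the original model, 13 GB for this one) can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is not taken from the README and does not use the HQQ library. It assumes the layer sizes from the public Mixtral-8x7B configuration (hidden size 4096, feed-forward size 14336, 32 layers, 8 experts, 8 KV heads of dimension 128, vocabulary 32000), assumes embeddings, output head and router gates stay in 16-bit, and counts packed weights only. Quantization metadata (which this variant offloads to the CPU), normalization weights, activations and the KV cache are ignored.

```python
# Rough weight-memory estimate for Mixtral-8x7B at different bit widths.
# ASSUMPTION: the dimensions below come from the public Mixtral-8x7B config,
# not from this README. Metadata, activations and KV cache are not counted.
HIDDEN = 4096    # model (hidden) dimension
FFN = 14336      # expert feed-forward dimension
LAYERS = 32
EXPERTS = 8
KV_DIM = 1024    # 8 KV heads x 128 head dim (grouped-query attention)
VOCAB = 32000


def weight_counts():
    """Return the number of weights in each component group."""
    # q_proj and o_proj are HIDDEN x HIDDEN; k_proj and v_proj are HIDDEN x KV_DIM.
    attn_per_layer = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM
    # Each expert has three HIDDEN x FFN projection matrices.
    per_expert = 3 * HIDDEN * FFN
    return {
        "attention": LAYERS * attn_per_layer,
        "experts": LAYERS * EXPERTS * per_expert,
        # Input embeddings, output head, and the per-layer router gates.
        "other": 2 * VOCAB * HIDDEN + LAYERS * HIDDEN * EXPERTS,
    }


def gigabytes(counts, bits):
    """Total size in GB (1e9 bytes); `bits` maps component -> bits per weight."""
    total_bits = sum(counts[name] * bits[name] for name in counts)
    return total_bits / 8 / 1e9


counts = weight_counts()
fp16_gb = gigabytes(counts, {"attention": 16, "experts": 16, "other": 16})
quant_gb = gigabytes(counts, {"attention": 4, "experts": 2, "other": 16})

print(f"16-bit weights:                  {fp16_gb:.1f} GB")
print(f"4-bit attention + 2-bit experts: {quant_gb:.1f} GB")
```

Under these assumptions the script reports about 93.4 GB for 16-bit weights and about 12.5 GB for the 4-bit/2-bit mix, in line with the 90 GB and 13 GB rows of the table. The experts hold almost all of the weights, which is why quantizing them to 2-bit accounts for most of the saving.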