appoose committed
Commit
956a729
1 Parent(s): 49119b7

adding initial metrics

Files changed (1): README.md (+10 -1)
README.md CHANGED
@@ -37,10 +37,19 @@ outputs = model.generate(**(inputs.to('cuda')), max_new_tokens=1000)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 
+
+## Performance
+| Benchmark           | Mixtral Original | HQQ quantized |
+|---------------------|------------------|---------------|
+| ARC (25-shot)       | 70.22            | 66.47         |
+| TruthfulQA-MC2      | 64.57            | 62.85         |
+| Winogrande (5-shot) | 81.36            | 79.40         |
+
 ----------------------------------------------------------------------------------------------------------------------------------
 </p>
 
 ### Quantization
+
 You can reproduce the model using the following quant configs:
 
 ``` Python
@@ -70,4 +79,4 @@ quant_config['block_sparse_moe.experts.w3'] = experts_params
 model.quantize_model(quant_config=quant_config, compute_dtype=torch.float16);
 model.eval();
 ```
-
+The code is available on GitHub at https://github.com/mobiusml/hqq/blob/master/examples/hf/mixtral_13GB_example.py
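As context for the quantization snippet in the diff above, here is a minimal sketch of how such a per-layer `quant_config` dict might be assembled. The layer names mirror those in the diff (`block_sparse_moe.experts.w3` etc.), but the `make_params` helper and the specific `nbits`/`group_size` values are illustrative assumptions, not the exact settings from this commit; see the linked example script for the real configuration.

```python
# Sketch of a per-layer quantization config in the style used above.
# The bit widths and group sizes below are illustrative assumptions.

def make_params(nbits, group_size):
    """Bundle quantization settings for a group of layers."""
    return {"nbits": nbits, "group_size": group_size}

# Attention projections kept at moderate precision (assumed values)...
attn_params = make_params(nbits=4, group_size=64)
# ...while MoE expert weights are quantized more aggressively (assumed values).
experts_params = make_params(nbits=2, group_size=16)

quant_config = {}
# Attention projection layers
quant_config["self_attn.q_proj"] = attn_params
quant_config["self_attn.k_proj"] = attn_params
quant_config["self_attn.v_proj"] = attn_params
quant_config["self_attn.o_proj"] = attn_params
# Mixture-of-Experts expert layers (w1/w2/w3, as referenced in the diff)
quant_config["block_sparse_moe.experts.w1"] = experts_params
quant_config["block_sparse_moe.experts.w2"] = experts_params
quant_config["block_sparse_moe.experts.w3"] = experts_params
```

A config shaped like this is then passed to the model's quantization entry point (in the diff, `model.quantize_model(quant_config=quant_config, compute_dtype=torch.float16)`), which looks up each layer's settings by name.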