---
license: apache-2.0
tags:
- moe
train: false
inference: false
pipeline_tag: text-generation
---

## Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ
This is a version of the Mixtral-8x7B-Instruct-v0.1 model (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) quantized with a mix of 4-bit and 2-bit precision via Half-Quadratic Quantization (HQQ).

More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 2-bit.

The difference between this model and https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ is that this one offloads the quantization metadata to the CPU (see `offload_meta=True` in the Quantization section below), so it only needs 13GB of VRAM to run instead of 20GB!

### Basic Usage
To run the model, install the HQQ library from https://github.com/mobiusml/hqq and use it as follows:
```Python
model_id = 'mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ'

# Load the quantized model and its tokenizer
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)

# Optional: use the compiled PyTorch backend for faster dequantization
from hqq.core.quantize import *
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)

# Text generation
prompt = "<s> [INST] How do I build a car? [/INST] "
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
outputs = model.generate(**(inputs.to('cuda')), max_new_tokens=1000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
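
As a quick sanity check of the reduced memory footprint mentioned above, you can print the peak GPU memory after generation. This is a minimal sketch using standard PyTorch utilities; the exact number depends on your sequence length and cache usage, but it should stay roughly in line with the ~13GB figure quoted above:
```Python
import torch
# Peak GPU memory allocated on the current device since the start of the run, in GB
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB")
```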

----------------------------------------------------------------------------------------------------------------------------------

### Quantization
You can reproduce the model using the following quantization configs:

```Python
import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
from hqq.core.quantize import *

# Set hf_auth to your Hugging Face access token and cache_path to a local cache directory
model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
model = HQQModelForCausalLM.from_pretrained(model_id, use_auth_token=hf_auth, cache_dir=cache_path)

# Quantization params: 4-bit attention, 2-bit experts, metadata offloaded to the CPU
attn_params = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)
experts_params = BaseQuantizeConfig(nbits=2, group_size=16, offload_meta=True)
attn_params['scale_quant_params']['group_size'] = 256
attn_params['zero_quant_params']['group_size'] = 256

quant_config = {}
# Attention
quant_config['self_attn.q_proj'] = attn_params
quant_config['self_attn.k_proj'] = attn_params
quant_config['self_attn.v_proj'] = attn_params
quant_config['self_attn.o_proj'] = attn_params
# Experts
quant_config['block_sparse_moe.experts.w1'] = experts_params
quant_config['block_sparse_moe.experts.w2'] = experts_params
quant_config['block_sparse_moe.experts.w3'] = experts_params

# Quantize
model.quantize_model(quant_config=quant_config, compute_dtype=torch.float16)
model.eval()
```
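
After quantizing, you would typically save the model locally so it can later be reloaded with `from_quantized` as in the Basic Usage section. A minimal sketch, assuming the HQQ engine wrapper exposes a `save_quantized` method (check the HQQ repository for the exact API of your version); the save directory name is just an example:
```Python
# Save the quantized weights to a local folder (example path)
save_dir = 'Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ'
model.save_quantized(save_dir)

# Reload later exactly as in the Basic Usage section
model = HQQModelForCausalLM.from_quantized(save_dir)
```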