wenhuach committed on
Commit
f201b74
1 Parent(s): 8cc1fac

replace with sym quantization as there is a large accuracy drop due to a kernel issue

Files changed (6)
  1. README.md +33 -47
  2. config.json +23 -24
  3. model.safetensors +2 -2
  4. quantize_config.json +4 -3
  5. tokenizer.json +1 -0
  6. tokenizer_config.json +1 -0
README.md CHANGED
@@ -1,21 +1,6 @@
- ---
- license: apache-2.0
- datasets:
- - NeelNanda/pile-10k
- ---
-
-
-
-
-
-
  ## Model Details

- This model is an int4 model with group_size 128 of [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) generated by [intel/auto-round](https://github.com/intel/auto-round).
-
- ## How To Use
-
-
@@ -23,7 +8,7 @@ This model is an int4 model with group_size 128 of [microsoft/phi-2](https://hug
  ### INT4 Inference with AutoGPTQ

- Install [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) from source first

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -35,32 +20,44 @@ inputs = tokenizer(text, return_tensors="pt", return_attention_mask=False).to(mo
  outputs = model.generate(**inputs, max_new_tokens=50)
  text = tokenizer.batch_decode(outputs)[0]
  print(text)
  ```
  ### Evaluate the model

- Install [lm-eval-harness 0.4.2](https://github.com/EleutherAI/lm-evaluation-harness.git) from source.

- Since we encountered an issue evaluating this model with lm-eval, we opted to evaluate the qdq model instead. In our assessment, we found that its accuracy closely matches that of the real quantized model in most cases except for some small models like opt-125m.

- | Metric | FP16 | INT4 qdq |
  | -------------- | ------ | ------ |
- | Avg. | 0.6138 | 0.6115 |
- | mmlu | 0.5325 | 0.5417 |
- | lambada_openai | 0.6276 | 0.6225 |
- | hellaswag | 0.5584 | 0.5498 |
- | winogrande | 0.7561 | 0.7545 |
- | piqa | 0.7867 | 0.7824 |
- | truthfulqa_mc1 | 0.3146 | 0.3060 |
- | openbookqa | 0.4020 | 0.4100 |
- | boolq | 0.8330 | 0.8327 |
- | arc_easy | 0.7992 | 0.7955 |
- | arc_challenge | 0.5282 | 0.5196 |
-
  ### Reproduce the model

@@ -76,16 +73,15 @@ python3 main.py \
  --group_size 128 \
  --bits 4 \
  --iters 1000 \
- --enable_minmax_tuning \
- --disable_quanted_input \
  --deployment_device 'gpu' \
- --scale_dtype 'fp32' \
- --eval_bs 32 \
  --output_dir "./tmp_autoround" \
- --amp

  ```

  ## Ethical Considerations and Limitations

  The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

@@ -106,13 +102,3 @@ Here are a couple of useful links to learn more about Intel's AI software:
  The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

- ## Cite
-
- @article{cheng2023optimize,
-   title={Optimize weight rounding via signed gradient descent for the quantization of llms},
-   author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao},
-   journal={arXiv preprint arXiv:2309.05516},
-   year={2023}
- }
-
- [arxiv](https://arxiv.org/abs/2309.05516) [github](https://github.com/intel/auto-round)
  ## Model Details

+ This model is an int4 model with group_size 128 and sym quantization of [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) generated by [intel/auto-round](https://github.com/intel/auto-round). We found a large accuracy drop with the asym kernel for this model.

  ### INT4 Inference with AutoGPTQ

+ pip install auto-gptq

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
  outputs = model.generate(**inputs, max_new_tokens=50)
  text = tokenizer.batch_decode(outputs)[0]
  print(text)
+ """
+ There is a girl who likes adventure,
+ She loves to explore and to venture.
+ She travels to faraway lands,
+ And meets people from different lands.
+ She learns new languages and cultures,
+ And makes friends with all kinds of people.
+ She is curious and brave and
+ """
  ```
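The diff collapses the model- and tokenizer-loading lines of the snippet above. A minimal end-to-end sketch of the same flow, assuming the published repo id Intel/phi-2-int4-inc (taken from the lm_eval command below) and relying on transformers' GPTQ integration with auto-gptq installed as above:

```python
# Hedged sketch: the repo id and device_map choice are assumptions, not shown in the diff.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Intel/phi-2-int4-inc"  # assumed from the lm_eval command in this card
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "There is a girl who likes adventure,"
# return_attention_mask=False mirrors the truncated input line visible in the hunk header
inputs = tokenizer(text, return_tensors="pt", return_attention_mask=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs)[0])
```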
  ### Evaluate the model

+ pip install lm-eval==0.4.2

+ ```bash
+ lm_eval --model hf --model_args pretrained="Intel/phi-2-int4-inc" --device cuda:0 --tasks lambada_openai,hellaswag,piqa,winogrande,truthfulqa_mc1,openbookqa,boolq,arc_easy,arc_challenge,mmlu --batch_size 16
+ ```
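Besides the CLI, lm-eval 0.4.x also exposes a Python entry point. A small sketch mirroring the command above; the argument names follow the harness's simple_evaluate API and should be checked against the installed version:

```python
# Hedged sketch of the lm-eval 0.4.2 Python API, mirroring the CLI call above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Intel/phi-2-int4-inc",
    tasks=["lambada_openai", "hellaswag", "piqa", "winogrande", "truthfulqa_mc1",
           "openbookqa", "boolq", "arc_easy", "arc_challenge", "mmlu"],
    batch_size=16,
    device="cuda:0",
)
# Per-task metrics live under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```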
+ | Metric | FP16 | INT4 |
  | -------------- | ------ | ------ |
+ | Avg. | 0.6131 | 0.6062 |
+ | mmlu | 0.5334 | 0.5241 |
+ | lambada_openai | 0.6243 | 0.6039 |
+ | hellaswag | 0.5581 | 0.5487 |
+ | winogrande | 0.7522 | 0.7585 |
+ | piqa | 0.7867 | 0.7840 |
+ | truthfulqa_mc1 | 0.3097 | 0.2974 |
+ | openbookqa | 0.4040 | 0.3960 |
+ | boolq | 0.8346 | 0.8346 |
+ | arc_easy | 0.8001 | 0.8013 |
+ | arc_challenge | 0.5282 | 0.5137 |
  ### Reproduce the model

  --group_size 128 \
  --bits 4 \
  --iters 1000 \
  --deployment_device 'gpu' \
+ --disable_low_gpu_mem_usage \
  --output_dir "./tmp_autoround" \
+
  ```
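The diff elides the start of the reproduction command (the hunk header shows it begins with python3 main.py \ from auto-round's example script). An equivalent hedged sketch via auto-round's Python API; the AutoRound constructor arguments follow the project README, and sym=True is inferred from this commit's quantize_config.json rather than from the visible flags:

```python
# Hedged sketch: reproduces the visible settings through the auto-round Python API.
# sym=True is inferred from quantize_config.json in this commit; verify the exact
# constructor signature against the installed auto-round version.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "microsoft/phi-2"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True, iters=1000)
autoround.quantize()
autoround.save_quantized("./tmp_autoround")
```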
  ## Ethical Considerations and Limitations

  The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

  The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.
config.json CHANGED
@@ -1,34 +1,31 @@
  {
  "_name_or_path": "/models/phi-2",
- "activation_function": "gelu_new",
  "architectures": [
  "PhiForCausalLM"
  ],
- "attn_pdrop": 0.0,
- "auto_map": {
-   "AutoConfig": "configuration_phi.PhiConfig",
-   "AutoModelForCausalLM": "modeling_phi.PhiForCausalLM"
- },
+ "attention_dropout": 0.0,
+ "bos_token_id": 50256,
  "embd_pdrop": 0.0,
- "flash_attn": false,
- "flash_rotary": false,
- "fused_dense": false,
- "img_processor": null,
+ "eos_token_id": 50256,
+ "hidden_act": "gelu_new",
+ "hidden_size": 2560,
  "initializer_range": 0.02,
- "layer_norm_epsilon": 1e-05,
- "model_type": "phi-msft",
- "n_embd": 2560,
- "n_head": 32,
- "n_head_kv": null,
- "n_inner": null,
- "n_layer": 32,
- "n_positions": 2048,
+ "intermediate_size": 10240,
+ "layer_norm_eps": 1e-05,
+ "max_position_embeddings": 2048,
+ "model_type": "phi",
+ "num_attention_heads": 32,
+ "num_hidden_layers": 32,
+ "num_key_value_heads": 32,
+ "partial_rotary_factor": 0.4,
+ "qk_layernorm": false,
  "quantization_config": {
-   "autoround_version": "0.1",
+   "autoround_version": "0.2.0.dev",
    "bits": 4,
    "damp_percent": 0.01,
    "desc_act": false,
    "enable_minmax_tuning": true,
+   "enable_quanted_input": true,
    "group_size": 128,
    "is_marlin_format": false,
    "iters": 1000,
@@ -37,15 +34,17 @@
    "model_file_base_name": "model",
    "model_name_or_path": null,
    "quant_method": "gptq",
+   "scale_dtype": "float16",
    "static_groups": false,
-   "sym": false,
-   "true_sequential": false,
-   "use_quant_input": false
+   "sym": true,
+   "true_sequential": false
  },
  "resid_pdrop": 0.1,
- "rotary_dim": 32,
+ "rope_scaling": null,
+ "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
- "transformers_version": "4.37.2",
+ "transformers_version": "4.40.2",
+ "use_cache": true,
  "vocab_size": 51200
  }
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:5ac1581429bd89dcbaf12f7dd2aeef327c71b3e37b1b15f9bd9343b63c16c3a6
- size 1836015136
+ oid sha256:14c3ff2501ea2449bd14376ffe6ed545a3e6b04c0e273686acfc6f9cbe14cf22
+ size 1836707656
quantize_config.json CHANGED
@@ -4,16 +4,17 @@
4
  "damp_percent": 0.01,
5
  "desc_act": false,
6
  "static_groups": false,
7
- "sym": false,
8
  "true_sequential": false,
9
  "model_name_or_path": null,
10
  "model_file_base_name": "model",
11
  "is_marlin_format": false,
12
  "quant_method": "intel/auto-round",
13
- "autoround_version": "0.1",
14
  "iters": 1000,
15
  "lr": 0.001,
16
  "minmax_lr": 0.001,
17
  "enable_minmax_tuning": true,
18
- "use_quant_input": false
 
19
  }
 
4
  "damp_percent": 0.01,
5
  "desc_act": false,
6
  "static_groups": false,
7
+ "sym": true,
8
  "true_sequential": false,
9
  "model_name_or_path": null,
10
  "model_file_base_name": "model",
11
  "is_marlin_format": false,
12
  "quant_method": "intel/auto-round",
13
+ "autoround_version": "0.2.0.dev",
14
  "iters": 1000,
15
  "lr": 0.001,
16
  "minmax_lr": 0.001,
17
  "enable_minmax_tuning": true,
18
+ "enable_quanted_input": true,
19
+ "scale_dtype": "float16"
20
  }
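The substantive change in this file is the switch from asym to sym quantization. A quick hedged check that a local copy of the repo carries the updated settings:

```python
# Hedged sketch: verify the fields this commit changes in quantize_config.json.
import json

with open("quantize_config.json") as f:
    cfg = json.load(f)

assert cfg["sym"] is True                  # was false before this commit
assert cfg["scale_dtype"] == "float16"     # newly recorded field
print(cfg["autoround_version"])            # "0.2.0.dev"
```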
tokenizer.json CHANGED
@@ -382,6 +382,7 @@
382
  "end_of_word_suffix": "",
383
  "fuse_unk": false,
384
  "byte_fallback": false,
 
385
  "vocab": {
386
  "!": 0,
387
  "\"": 1,
 
382
  "end_of_word_suffix": "",
383
  "fuse_unk": false,
384
  "byte_fallback": false,
385
+ "ignore_merges": false,
386
  "vocab": {
387
  "!": 0,
388
  "\"": 1,
tokenizer_config.json CHANGED
@@ -318,6 +318,7 @@
318
  "clean_up_tokenization_spaces": true,
319
  "eos_token": "<|endoftext|>",
320
  "model_max_length": 2048,
 
321
  "tokenizer_class": "CodeGenTokenizer",
322
  "unk_token": "<|endoftext|>"
323
  }
 
318
  "clean_up_tokenization_spaces": true,
319
  "eos_token": "<|endoftext|>",
320
  "model_max_length": 2048,
321
+ "return_token_type_ids": false,
322
  "tokenizer_class": "CodeGenTokenizer",
323
  "unk_token": "<|endoftext|>"
324
  }