---
license: cc
language:
- en
tags:
- AWQ
inference: false
---

# VMware/open-llama-7B-open-instruct (4-bit 128g AWQ Quantized)
[Instruction-tuned version](https://huggingface.co/VMware/open-llama-7b-open-instruct) of the fully trained [Open LLaMA 7B](https://huggingface.co/openlm-research/open_llama_7b) model.

This model is a 4-bit, group size 128 AWQ-quantized model. For more information about AWQ quantization, please see the [llm-awq](https://github.com/mit-han-lab/llm-awq) repository.

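Concretely, "4-bit, group size 128" means each group of 128 consecutive weights shares one floating-point scale and zero point, and each weight is stored as a 4-bit integer. The snippet below is only an illustrative sketch of that storage format, not the AWQ algorithm itself (AWQ additionally rescales weights using activation statistics before quantizing):

```python
import torch

# Illustrative sketch of per-group, zero-point (asymmetric) 4-bit quantization,
# matching the w_bit=4, q_group_size=128, zero_point=True settings used below.
# This is NOT the AWQ algorithm itself.
w_bit, group_size = 4, 128
levels = 2 ** w_bit - 1                                # integer codes 0..15

w = torch.randn(4096)                                  # one row of a weight matrix
groups = w.view(-1, group_size)                        # 32 groups of 128 weights each

w_min = groups.amin(dim=1, keepdim=True)
w_max = groups.amax(dim=1, keepdim=True)
scale = (w_max - w_min).clamp(min=1e-5) / levels       # one scale per group
zero = (-w_min / scale).round()                        # one zero point per group

q = (groups / scale + zero).round().clamp(0, levels)   # 4-bit integer codes
w_hat = (q - zero) * scale                             # dequantized approximation

print((groups - w_hat).abs().max())                    # worst-case reconstruction error
```
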
## Model Date

July 5, 2023

## Model License

Please refer to the original VMware/open-llama-7b-open-instruct model license ([link](https://huggingface.co/VMware/open-llama-7b-open-instruct)).

Please refer to the AWQ quantization license ([link](https://github.com/mit-han-lab/llm-awq/blob/main/LICENSE)).

## CUDA Version

This model was successfully tested on CUDA driver v12.1 and toolkit v11.7 with Python v3.10.11.

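To confirm that your local setup roughly matches the tested one, a quick check along these lines may help (it uses only standard Python and PyTorch attributes):

```python
import sys

import torch

# Report the Python version, the CUDA toolkit PyTorch was built against,
# and whether a CUDA device is visible.
print(sys.version)                # tested with Python 3.10.11
print(torch.version.cuda)         # CUDA toolkit version (tested with 11.7)
print(torch.cuda.is_available())  # should be True for GPU inference
```
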
## How to Use

```bash
# Install llm-awq from source at the pinned commit
git clone https://github.com/mit-han-lab/llm-awq \
&& cd llm-awq \
&& git checkout 71d8e68df78de6c0c817b029a568c064bf22132d \
&& pip install -e .
```
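
After the editable install, a quick import check (using the same module the example below relies on) confirms the package is visible to Python:

```python
# Sanity check: the quantization helper used in the usage example should import cleanly.
from awq.quantize.quantizer import real_quantize_model_weight

print(real_quantize_model_weight.__module__)
```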

```python
import torch
from awq.quantize.quantizer import real_quantize_model_weight
from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from huggingface_hub import hf_hub_download

model_name = "VMware/open-llama-7b-open-instruct"

# Config
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)

# Tokenizer (LlamaConfig has no tokenizer_name attribute, so load it from the model repo)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Quantization settings: 4-bit weights, group size 128, with zero points
w_bit = 4
q_config = {
    "zero_point": True,
    "q_group_size": 128,
}

# Download the pre-quantized checkpoint from the Hub
load_quant = hf_hub_download('abhinavkulkarni/open-llama-7b-open-instruct-w4-g128-awq', 'pytorch_model.bin')

# Instantiate the model without materializing full-precision weights
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(model_name, config=config,
                                                 torch_dtype=torch.float16, trust_remote_code=True)

# Swap Linear layers for their 4-bit quantized counterparts (weights are loaded next)
real_quantize_model_weight(model, w_bit=w_bit, q_config=q_config, init_only=True)

# Load the quantized weights and dispatch the layers across available devices
model = load_checkpoint_and_dispatch(model, load_quant, device_map="balanced")

# Inference
prompt = f'''What is the difference between nuclear fusion and fission?
###Response:'''

input_ids = tokenizer(prompt, return_tensors='pt').input_ids.cuda()
output = model.generate(
    inputs=input_ids,
    do_sample=True,  # temperature/top_p/top_k below only take effect when sampling
    temperature=0.7,
    max_new_tokens=512,
    top_p=0.15,
    top_k=0,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(output[0]))
```
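
The bare prompt above works, but the upstream VMware/open-llama-7b-open-instruct card describes an Alpaca-style instruction template; if responses look weak, a fuller prompt along these lines may help (the exact wording here is an assumption, so check the upstream model card):

```python
# Alpaca-style template (assumed from the upstream instruction-tuned model card;
# verify the exact wording there before relying on it).
instruction = "What is the difference between nuclear fusion and fission?"
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    f"### Instruction:\n{instruction}\n\n### Response:"
)
# Tokenize and pass to model.generate exactly as in the example above.
```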

## Evaluation

This evaluation was done using [LM-Eval](https://github.com/EleutherAI/lm-evaluation-harness).

[Open-LLaMA-7B-Instruct](https://huggingface.co/VMware/open-llama-7b-open-instruct)

| Task    |Version| Metric        | Value |   |Stderr|
|---------|------:|---------------|------:|---|------|
|wikitext |      1|word_perplexity|11.7531|   |      |
|         |       |byte_perplexity| 1.5853|   |      |
|         |       |bits_per_byte  | 0.6648|   |      |

[Open-LLaMA-7B-Instruct (4-bit 128-group AWQ)](https://huggingface.co/abhinavkulkarni/open-llama-7b-open-instruct-w4-g128-awq)

| Task    |Version| Metric        | Value |   |Stderr|
|---------|------:|---------------|------:|---|------|
|wikitext |      1|word_perplexity|12.1840|   |      |
|         |       |byte_perplexity| 1.5961|   |      |
|         |       |bits_per_byte  | 0.6745|   |      |
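
For reference, LM-Eval's byte-level metrics are related by byte_perplexity = 2 ** bits_per_byte; the figures in both tables are consistent with this to within rounding:

```python
# For LM-Eval's wikitext metrics, byte_perplexity = 2 ** bits_per_byte.
# Check the reported pairs from the two tables above.
for bits_per_byte, reported in [(0.6648, 1.5853), (0.6745, 1.5961)]:
    print(2 ** bits_per_byte, reported)  # ~1.5853 vs 1.5853, ~1.5960 vs 1.5961
```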

## Acknowledgements

If you found OpenLLaMA useful in your research or applications, please cite using the following BibTeX:
```
@software{openlm2023openllama,
  author = {Geng, Xinyang and Liu, Hao},
  title = {OpenLLaMA: An Open Reproduction of LLaMA},
  month = May,
  year = 2023,
  url = {https://github.com/openlm-research/open_llama}
}
```
```
@software{together2023redpajama,
  author = {Together Computer},
  title = {RedPajama-Data: An Open Source Recipe to Reproduce LLaMA training dataset},
  month = April,
  year = 2023,
  url = {https://github.com/togethercomputer/RedPajama-Data}
}
```
```
@article{touvron2023llama,
  title={Llama: Open and efficient foundation language models},
  author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and others},
  journal={arXiv preprint arXiv:2302.13971},
  year={2023}
}
```

The model was quantized with the AWQ technique. If you find AWQ useful or relevant to your research, please kindly cite the paper:

```
@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
```