---
license: cc-by-sa-3.0
language:
- en
tags:
- AWQ
inference: false
---

# VMware/open-llama-7B-v2-open-instruct

An instruction-tuned version of the fully trained OpenLLaMA 7B v2 model. The model is open for <b>COMMERCIAL USE</b>.

This is a 4-bit, 128-group-size AWQ-quantized version of that model. For more information about AWQ quantization, please refer to the [llm-awq repository](https://github.com/mit-han-lab/llm-awq).
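
As a rough back-of-the-envelope illustration (an assumption-laden estimate only: ~7B parameters, one fp16 scale and a packed zero point per 128-weight group, ignoring activations, the KV cache, and any layers left unquantized), the 4-bit weights come out to roughly 3.3 GiB plus a small amount of per-group metadata:

```python
# Rough weight-memory estimate for 4-bit, group-size-128 quantization of ~7B parameters.
# Illustrative arithmetic only; not measured from the actual checkpoint.
n_params = 7e9                                        # approximate parameter count
group_size = 128
weight_bytes = n_params * 4 / 8                       # packed 4-bit weights
meta_bytes = n_params / group_size * (16 + 4) / 8     # fp16 scale + ~4-bit zero point per group
print(f"~{weight_bytes / 2**30:.2f} GiB weights + ~{meta_bytes / 2**30:.2f} GiB group metadata")
```
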
## Model Date

July 12, 2023

## Model License

Please refer to the original OpenLLaMA model license ([link](https://huggingface.co/VMware/open-llama-7b-v2-open-instruct)).

Please also refer to the AWQ quantization license ([link](https://github.com/mit-han-lab/llm-awq/blob/main/LICENSE)).
## CUDA Version

This model was successfully tested on CUDA driver v530.30.02 and runtime v11.7 with Python v3.10.11. Please note that AWQ requires NVIDIA GPUs with a compute capability of 8.0 or higher.

For Docker users, the `nvcr.io/nvidia/pytorch:23.06-py3` image ships CUDA runtime v12.1 but otherwise matches the configuration above, and it has also been verified to work.
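
A quick way to confirm that the visible GPU meets this requirement is a minimal sketch using PyTorch's CUDA introspection:

```python
# Sanity check: AWQ's fused CUDA kernels require compute capability 8.0+
# (e.g., A100, or RTX 30xx/40xx series cards).
import torch

assert torch.cuda.is_available(), "No CUDA device is visible to PyTorch"
major, minor = torch.cuda.get_device_capability()
print(f"{torch.cuda.get_device_name()}: compute capability {major}.{minor}")
assert (major, minor) >= (8, 0), "AWQ kernels need compute capability 8.0 or higher"
```
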
## How to Use

```bash
git clone https://github.com/mit-han-lab/llm-awq \
  && cd llm-awq \
  && git checkout 71d8e68df78de6c0c817b029a568c064bf22132d \
  && pip install -e . \
  && cd awq/kernels \
  && python setup.py install
```
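
After the install, a quick import check confirms that both the Python package and the compiled CUDA extension are available. This is a sketch that assumes the extension module built by `awq/kernels/setup.py` is named `awq_inference_engine`; if your checkout differs, adjust accordingly:

```python
# Verify the llm-awq install: the Python package and the compiled CUDA kernels.
# If either import fails, re-run the `pip install -e .` / `python setup.py install` steps above.
import awq_inference_engine  # noqa: F401  (CUDA extension built by awq/kernels/setup.py)
from awq.quantize.quantizer import real_quantize_model_weight  # noqa: F401

print("llm-awq and its CUDA kernels are importable")
```

The quantized checkpoint can then be downloaded and run as follows:
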
```python
import torch
from awq.quantize.quantizer import real_quantize_model_weight
from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from huggingface_hub import snapshot_download

model_name = "VMware/open-llama-7b-v2-open-instruct"

# Config
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)

# Tokenizer (loaded from the original model repository)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Model
w_bit = 4
q_config = {
    "zero_point": True,
    "q_group_size": 128,
}

# Download the quantized checkpoint
load_quant = snapshot_download('abhinavkulkarni/VMware-open-llama-7b-v2-open-instruct-w4-g128-awq')

# Build the model skeleton without allocating weights, swap in empty quantized
# linear layers, then load the AWQ checkpoint and dispatch it across the GPU(s)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config=config,
                                             torch_dtype=torch.float16, trust_remote_code=True)

real_quantize_model_weight(model, w_bit=w_bit, q_config=q_config, init_only=True)

model = load_checkpoint_and_dispatch(model, load_quant, device_map="balanced")

# Inference
prompt = f'''What is the difference between nuclear fusion and fission?
###Response:'''

input_ids = tokenizer(prompt, return_tensors='pt').input_ids.cuda()
output = model.generate(
    inputs=input_ids,
    temperature=0.7,
    max_new_tokens=512,
    top_p=0.15,
    top_k=0,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
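
For interactive use, the same generation call can stream tokens as they are produced. This is an optional variation (it reuses the `model`, `tokenizer`, and `input_ids` defined in the snippet above):

```python
# Optional: stream generated tokens to stdout instead of waiting for the full output.
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(
    inputs=input_ids,
    streamer=streamer,        # prints tokens as they are generated
    temperature=0.7,
    max_new_tokens=512,
    top_p=0.15,
    top_k=0,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.eos_token_id,
)
```
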
## Evaluation

This evaluation was done using [LM-Eval](https://github.com/EleutherAI/lm-evaluation-harness).

[Open-LLaMA-7B-v2-Instruct](https://huggingface.co/VMware/open-llama-7b-v2-open-instruct)

| Task     | Version | Metric          |   Value | Stderr |
|----------|--------:|-----------------|--------:|--------|
| wikitext |       1 | word_perplexity | 16.6822 |        |
|          |         | byte_perplexity |  1.6927 |        |
|          |         | bits_per_byte   |  0.7593 |        |

[Open-LLaMA-7B-v2-Instruct (4-bit 128-group AWQ)](https://huggingface.co/abhinavkulkarni/VMware-open-llama-7b-v2-open-instruct-w4-g128-awq)

| Task     | Version | Metric          |   Value | Stderr |
|----------|--------:|-----------------|--------:|--------|
| wikitext |       1 | word_perplexity | 17.1546 |        |
|          |         | byte_perplexity |  1.7015 |        |
|          |         | bits_per_byte   |  0.7668 |        |
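
As a rough cross-check, numbers in the same ballpark can be computed directly with the model loaded in the snippet above. This is a minimal sketch (fixed 2048-token windows over WikiText-2, no sliding-window overlap), not an exact reproduction of the LM-Eval protocol:

```python
# Approximate wikitext perplexity for the model loaded above.
# Not identical to LM-Eval's evaluation protocol, but should be in the same range.
import math
import torch
from datasets import load_dataset

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
enc = tokenizer(text, return_tensors="pt").input_ids

window = 2048
nll_sum, n_pred = 0.0, 0
for i in range(0, enc.size(1) - 1, window):
    ids = enc[:, i : i + window]
    if ids.size(1) < 2:
        break
    ids = ids.cuda()
    with torch.no_grad():
        out = model(ids, labels=ids)   # out.loss = mean NLL over predicted tokens
    n = ids.size(1) - 1
    nll_sum += out.loss.item() * n
    n_pred += n

print(f"token perplexity: {math.exp(nll_sum / n_pred):.4f}")
# Word/byte perplexity renormalize the same total NLL by word / UTF-8 byte counts,
# and bits_per_byte is log2 of the byte perplexity.
print(f"word perplexity:  {math.exp(nll_sum / len(text.split())):.4f}")
print(f"byte perplexity:  {math.exp(nll_sum / len(text.encode('utf-8'))):.4f}")
print(f"bits per byte:    {nll_sum / len(text.encode('utf-8')) / math.log(2):.4f}")
```
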
## Acknowledgements

If you found OpenLLaMA useful in your research or applications, please cite it using the following BibTeX:

```
@software{openlm2023openllama,
  author = {Geng, Xinyang and Liu, Hao},
  title = {OpenLLaMA: An Open Reproduction of LLaMA},
  month = May,
  year = 2023,
  url = {https://github.com/openlm-research/open_llama}
}
```
```
@software{together2023redpajama,
  author = {Together Computer},
  title = {RedPajama-Data: An Open Source Recipe to Reproduce LLaMA training dataset},
  month = April,
  year = 2023,
  url = {https://github.com/togethercomputer/RedPajama-Data}
}
```
```
@article{touvron2023llama,
  title={Llama: Open and efficient foundation language models},
  author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and others},
  journal={arXiv preprint arXiv:2302.13971},
  year={2023}
}
```

The model was quantized with the AWQ technique. If you find AWQ useful or relevant to your research, please kindly cite the paper:

```
@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
```