TheBloke committed
Commit 6644757
1 Parent(s): 9480049

Update README.md

Files changed (1)
  1. README.md +12 -67
README.md CHANGED
@@ -45,11 +45,12 @@ This repo contains AWQ model files for [Mistral AI's Mistral 7B Instruct v0.1](h
 
 AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference.
 
- It is also now supported by the continuous batching server [vLLM](https://github.com/vllm-project/vllm), allowing the use of Llama AWQ models for high-throughput concurrent inference in multi-user server scenarios.
- 
- As of September 25th 2023, preliminary Llama-only AWQ support has also been added to [Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference).
- 
- Note that, at the time of writing, overall throughput is still lower than running vLLM or TGI with unquantised models; however, using AWQ enables the use of much smaller GPUs, which can lead to easier deployment and overall cost savings. For example, a 70B model can be run on 1 x 48GB GPU instead of 2 x 80GB.
+ ### Mistral AWQs
+ 
+ These are experimental first AWQs for the brand-new Mistral model architecture.
+ 
+ They will not work from vLLM or TGI. They can only be used from AutoAWQ, and they require installing both AutoAWQ and Transformers from GitHub. More details are below.
+ 
 <!-- description end -->
 <!-- repositories-available start -->
 ## Repositories available
@@ -83,74 +84,22 @@ Models are released as sharded safetensors files.
 
 <!-- README_AWQ.md-provided-files end -->
 
- <!-- README_AWQ.md-use-from-vllm start -->
- ## Serving this model from vLLM
- 
- Documentation on installing and using vLLM [can be found here](https://vllm.readthedocs.io/en/latest/).
- 
- When using vLLM as a server, pass the `--quantization awq` parameter, for example:
- 
- ```shell
- python3 -m vllm.entrypoints.api_server --model TheBloke/Mistral-7B-Instruct-v0.1-AWQ --quantization awq --dtype half
- ```
- 
- Note: at the time of writing, vLLM has not yet done a new release with support for the `quantization` parameter.
- 
- If you try the code below and get an error about `quantization` being unrecognised, please install vLLM from GitHub source.
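
For reference, a minimal sketch of a from-source vLLM install; exact build requirements depend on your CUDA and PyTorch setup:

```shell
# Build and install vLLM from the GitHub source tree
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip3 install -e .
```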
- 
- When using vLLM from Python code, pass the `quantization="awq"` parameter, for example:
- 
- ```python
- from vllm import LLM, SamplingParams
- 
- prompts = [
-     "Hello, my name is",
-     "The president of the United States is",
-     "The capital of France is",
-     "The future of AI is",
- ]
- sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
- 
- llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.1-AWQ", quantization="awq", dtype="half")
- 
- outputs = llm.generate(prompts, sampling_params)
- 
- # Print the outputs.
- for output in outputs:
-     prompt = output.prompt
-     generated_text = output.outputs[0].text
-     print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
- ```
- <!-- README_AWQ.md-use-from-vllm end -->
- 
 <!-- README_AWQ.md-use-from-python start -->
- ## Serving this model from TGI
- 
- TGI merged support for AWQ on September 25th, 2023. At the time of writing you need to use the `:latest` Docker container: `ghcr.io/huggingface/text-generation-inference:latest`
- 
- Add the parameter `--quantize awq` for AWQ support.
- 
- Example parameters:
- ```shell
- --model-id TheBloke/Mistral-7B-Instruct-v0.1-AWQ --port 3000 --quantize awq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
- ```
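
For reference, a minimal sketch of how those parameters could be passed to the TGI Docker container; the port mapping and volume path here are assumptions, not taken from the README:

```shell
docker run --gpus all --shm-size 1g -p 3000:3000 -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id TheBloke/Mistral-7B-Instruct-v0.1-AWQ --port 3000 --quantize awq \
    --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
```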
- 
 ## How to use this AWQ model from Python code
 
 ### Install the necessary packages
 
- Requires: [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) 0.0.2 or later
+ Requires:
+ 
+ - Transformers from [commit 72958fcd3c98a7afdc61f953aa58c544ebda2f79](https://github.com/huggingface/transformers/commit/72958fcd3c98a7afdc61f953aa58c544ebda2f79)
+ - [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) from [PR #79](https://github.com/casper-hansen/AutoAWQ/pull/79).
 
- ```shell
- pip3 install autoawq
- ```
- 
- If you have problems installing [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) using the pre-built wheels, install it from source instead:
- 
 ```shell
- pip3 uninstall -y autoawq
+ pip3 install git+https://github.com/huggingface/transformers.git@72958fcd3c98a7afdc61f953aa58c544ebda2f79
+ 
 git clone https://github.com/casper-hansen/AutoAWQ
 cd AutoAWQ
+ git checkout mistral
 pip3 install .
 ```
 
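As an optional, illustrative sanity check (not part of the README itself) that both pinned packages installed correctly:

```shell
# Illustrative check: both modules should import, and the Transformers version should reflect the pinned commit
python3 -c "import awq, transformers; print(transformers.__version__)"
```
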
@@ -160,7 +109,7 @@ pip3 install .
 from awq import AutoAWQForCausalLM
 from transformers import AutoTokenizer
 
- model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.1-AWQ"
+ model_name_or_path = "TheBloke/Mistral-7B-v0.1-AWQ"
 
 # Load model
 model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True,
@@ -168,7 +117,7 @@ model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True,
 tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)
 
 prompt = "Tell me about AI"
- prompt_template=f'''<s>[INST] {prompt} [/INST]
+ prompt_template=f'''{prompt}
 
 '''
 
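For reference, generation from the loaded model typically continues along these lines; the sampling settings below are illustrative assumptions rather than values from this README:

```python
# Tokenize the prompt and generate with the quantized model (illustrative settings)
tokens = tokenizer(prompt_template, return_tensors="pt").input_ids.cuda()

generation_output = model.generate(
    tokens,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512
)

print(tokenizer.decode(generation_output[0]))
```
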
@@ -220,10 +169,6 @@ print(pipe(prompt_template)[0]['generated_text'])
 The files provided are tested to work with:
 
 - [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)
- - [vLLM](https://github.com/vllm-project/vllm)
- - [Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference)
- 
- TGI merged AWQ support on September 25th, 2023: [TGI PR #1054](https://github.com/huggingface/text-generation-inference/pull/1054). Use the `:latest` Docker container until the next TGI release is made.
 
 <!-- README_AWQ.md-compatibility end -->