Initial GPTQ model commit
README.md

Many thanks to William Beauchamp from [Chai](https://chai-research.com/) for providing the hardware for these quantisations!

## ExLlama support for 70B is here!

As of [this commit](https://github.com/turboderp/exllama/commit/b3aea521859b83cfd889c4c00c05a323313b7fee), ExLlama has support for Llama 2 70B models.

Please make sure you update ExLlama to the latest version. If you are a text-generation-webui one-click user, you must first uninstall the ExLlama wheel, then clone ExLlama into `text-generation-webui/repositories`; full instructions are below.

Now that we have ExLlama, that is the recommended loader to use for these models, as performance should be better than with AutoGPTQ and GPTQ-for-LLaMa, and you will be able to use the higher accuracy models, eg 128g + Act-Order.

Reminder: ExLlama does not support 3-bit models, so if you wish to try those quants, you will need to use AutoGPTQ or GPTQ-for-LLaMa.

## AutoGPTQ and GPTQ-for-LLaMa require the latest version of Transformers

If you plan to use any of these quants with AutoGPTQ or GPTQ-for-LLaMa, you will need to update Transformers to the latest GitHub code:

```
pip3 install git+https://github.com/huggingface/transformers
```

If using a UI like text-generation-webui, make sure to do this in the Python environment of text-generation-webui.

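To confirm that the environment picked up the Git build, a quick check from the same Python environment (a sketch; the exact development version string will vary):

```
import transformers

# After installing from the GitHub main branch this should report a
# development version (for example one ending in ".dev0") newer than
# the latest PyPI release.
print(transformers.__version__)
```
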
## Repositories available

* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ)
* [Original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/TheBloke/Llama-2-70B-chat-fp16)

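If you prefer to fetch the GPTQ files programmatically rather than through a UI or Git, a minimal sketch using `huggingface_hub` (the `local_dir` argument and the target folder name are assumptions, not taken from this model card):

```
from huggingface_hub import snapshot_download

# Download one branch of the GPTQ repo into a local folder;
# revision selects the quant branch (see the Provided files table below).
snapshot_download(
    repo_id="TheBloke/Llama-2-70B-chat-GPTQ",
    revision="main",
    local_dir="Llama-2-70B-chat-GPTQ",
)
```
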
## Prompt template: Llama-2-Chat

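As a rough Python sketch of how a Llama-2-Chat style prompt is typically assembled (the system message and user prompt below are illustrative placeholders, not text from this model card):

```
# Standard Llama-2-Chat wrapping: a system block inside <<SYS>> tags,
# followed by the user prompt, all within a single [INST] ... [/INST] turn.
system_message = "You are a helpful, respectful and honest assistant."
prompt = "Tell me about AI."

prompt_template = f"""[INST] <<SYS>>
{system_message}
<</SYS>>
{prompt} [/INST]"""

print(prompt_template)
```
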
## Provided files

Each separate quant is in a different branch. See below for instructions on fetching from different branches.

| Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description |
| ------ | ---- | ---------- | -------------------- | --------- | ------------------- | --------- | ----------- |
| main | 4 | 128 | False | 35.33 GB | False | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
| gptq-4bit-32g-actorder_True | 4 | 32 | True | 40.66 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
| gptq-4bit-64g-actorder_True | 4 | 64 | True | 37.99 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
| gptq-4bit-128g-actorder_True | 4 | 128 | True | 36.65 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |

- With Git, you can clone a branch with:
```
git clone --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ
```
- In Python Transformers code, the branch is the `revision` parameter; see below.

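The full Python example appears further down this README; as a hedged sketch of passing `revision` when loading with AutoGPTQ (this assumes an auto-gptq version that accepts `revision` in `from_quantized` and can locate the quantised weights file without an explicit `model_basename`):

```
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/Llama-2-70B-chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# revision selects one of the quant branches from the Provided files table;
# omit it (or use "main") for the default 4-bit, 128g, no-act-order files.
model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    revision="gptq-4bit-128g-actorder_True",
    use_safetensors=True,
    inject_fused_attention=False,  # currently needed for Llama 2 70B
    device="cuda:0",
)
```
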
### How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui)

Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui), which includes support for Llama 2 models.

It is strongly recommended to use the text-generation-webui one-click-installers unless you know how to make a manual install.

### Use ExLlama (4-bit models only) - recommended option if you have enough VRAM for 4-bit

ExLlama has now been updated to support Llama 2 70B, but you will need to update ExLlama to the latest version.

By default text-generation-webui installs a pre-compiled wheel for ExLlama. Until text-generation-webui updates to reflect the ExLlama changes - which hopefully won't be long - you must uninstall that wheel and then clone ExLlama into the `text-generation-webui/repositories` directory. ExLlama will then compile its kernel on model load.

Note that this requires that your system is capable of compiling CUDA extensions, which may be an issue on Windows.

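A quick way to sanity-check this from Python (a sketch; it only verifies that PyTorch can see a CUDA toolkit, which the ExLlama kernel build needs alongside a working C++ compiler):

```
import torch
from torch.utils.cpp_extension import CUDA_HOME

print(torch.version.cuda)  # CUDA version this PyTorch build was compiled against
print(CUDA_HOME)           # if this is None, CUDA extensions cannot be compiled
```
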
Instructions for the Linux one-click installer:

1. Change directory into the text-generation-webui main folder: `cd /path/to/text-generation-webui`
2. Activate the conda env of text-generation-webui:
```
source "installer_files/conda/etc/profile.d/conda.sh"
conda activate installer_files/env
```
3. Run: `pip3 uninstall exllama`
4. Run: `cd repositories/exllama` followed by `git pull` to update exllama.
5. Now launch text-generation-webui and follow the instructions below for downloading and running the model. ExLlama should build its kernel when the model first loads.

### Downloading and running the model in text-generation-webui

1. Click the **Model tab**.
2. Under **Download custom model or LoRA**, enter `TheBloke/Llama-2-70B-chat-GPTQ`.
   - see Provided Files above for the list of branches for each option.
3. Click **Download**.
4. The model will start downloading. Once it's finished it will say "Done"
5. Set Loader to ExLlama if you plan to use a 4-bit file, or else choose AutoGPTQ or GPTQ-for-LLaMA.
   - If you use AutoGPTQ, make sure "No inject fused attention" is ticked
6. In the top left, click the refresh icon next to **Model**.
7. In the **Model** dropdown, choose the model you just downloaded: `TheBloke/Llama-2-70B-chat-GPTQ`

The files provided will work with AutoGPTQ (CUDA and Triton modes), GPTQ-for-LLaMa (only CUDA has been tested), and Occ4m's GPTQ-for-LLaMa fork.

ExLlama is now compatible with Llama 2 70B models, as of [this commit](https://github.com/turboderp/exllama/commit/b3aea521859b83cfd889c4c00c05a323313b7fee).

Please see the Provided Files table above for per-file compatibility.

<!-- footer start -->
## Discord