Update README.md
README.md (CHANGED)
@@ -21,19 +21,25 @@ inference: false
 </div>
 <!-- header end -->

-# Falcon-40B-Instruct
+# Falcon-40B-Instruct 4bit GPTQ

-This repo contains an experimantal GPTQ
+This repo contains an experimental GPTQ 4bit model for [Falcon-40B-Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct).

 It is the result of quantising to 4bit using [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).

+## Repositories available
+
+* [4-bit GPTQ model for GPU inference](https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ)
+* [3-bit GPTQ model for GPU inference](https://huggingface.co/TheBloke/falcon-40b-instruct-3bit-GPTQ)
+* [Unquantised bf16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/tiiuae/falcon-40b-instruct)
+
 ## EXPERIMENTAL

 Please note this is an experimental GPTQ model. Support for it is currently quite limited.

 It is also expected to be **VERY SLOW**. This is currently unavoidable, but is being looked at.

-This is
+This 4bit model requires at least 35GB of VRAM to load. It can be used on 40GB or 48GB cards, but not less.

 Please be aware that you should currently expect around 0.7 tokens/s on 40B Falcon GPTQ.

@@ -65,11 +71,11 @@ So please first update text-generation-webui to the latest version.

 1. Launch text-generation-webui with the following command-line arguments: `--autogptq --trust-remote-code`
 2. Click the **Model tab**.
-3. Under **Download custom model or LoRA**, enter `TheBloke/falcon-40B-instruct-
+3. Under **Download custom model or LoRA**, enter `TheBloke/falcon-40B-instruct-GPTQ`.
 4. Click **Download**.
 5. Wait until it says it's finished downloading.
 6. Click the **Refresh** icon next to **Model** in the top left.
-7. In the **Model drop-down**: choose the model you just downloaded, `falcon-40B-instruct-
+7. In the **Model drop-down**: choose the model you just downloaded, `falcon-40B-instruct-GPTQ`.
 8. Once it says it's loaded, click the **Text Generation tab** and enter a prompt!

 ## About `trust_remote_code`
@@ -95,7 +101,7 @@ from transformers import AutoTokenizer
 from auto_gptq import AutoGPTQForCausalLM

 # Download the model from HF and store it locally, then reference its location here:
-quantized_model_dir = "/path/to/falcon40b-instruct-
+quantized_model_dir = "/path/to/falcon40b-instruct-GPTQ"

 from transformers import AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)
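The README's Python example is truncated to the hunk shown; the `print(tokenizer.decode(output[0]))` context in the following hunk suggests it goes on to load the model with AutoGPTQ and generate from a prompt. Below is a minimal sketch of how that full flow might look, assuming AutoGPTQ's `from_quantized()` API as of roughly commit `3cb1bf5`; the `model_basename`, device string and prompt are illustrative assumptions rather than text taken from this diff.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Local path to the downloaded repo, as in the README's example
quantized_model_dir = "/path/to/falcon40b-instruct-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)

# Assumption: the no-groupsize safetensors file is loaded via from_quantized().
# Falcon requires trust_remote_code, and Triton is noted as unsupported for this model.
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    model_basename="gptq_model-4bit--1g",  # filename from "Provided files", without extension
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,
    trust_remote_code=True,
)

prompt = "Write a story about llamas"  # illustrative prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids=input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0]))
```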
@@ -112,13 +118,13 @@ print(tokenizer.decode(output[0]))

 ## Provided files

-**gptq_model-
+**gptq_model-4bit--1g.safetensors**

 This will work with AutoGPTQ as of commit `3cb1bf5` (`3cb1bf5a6d43a06dc34c6442287965d1838303d3`)

 It was created without groupsize to reduce VRAM requirements, and with `desc_act` (act-order) to improve inference quality.

-* `gptq_model-
+* `gptq_model-4bit--1g.safetensors`
   * Works only with latest AutoGPTQ CUDA, compiled from source as of commit `3cb1bf5`
   * At this time it does not work with AutoGPTQ Triton, but support will hopefully be added in time.
   * Works with text-generation-webui using `--autogptq --trust_remote_code`
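For reference, the "no groupsize, with `desc_act`" combination described in the last hunk maps onto AutoGPTQ's quantisation settings roughly as sketched below; this is an illustrative reconstruction, not the repo's actual `quantize_config.json`.

```python
from auto_gptq import BaseQuantizeConfig

# Assumed settings matching the description above: 4-bit, no groupsize
# (group_size=-1, hence "--1g" in the filename) and act-order enabled.
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=-1,   # no groupsize, to keep VRAM requirements down
    desc_act=True,   # act-order, for better inference quality
)
```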