Update README.md

---
datasets:
- tiiuae/falcon-refinedweb
license: apache-2.0
language:
- en
inference: false
---

<!-- header start -->
<div style="width: 100%;">
<img src="https://i.imgur.com/EBdldam.jpg" alt="TheBlokeAI" style="width: 100%; min-width: 400px; display: block; margin: auto;">
</div>
<p><a href="https://www.patreon.com/TheBlokeAI">Want to contribute? TheBloke's Patreon page</a></p>
</div>
</div>
<!-- header end -->

# Falcon-40B-Instruct GPTQ

It is the result of quantising [Falcon-40B-Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct) to 4bit using [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).

Please note this is an experimental GPTQ model. Support for it is currently quite limited.

It is also expected to be **VERY SLOW**. This is currently unavoidable, but is being looked at.

## AutoGPTQ

AutoGPTQ is required: `pip install auto-gptq`

AutoGPTQ provides pre-compiled wheels for Windows and Linux, with CUDA toolkit 11.7 or 11.8.

If you are running CUDA toolkit 12.x, you will need to compile your own by following these instructions:

```
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip install .
```

These manual steps will require that you have the [Nvidia CUDA toolkit](https://developer.nvidia.com/cuda-12-0-1-download-archive) installed.
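
Whichever route you take, a quick way to confirm the install worked - a minimal sketch, assuming a CUDA-capable GPU is present - is to check that `auto_gptq` imports cleanly and PyTorch can see the GPU:

```
import torch
from auto_gptq import AutoGPTQForCausalLM  # raises ImportError if the build/install failed

print(torch.cuda.is_available())        # should print True
print(torch.cuda.get_device_name(0))    # name of the GPU AutoGPTQ will run on
```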

## text-generation-webui

There is provisional AutoGPTQ support in text-generation-webui.

This requires text-generation-webui as of commit 204731952ae59d79ea3805a425c73dd171d943c3.

In this repo you can see two `.py` files - these are the files that get executed when the model is loaded with `trust_remote_code=True`.
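As a concrete sketch (the model folder name below is illustrative - point `--model` at whichever directory under `models/` you placed the model in):

```
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
git checkout 204731952ae59d79ea3805a425c73dd171d943c3

python server.py --model falcon-40b-instruct-GPTQ --autogptq --trust_remote_code
```
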
## Simple Python example code

To run this code you need to install AutoGPTQ and einops:

```
pip install auto-gptq
pip install einops
```

Then run this example code (the model loading and generation calls after the tokenizer line are a sketch to make the fragment runnable - adjust parameters to taste):

```
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Download the model from HF and store it locally, then reference its location here:
quantized_model_dir = "/path/to/falcon40b-instruct-gptq"

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)

# use_triton=False: this model does not currently work with AutoGPTQ Triton (see "Provided files" below)
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", use_triton=False, use_safetensors=True, trust_remote_code=True)

prompt = "Tell me about AI"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids=input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
```
## Provided files

**gptq_model-4bit-64g.safetensors**

This will work with AutoGPTQ as of commit `3cb1bf5` (`3cb1bf5a6d43a06dc34c6442287965d1838303d3`).

It was created with groupsize 64 to give higher inference quality, and without `desc_act` (act-order) to increase inference speed.
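
For reference, those parameters correspond to an AutoGPTQ quantize config along these lines (a sketch using `BaseQuantizeConfig`; the `quantize_config.json` shipped in the repo is authoritative):

```
from auto_gptq import BaseQuantizeConfig

# 4-bit, groupsize 64, no act-order - the parameters described above
quantize_config = BaseQuantizeConfig(bits=4, group_size=64, desc_act=False)
```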

* `gptq_model-4bit-64g.safetensors`
  * Works only with latest AutoGPTQ CUDA, compiled from source as of commit `3cb1bf5`
  * At this time it does not work with AutoGPTQ Triton, but support will hopefully be added in time.
  * Works with text-generation-webui using `--autogptq --trust_remote_code`
  * Does not work with any version of GPTQ-for-LLaMa
  * Parameters: Groupsize = 64. No act-order.
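
If AutoGPTQ does not pick the file up automatically, you can name it explicitly via `model_basename` (the filename minus the `.safetensors` extension) - a sketch, reusing `quantized_model_dir` from the example above:

```
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    model_basename="gptq_model-4bit-64g",  # the provided file, without its extension
    use_safetensors=True,
    use_triton=False,
    device="cuda:0",
    trust_remote_code=True)
```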

<!-- footer start -->
## Discord

For further support, and discussions on these models and AI in general, join us at: [TheBloke AI's Discord server](https://discord.gg/UBgz4VXf)

Donaters will get priority support on any and all AI/LLM/model questions, plus other benefits.

* Patreon: https://patreon.com/TheBlokeAI
* Ko-Fi: https://ko-fi.com/TheBlokeAI

**Patreon special mentions**: Aemon Algiz; Johann-Peter Hartmann; Talal Aujan; Jonathan Leane; Illia Dulskyi; Khalefa Al-Ahmad; senxiiz; Sebastain Graf; Eugene Pentland; Nikolai Manek; Luke Pendergrass.

Thank you to all my generous patrons and donaters.
<!-- footer end -->

# ✨ Original model card: Falcon-40B-Instruct

# ✨ Falcon-40B-Instruct