TheBloke committed
Commit
f00947a
1 Parent(s): c0384a7

Upload new k-quant GGML quantised models.

Files changed (1): README.md (+66 -56)
README.md CHANGED
@@ -1,8 +1,6 @@
  ---
  inference: false
  license: other
- datasets:
- - jondurbin/airoboros-gpt4
  ---

  <!-- header start -->
@@ -19,9 +17,9 @@ datasets:
  </div>
  <!-- header end -->

- # Jon Durbin's Airoboros 7b GPT4 GGML

- These files are GGML format model files for [Jon Durbin's Airoboros 7b GPT4](https://huggingface.co/jondurbin/airoboros-7b-gpt4).

  GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/ggerganov/llama.cpp) and libraries and UIs which support this format, such as:
  * [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
@@ -32,45 +30,55 @@ GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/gger

  ## Repositories available

- * [4-bit GPTQ models for GPU inference](https://huggingface.co/TheBloke/airoboros-7b-gpt4-GPTQ)
- * [4-bit, 5-bit, and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/airoboros-7b-gpt4-GGML)
  * [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/TheBloke/airoboros-7b-gpt4-fp16)

- ### Prompt template

- This uses Vicuna 1.1 format. Example:

- ```
- USER: prompt
- ASSISTANT:
- ```
-
- ## Context length with GGML

- The base Airoboros GPT4 models have an increased context length of 4096.

- However this GGML conversion appears to still have the default 2048 context.

- I have experimented with llama.cpp's `-n 4096` parameter to specify a context of 4096 but it so far always results in gibberish output.

- I will investigate this further and upload a correct model if this proves necessary.

- For now, please assume this GGML to have a context of 2048.

- ## THE FILES IN MAIN BRANCH REQUIRES LATEST LLAMA.CPP (May 19th 2023 - commit 2d5db48)!

- llama.cpp recently made another breaking change to its quantisation methods - https://github.com/ggerganov/llama.cpp/pull/1508
-
- I have quantised the GGML files in this repo with the latest version. Therefore you will require llama.cpp compiled on May 19th or later (commit `2d5db48` or later) to use them.

  ## Provided files
  | Name | Quant method | Bits | Size | Max RAM required | Use case |
  | ---- | ---- | ---- | ---- | ---- | ----- |
- | airoboros-7b-gpt4.ggmlv3.q4_0.bin | q4_0 | 4 | 3.79 GB | 6.29 GB | 4-bit. |
- | airoboros-7b-gpt4.ggmlv3.q4_1.bin | q4_1 | 4 | 4.21 GB | 6.71 GB | 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
- | airoboros-7b-gpt4.ggmlv3.q5_0.bin | q5_0 | 5 | 4.63 GB | 7.13 GB | 5-bit. Higher accuracy, higher resource usage and slower inference. |
- | airoboros-7b-gpt4.ggmlv3.q5_1.bin | q5_1 | 5 | 5.06 GB | 7.56 GB | 5-bit. Even higher accuracy, resource usage and slower inference. |
- | airoboros-7b-gpt4.ggmlv3.q8_0.bin | q8_0 | 8 | 7.16 GB | 9.66 GB | 8-bit. Almost indistinguishable from float16. Huge resource use and slow. Not recommended for normal use. |

  **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
@@ -80,7 +88,7 @@ I have quantised the GGML files in this repo with the latest version. Therefore
  I use the following command line; adjust for your tastes and needs:

  ```
- ./main -t 10 -ngl 32 -m airoboros-7b-gpt4.ggmlv3.q5_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "USER: Write a story about llamas\nASSISTANT:"
  ```
  Change `-t 10` to the number of physical CPU cores you have. For example if your system has 8 cores/16 threads, use `-t 8`.

@@ -112,19 +120,21 @@ Donaters will get priority support on any and all AI/LLM/model questions and req
  * Patreon: https://patreon.com/TheBlokeAI
  * Ko-Fi: https://ko-fi.com/TheBlokeAI

- **Patreon special mentions**: Aemon Algiz, Dmitriy Samsonov, Nathan LeClaire, Trenton Dambrowitz, Mano Prime, David Flickinger, vamX, Nikolai Manek, senxiiz, Khalefa Al-Ahmad, Illia Dulskyi, Jonathan Leane, Talal Aujan, V. Lukas, Joseph William Delisle, Pyrater, Oscar Rangel, Lone Striker, Luke Pendergrass, Eugene Pentland, Sebastain Graf, Johann-Peter Hartman.

  Thank you to all my generous patrons and donaters!
  <!-- footer end -->

- # Original model card: Jon Durbin's Airoboros 7b GPT4

  ## Overview

  This is a fine-tuned 7b parameter LLaMA model, using completely synthetic training data created by gpt4 via https://github.com/jondurbin/airoboros

- The context size has been increased to 4096.
-
  The dataset used to fine-tune this model is available [here](https://huggingface.co/datasets/jondurbin/airoboros-gpt4), with a specific focus on:
  - trivia
  - math/reasoning (although it still sucks)
@@ -253,7 +263,7 @@ Once exploited for logging and mining, today the sparsely populated Upper Penins
  The Lower Peninsula is a hub for manufacturing, forestry, agriculture, services, and high-tech industry.
  ENDINPUT
  BEGININSTRUCTION
- What is Michigan's capital?
  Give me a JSON object with 3 facts from Foolo's article.
  What are the states that border Michigan?
  Reference?
@@ -499,12 +509,12 @@ void insert(char* key, char* value) {
  perror("malloc failed");
  exit(1);
  }
-
  strcpy(parent->key, key);
  strcpy(parent->value, value);
-
  parent->left = parent->right = NULL;
-
  if (!tree) {
  tree = parent;
  } else {
@@ -520,7 +530,7 @@ char* search(char* key) {
  }
  current = current->right;
  }
-
  return NULL;
  }

@@ -535,17 +545,17 @@ void print_tree() {
  void* accept_client(void* args) {
  int client_socket = *(int*)args;
  char buffer[1024];
-
  while (1) {
  ssize_t bytes_received = recv(client_socket, buffer, sizeof(buffer), 0);
  if (bytes_received <= 0) {
  close(client_socket);
  break;
  }
-
  char* input = buffer;
  char* input_token;
-
  while ((input_token = strtok(input, " \n")) != NULL) {
  switch (input_token[0]) {
  case 'G':
@@ -555,7 +565,7 @@ void* accept_client(void* args) {
  }
  printf("%s = %s\n", input_token, search(input_token));
  break;
-
  case 'S':
  if (!strcmp(input_token, "SET")) {
  char* key = strtok(NULL, " \n");
@@ -564,7 +574,7 @@ void* accept_client(void* args) {
  perror("Invalid input");
  exit(1);
  }
-
  insert(key, value);
  printf("%s set %s to %s\n", input_token, key, value);
  } else {
@@ -572,16 +582,16 @@ void* accept_client(void* args) {
  exit(1);
  }
  break;
-
  default:
  perror("Invalid input");
  exit(1);
  }
-
  input = strtok(NULL, " \n");
  }
  }
-
  return NULL;
  }

@@ -591,50 +601,50 @@ int main() {
  perror("socket failed");
  exit(1);
  }
-
  struct sockaddr_in server_addr;
  memset(&server_addr, 0, sizeof(server_addr));
  server_addr.sin_family = AF_INET;
  server_addr.sin_port = htons(8080);
  server_addr.sin_addr.s_addr = INADDR_ANY;
-
  if (bind(server_socket, (struct sockaddr*)&server_addr, sizeof(server_addr)) < 0) {
  perror("bind failed");
  exit(1);
  }
-
  if (listen(server_socket, 5) < 0) {
  perror("listen failed");
  exit(1);
  }
-
  pthread_t accept_thread;
  pthread_create(&accept_thread, NULL, accept_client, &server_socket);
-
  char* client_input;
  int client_socket = accept(server_socket, (struct sockaddr*)NULL, NULL);
  if (client_socket < 0) {
  perror("accept failed");
  exit(1);
  }
-
  while (1) {
  sleep(1);
-
  char buffer[1024];
  ssize_t bytes_received = recv(client_socket, buffer, sizeof(buffer), 0);
  if (bytes_received <= 0) {
  close(client_socket);
  break;
  }
-
  client_input = buffer;
  parse_input(client_input);
  }
-
  close(client_socket);
  pthread_join(accept_thread, NULL);
-
  return 0;
  }
  ```
 
  ---
  inference: false
  license: other
  ---

  <!-- header start -->

  </div>
  <!-- header end -->

+ # Jon Durbin's Airoboros 7B GPT4 GGML

+ These files are GGML format model files for [Jon Durbin's Airoboros 7B GPT4](https://huggingface.co/jondurbin/airoboros-7b-gpt4).

  GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/ggerganov/llama.cpp) and libraries and UIs which support this format, such as:
  * [text-generation-webui](https://github.com/oobabooga/text-generation-webui)

  ## Repositories available

+ * [4-bit GPTQ models for GPU inference](https://huggingface.co/TheBloke/airoboros-7B-gpt4-GPTQ)
+ * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/airoboros-7B-gpt4-GGML)
  * [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/TheBloke/airoboros-7b-gpt4-fp16)

+ <!-- compatibility_ggml start -->
+ ## Compatibility

+ ### Original llama.cpp quant methods: `q4_0, q4_1, q5_0, q5_1, q8_0`

+ I have quantised these 'original' quant method files using an older version of llama.cpp, so that they remain compatible with llama.cpp as of May 19th, commit `2d5db48`.

+ They should be compatible with all current UIs and libraries that use llama.cpp, such as those listed at the top of this README.

+ ### New k-quant methods: `q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K`

+ These new quantisation methods are only compatible with llama.cpp as of June 6th, commit `2d43387`.

+ They will NOT be compatible with koboldcpp, text-generation-webui, and other UIs and libraries yet. Support is expected to come over the next few days.
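
If you want to sanity-check which generation of llama.cpp a downloaded file targets, here is a minimal sketch. It assumes the GGJT container layout llama.cpp used at the time: a 4-byte magic (`ggjt`, read on little-endian hosts as the uint32 `0x67676a74`) followed by a uint32 file-format version, which these ggmlv3 files should report as 3.

```
/* ggml_check.c - print the magic and format version of a GGML/GGJT model file.
   Sketch only: assumes the GGJT layout (4-byte magic + little-endian uint32
   version) and a little-endian host. Build: cc -o ggml_check ggml_check.c */
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <model.bin>\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    uint32_t magic = 0, version = 0;
    if (fread(&magic, sizeof magic, 1, f) != 1 ||
        fread(&version, sizeof version, 1, f) != 1) {
        fprintf(stderr, "file too short\n");
        fclose(f);
        return 1;
    }
    fclose(f);

    /* 0x67676a74 is 'ggjt'; these ggmlv3 files should report version 3. */
    printf("magic: 0x%08x  version: %u\n", magic, version);
    return 0;
}
```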

+ ## Explanation of the new k-quant methods

+ The new methods available are:
+ * GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw)
+ * GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
+ * GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
+ * GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
+ * GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
+ * GGML_TYPE_Q8_K - "type-0" 8-bit quantization. Only used for quantizing intermediate results. The difference to the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type.
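
To see where those bpw figures come from, here is a back-of-the-envelope check. It is a sketch based only on the block structure described above (256-weight super-blocks; "type-0" stores one fp16 super-block scale, "type-1" an fp16 scale plus an fp16 min), and it reproduces the stated numbers for Q3_K through Q6_K. For example, Q4_K is 256×4 weight bits + 8×2×6 block scale/min bits + 2×16 super-block fp16 bits = 1152 bits per 256 weights = 4.5 bpw.

```
/* kquant_bpw.c - back-of-the-envelope bits-per-weight for the k-quant types.
   A sketch derived from the block structure described above, not from the
   ggml source; it reproduces the stated bpw for Q3_K..Q6_K. */
#include <stdio.h>

static double bpw(int qbits, int nblocks, int scale_bits, int type1) {
    const int weights = 256;                        /* weights per super-block */
    int bits = weights * qbits;                     /* the quantized weights */
    bits += nblocks * scale_bits * (type1 ? 2 : 1); /* per-block scales (+mins) */
    bits += 16 * (type1 ? 2 : 1);                   /* fp16 super-block scale (+min) */
    return (double)bits / weights;
}

int main(void) {
    printf("Q3_K: %.4f bpw\n", bpw(3, 16, 6, 0)); /* 3.4375 */
    printf("Q4_K: %.4f bpw\n", bpw(4,  8, 6, 1)); /* 4.5000 */
    printf("Q5_K: %.4f bpw\n", bpw(5,  8, 6, 1)); /* 5.5000 */
    printf("Q6_K: %.4f bpw\n", bpw(6, 16, 8, 0)); /* 6.5625 */
    return 0;
}
```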

+ Refer to the Provided Files table below to see what files use which methods, and how.
+ <!-- compatibility_ggml end -->
 

  ## Provided files
  | Name | Quant method | Bits | Size | Max RAM required | Use case |
  | ---- | ---- | ---- | ---- | ---- | ----- |
+ | airoboros-7B.ggmlv3.q2_K.bin | q2_K | 2 | 2.80 GB | 5.30 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
+ | airoboros-7B.ggmlv3.q3_K_L.bin | q3_K_L | 3 | 3.55 GB | 6.05 GB | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
+ | airoboros-7B.ggmlv3.q3_K_M.bin | q3_K_M | 3 | 3.23 GB | 5.73 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
+ | airoboros-7B.ggmlv3.q3_K_S.bin | q3_K_S | 3 | 2.90 GB | 5.40 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors |
+ | airoboros-7B.ggmlv3.q4_K_M.bin | q4_K_M | 4 | 4.05 GB | 6.55 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K |
+ | airoboros-7B.ggmlv3.q4_K_S.bin | q4_K_S | 4 | 3.79 GB | 6.29 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors |
+ | airoboros-7B.ggmlv3.q5_K_M.bin | q5_K_M | 5 | 4.77 GB | 7.27 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K |
+ | airoboros-7B.ggmlv3.q5_K_S.bin | q5_K_S | 5 | 4.63 GB | 7.13 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors |
+ | airoboros-7B.ggmlv3.q6_K.bin | q6_K | 6 | 5.53 GB | 8.03 GB | New k-quant method. Uses GGML_TYPE_Q6_K - 6-bit quantization - for all tensors |
+ | airoboros-7b-gpt4.ggmlv3.q4_0.bin | q4_0 | 4 | 3.79 GB | 6.29 GB | Original llama.cpp quant method, 4-bit. |
+ | airoboros-7b-gpt4.ggmlv3.q4_1.bin | q4_1 | 4 | 4.21 GB | 6.71 GB | Original llama.cpp quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
+ | airoboros-7b-gpt4.ggmlv3.q5_0.bin | q5_0 | 5 | 4.63 GB | 7.13 GB | Original llama.cpp quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
+ | airoboros-7b-gpt4.ggmlv3.q5_1.bin | q5_1 | 5 | 5.06 GB | 7.56 GB | Original llama.cpp quant method, 5-bit. Even higher accuracy, resource usage and slower inference. |
+ | airoboros-7b-gpt4.ggmlv3.q8_0.bin | q8_0 | 8 | 7.16 GB | 9.66 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |

  **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
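
As a rough illustration of that trade-off, the sketch below estimates the RAM/VRAM split at different `-ngl` values. It is an approximation, not llama.cpp's actual accounting: it treats "Max RAM required" as file size plus ~2.5 GB of overhead (consistent with the table above), assumes the 7B model's 32 layers are all offloadable, and assumes each offloaded layer moves an equal share of the file into VRAM.

```
/* offload_estimate.c - rough RAM/VRAM split when offloading layers via -ngl.
   Assumptions (see lead-in): table RAM = file size + ~2.5 GB overhead,
   32 layers in a 7B model, equal weight share per layer. Real usage also
   varies with context size and scratch buffers. */
#include <stdio.h>

int main(void) {
    const double file_gb  = 4.63; /* e.g. the q5_K_S / q5_0 files above */
    const double overhead = 2.50; /* approx. non-weight RAM from the table */
    const int    layers   = 32;   /* LLaMA 7B layer count */

    for (int ngl = 0; ngl <= layers; ngl += 8) {
        double vram = file_gb * ngl / layers;
        double ram  = file_gb - vram + overhead;
        printf("-ngl %2d: ~%.2f GB RAM, ~%.2f GB VRAM\n", ngl, ram, vram);
    }
    return 0;
}
```

At `-ngl 0` this reproduces the table's 7.13 GB figure for the 4.63 GB files.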
 
  I use the following command line; adjust for your tastes and needs:

  ```
+ ./main -t 10 -ngl 32 -m airoboros-7b-gpt4.ggmlv3.q5_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"
  ```
  Change `-t 10` to the number of physical CPU cores you have. For example if your system has 8 cores/16 threads, use `-t 8`.
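
If you'd rather derive a starting `-t` value than look it up, here is a small POSIX sketch. Note the assumption: `sysconf` reports logical CPUs, so halving it is only a guess at the physical core count, and is right only for the common 2-way SMT/Hyper-Threading case.

```
/* threads_hint.c - suggest a -t value from the logical CPU count.
   POSIX sketch: sysconf() counts logical CPUs, so with 2-way SMT enabled,
   halving approximates the physical core count. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long logical = sysconf(_SC_NPROCESSORS_ONLN);
    if (logical < 1) logical = 1;
    long physical_guess = logical > 1 ? logical / 2 : 1; /* assumes 2-way SMT */
    printf("logical CPUs: %ld; try -t %ld (or -t %ld if you have no SMT)\n",
           logical, physical_guess, logical);
    return 0;
}
```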

  * Patreon: https://patreon.com/TheBlokeAI
  * Ko-Fi: https://ko-fi.com/TheBlokeAI

+ **Special thanks to**: Luke from CarbonQuill, Aemon Algiz, Dmitriy Samsonov.
+
+ **Patreon special mentions**: Ajan Kanaga, Kalila, Derek Yates, Sean Connelly, Luke, Nathan LeClaire, Trenton Dambrowitz, Mano Prime, David Flickinger, vamX, Nikolai Manek, senxiiz, Khalefa Al-Ahmad, Illia Dulskyi, trip7s trip, Jonathan Leane, Talal Aujan, Artur Olbinski, Cory Kujawski, Joseph William Delisle, Pyrater, Oscar Rangel, Lone Striker, Luke Pendergrass, Eugene Pentland, Johann-Peter Hartmann.

  Thank you to all my generous patrons and donaters!
+
  <!-- footer end -->

+ # Original model card: Jon Durbin's Airoboros 7B GPT4
+

  ## Overview

  This is a fine-tuned 7b parameter LLaMA model, using completely synthetic training data created by gpt4 via https://github.com/jondurbin/airoboros

  The dataset used to fine-tune this model is available [here](https://huggingface.co/datasets/jondurbin/airoboros-gpt4), with a specific focus on:
  - trivia
  - math/reasoning (although it still sucks)
 
  The Lower Peninsula is a hub for manufacturing, forestry, agriculture, services, and high-tech industry.
  ENDINPUT
  BEGININSTRUCTION
+ What is Michigan's capital?
  Give me a JSON object with 3 facts from Foolo's article.
  What are the states that border Michigan?
  Reference?
 
  perror("malloc failed");
  exit(1);
  }
+
  strcpy(parent->key, key);
  strcpy(parent->value, value);
+
  parent->left = parent->right = NULL;
+
  if (!tree) {
  tree = parent;
  } else {

  }
  current = current->right;
  }
+
  return NULL;
  }

  void* accept_client(void* args) {
  int client_socket = *(int*)args;
  char buffer[1024];
+
  while (1) {
  ssize_t bytes_received = recv(client_socket, buffer, sizeof(buffer), 0);
  if (bytes_received <= 0) {
  close(client_socket);
  break;
  }
+
  char* input = buffer;
  char* input_token;
+
  while ((input_token = strtok(input, " \n")) != NULL) {
  switch (input_token[0]) {
  case 'G':

  }
  printf("%s = %s\n", input_token, search(input_token));
  break;
+
  case 'S':
  if (!strcmp(input_token, "SET")) {
  char* key = strtok(NULL, " \n");

  perror("Invalid input");
  exit(1);
  }
+
  insert(key, value);
  printf("%s set %s to %s\n", input_token, key, value);
  } else {

  exit(1);
  }
  break;
+
  default:
  perror("Invalid input");
  exit(1);
  }
+
  input = strtok(NULL, " \n");
  }
  }
+
  return NULL;
  }

  perror("socket failed");
  exit(1);
  }
+
  struct sockaddr_in server_addr;
  memset(&server_addr, 0, sizeof(server_addr));
  server_addr.sin_family = AF_INET;
  server_addr.sin_port = htons(8080);
  server_addr.sin_addr.s_addr = INADDR_ANY;
+
  if (bind(server_socket, (struct sockaddr*)&server_addr, sizeof(server_addr)) < 0) {
  perror("bind failed");
  exit(1);
  }
+
  if (listen(server_socket, 5) < 0) {
  perror("listen failed");
  exit(1);
  }
+
  pthread_t accept_thread;
  pthread_create(&accept_thread, NULL, accept_client, &server_socket);
+
  char* client_input;
  int client_socket = accept(server_socket, (struct sockaddr*)NULL, NULL);
  if (client_socket < 0) {
  perror("accept failed");
  exit(1);
  }
+
  while (1) {
  sleep(1);
+
  char buffer[1024];
  ssize_t bytes_received = recv(client_socket, buffer, sizeof(buffer), 0);
  if (bytes_received <= 0) {
  close(client_socket);
  break;
  }
+
  client_input = buffer;
  parse_input(client_input);
  }
+
  close(client_socket);
  pthread_join(accept_thread, NULL);
+
  return 0;
  }
  ```