TheBloke committed on
Commit
cd2578c
1 Parent(s): e40e5d7

New GGMLv3 format for breaking llama.cpp change May 19th commit 2d5db48

Files changed (1)
  1. README.md +13 -12
README.md CHANGED
@@ -26,29 +26,28 @@ GGML files are for CPU inference using [llama.cpp](https://github.com/ggerganov/
  * [4bit and 5bit GGML models for CPU inference](https://huggingface.co/TheBloke/h2ogpt-oasst1-512-30B-GGML).
  * [float16 HF format unquantised model for GPU inference and further conversions](https://huggingface.co/TheBloke/h2ogpt-oasst1-512-30B-HF)

- ## THESE FILES REQUIRE LATEST LLAMA.CPP (May 12th 2023 - commit b9fd7ee)!
+ ## THE FILES IN THE MAIN BRANCH REQUIRE THE LATEST LLAMA.CPP (May 19th 2023 - commit 2d5db48)!

- llama.cpp recently made a breaking change to its quantisation methods.
+ llama.cpp recently made another breaking change to its quantisation methods - https://github.com/ggerganov/llama.cpp/pull/1508

- I have quantised the GGML files in this repo with the latest version. Therefore you will require llama.cpp compiled on May 12th or later (commit `b9fd7ee` or later) to use them.
-
- If you are currently unable to update llama.cpp, eg because you use a UI which hasn't updated yet, you can find GGML files for the previous version of llama.cpp in the `previous_llama` branch.
+ I have quantised the GGML files in this repo with the latest version. Therefore you will require llama.cpp compiled on May 19th or later (commit `2d5db48` or later) to use them.

+ For files compatible with the previous version of llama.cpp, please see branch `previous_llama_ggmlv2`.
  ## Provided files
  | Name | Quant method | Bits | Size | RAM required | Use case |
  | ---- | ---- | ---- | ---- | ---- | ----- |
- `h2ogptq-oasst1-512-30B.ggml.q4_0.bin` | q4_0 | 4bit | 20.3GB | 25GB | 4-bit. |
- `h2ogptq-oasst1-512-30B.ggml.q4_1.bin` | q4_1 | 4bit | 24.4GB | 26GB | 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
- `h2ogptq-oasst1-512-30B.ggml.q5_0.bin` | q5_0 | 5bit | 22.4GB | 25GB | 5-bit. Higher accuracy, higher resource usage and slower inference. |
- `h2ogptq-oasst1-512-30B.ggml.q5_1.bin` | q5_1 | 5bit | 24.4GB | 26GB | 5-bit. Even higher accuracy, and higher resource usage and slower inference.|
- `h2ogptq-oasst1-512-30B.ggml.q8_0.bin` | q8_0 | 8bit | 36.6GB | 39GB | 8-bit. Almost indistinguishable from float16. Huge resource use and slow. Not recommended for normal use. |
+ `h2ogptq-oasst1-512-30B.ggmlv3.q4_0.bin` | q4_0 | 4bit | 20.3GB | 25GB | 4-bit. |
+ `h2ogptq-oasst1-512-30B.ggmlv3.q4_1.bin` | q4_1 | 4bit | 24.4GB | 26GB | 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
+ `h2ogptq-oasst1-512-30B.ggmlv3.q5_0.bin` | q5_0 | 5bit | 22.4GB | 25GB | 5-bit. Higher accuracy, higher resource usage and slower inference. |
+ `h2ogptq-oasst1-512-30B.ggmlv3.q5_1.bin` | q5_1 | 5bit | 24.4GB | 26GB | 5-bit. Even higher accuracy, and higher resource usage and slower inference.|
+ `h2ogptq-oasst1-512-30B.ggmlv3.q8_0.bin` | q8_0 | 8bit | 36.6GB | 39GB | 8-bit. Almost indistinguishable from float16. Huge resource use and slow. Not recommended for normal use. |

  ## How to run in `llama.cpp`

  I use the following command line; adjust for your tastes and needs:

  ```
- ./main -t 8 -m h2ogptq-oasst1-512-30B.ggml.q5_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Below is an instruction that describes a task. Write a response that appropriately completes the request.
+ ./main -t 8 -m h2ogptq-oasst1-512-30B.ggmlv3.q5_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Below is an instruction that describes a task. Write a response that appropriately completes the request.
  ### Instruction:
  Write a story about llamas
  ### Response:"

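As a rough sketch of what the requirement above means in practice: build llama.cpp from a checkout that already contains commit `2d5db48` (May 19th 2023 or later), then point `./main` at one of the GGMLv3 files from the table. The steps below assume a plain `make` build and that the q5_0 file keeps the name shown in the table; adjust to your platform.

```
# Build a llama.cpp recent enough for the GGMLv3 files in this repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git log --oneline | grep 2d5db48   # sanity check: the GGMLv3 commit should appear in the history
make                               # plain CPU build; see the llama.cpp README for other build options

# Download one of the quantised files listed in the table (q5_0 shown, roughly 22GB)
wget https://huggingface.co/TheBloke/h2ogpt-oasst1-512-30B-GGML/resolve/main/h2ogptq-oasst1-512-30B.ggmlv3.q5_0.bin
```
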
@@ -63,6 +62,8 @@ GGML models can be loaded into text-generation-webui by installing the llama.cpp

  Further instructions here: [text-generation-webui/docs/llama.cpp-models.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md).

+ Note: at this time text-generation-webui may not support the new May 19th llama.cpp quantisation methods for q4_0, q4_1 and q8_0 files.
+
  # Original h2oGPT Model Card
  ## Summary

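For the text-generation-webui route, the linked doc has the authoritative steps; as a minimal sketch (assuming a standard text-generation-webui checkout, and using a q5_0 file since the note above flags q4_0, q4_1 and q8_0 as potentially unsupported), the GGML file goes into the webui's `models/` folder and is selected at launch:

```
# Sketch only: place the GGMLv3 file where text-generation-webui looks for models
cp h2ogptq-oasst1-512-30B.ggmlv3.q5_0.bin text-generation-webui/models/

# Start the webui with that model preselected; adjust paths and flags to your setup
cd text-generation-webui
python server.py --model h2ogptq-oasst1-512-30B.ggmlv3.q5_0.bin
```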
 
@@ -284,4 +285,4 @@ Please read this disclaimer carefully before using the large language model prov
  - Reporting Issues: If you encounter any biased, offensive, or otherwise inappropriate content generated by the large language model, please report it to the repository maintainers through the provided channels. Your feedback will help improve the model and mitigate potential issues.
  - Changes to this Disclaimer: The developers of this repository reserve the right to modify or update this disclaimer at any time without prior notice. It is the user's responsibility to periodically review the disclaimer to stay informed about any changes.

- By using the large language model provided in this repository, you agree to accept and comply with the terms and conditions outlined in this disclaimer. If you do not agree with any part of this disclaimer, you should refrain from using the model and any content generated by it.
+ By using the large language model provided in this repository, you agree to accept and comply with the terms and conditions outlined in this disclaimer. If you do not agree with any part of this disclaimer, you should refrain from using the model and any content generated by it.
 
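For anyone who cannot update llama.cpp yet, the `previous_llama_ggmlv2` branch mentioned above carries the older-format files; a sketch, assuming those files keep the `.ggml.q*.bin` naming shown in the removed table rows:

```
# Fetch a GGMLv2-era file directly from the previous_llama_ggmlv2 branch (file name assumed)
wget https://huggingface.co/TheBloke/h2ogpt-oasst1-512-30B-GGML/resolve/previous_llama_ggmlv2/h2ogptq-oasst1-512-30B.ggml.q5_0.bin

# Or clone just that branch with git-lfs
git lfs install
git clone --branch previous_llama_ggmlv2 https://huggingface.co/TheBloke/h2ogpt-oasst1-512-30B-GGML
```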