---
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- code
- deepseek
- gguf
- bf16
metrics:
- accuracy
language:
- en
- zh
---
# DeepSeek-V2-Chat-GGUF
![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/6604e5b21eb292d6df393365/j_LWkNdegeMjQXuAOFZ1N.jpeg)
Quantized from [https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat](https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat)
Quantization was done with llama.cpp [b3026](https://github.com/ggerganov/llama.cpp/releases/tag/b3026). Given the rapid pace of llama.cpp releases, this will likely change over time.
**If you are using an older quant, please set the metadata KV overrides below.**
# Usage:
**Downloading the bf16:**
- Find the relevant directory
- Download all files
- Run merge.py (see the sketch below)
- The merged GGUF should appear
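For example, a minimal download-and-merge sketch using `huggingface-cli` (the repo ID and include pattern are placeholders, and the `merge.py` invocation assumes the script sits alongside the split files; check the actual file listing in this repo):
```
# Download only the bf16 split files (repo ID and path pattern are placeholders)
huggingface-cli download {repo_id} --include "*bf16*" --local-dir .

# Merge the splits back into a single GGUF using the provided script
# (pass arguments, if any, according to the script's usage)
python merge.py
```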
**Downloading the quantizations:**
- Find the relevant directory
- Download all files
- Point to the first split (most programs should now load all the splits automatically); see the example below
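A sketch of pulling one quant's splits with `huggingface-cli` (the include pattern is a placeholder; check the actual file names in the repo):
```
# Download every split file belonging to the chosen quant
huggingface-cli download {repo_id} --include "*Q4_K_M*" --local-dir .
```
Then pass the first split (e.g. `DeepSeek-V2-Chat.Q4_K_M-00001-of-000NN.gguf`, naming assumed) to `-m`; recent llama.cpp builds pick up the remaining splits automatically.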
**Running in llama.cpp:**
To start in command line chat mode (chat completion):
```
main -m DeepSeek-V2-Chat.{quant}.gguf -c {context_length} --color -i
```
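For example, with the placeholders filled in (the quant choice and context length here are arbitrary):
```
main -m DeepSeek-V2-Chat.Q4_K_M.gguf -c 4096 --color -i
```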
To use llama.cpp's OpenAI compatible server:
```
server \
-m DeepSeek-V2-Chat.{quant}.gguf \
-c {context_length} \
(--color [recommended: colored output in supported terminals]) \
(-i [note: interactive mode]) \
(--mlock [note: avoid using swap]) \
(--verbose) \
(--log-disable [note: disable logging to file, may be useful for prod]) \
(--metrics [note: prometheus compatible monitoring endpoint]) \
(--api-key [string]) \
(--port [int]) \
(--flash-attn [note: must be fully offloaded to supported GPU])
```
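Once the server is running (default port 8080 unless `--port` is set), it exposes an OpenAI-compatible endpoint; a minimal request sketch:
```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Write a function that merges two sorted lists."}
        ],
        "temperature": 0.7
      }'
```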
Making an importance matrix:
```
imatrix \
-m DeepSeek-V2-Chat.{quant}.gguf \
-f groups_merged.txt \
--verbosity [0, 1, 2] \
-ngl {GPU offloading; must build with CUDA} \
--ofreq {recommended: 1}
```
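A filled-in example (the source quant, `-ngl` value, and output path are placeholders; `-o` defaults to `imatrix.dat`):
```
imatrix \
  -m DeepSeek-V2-Chat.Q2_K.gguf \
  -f groups_merged.txt \
  --verbosity 1 \
  -ngl 10 \
  --ofreq 1 \
  -o imatrix.dat
```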
Making a quant:
```
quantize \
(--imatrix [file]) \
DeepSeek-V2-Chat.bf16.gguf \
DeepSeek-V2-Chat.{quant}.gguf \
{quant}
```
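For example, producing a weighted IQ3_XS quant from the merged bf16 with the imatrix from this repo (file names assumed to match the listing above):
```
quantize \
  --imatrix imatrix.dat \
  DeepSeek-V2-Chat.bf16.gguf \
  DeepSeek-V2-Chat.IQ3_XS.gguf \
  IQ3_XS
```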
Note: use iMatrix (weighted) quants only if you can fully offload the model to GPU; otherwise inference speed will suffer.
# Quants:
| Quant | Status | Size | Description | KV Metadata | Weighted | Notes |
|----------|-------------|-----------|--------------------------------------------|-------------|----------|-------|
| BF16 | Available | 439 GB | Lossless :) | Old | No | Q8_0 is sufficient for most cases |
| Q8_0 | Available | 233.27 GB | High quality *recommended* | Updated | Yes | |
| Q5_K_M   | Uploading   | 155 GB    | Medium-high quality                        | Updated     | Yes      | |
| Q4_K_M | Available | 132 GB | Medium quality *recommended* | Old | No | |
| Q3_K_M | Available | 104 GB | Medium-low quality | Updated | Yes | |
| IQ3_XS | Available | 89.6 GB | Better than Q3_K_M | Old | Yes | |
| Q2_K | Available | 80.0 GB | Low quality **not recommended** | Old | No | |
| IQ2_XXS | Available | 61.5 GB | Lower quality **not recommended** | Old | Yes | |
| IQ1_M | Uploading | 27.3 GB | Extremely low quality **not recommended** | Old | Yes | Testing purposes; use IQ2 at least |
# Planned Quants (weighted/iMatrix):
| Planned Quant | Notes |
|-------------------|---------|
| Q5_K_S | |
| Q4_K_S | |
| Q3_K_S | |
| Q6_K | |
| IQ4_XS | |
| IQ2_XS | |
| IQ2_S | |
| IQ2_M | |
Metadata KV overrides (pass each one with `--override-kv`; the flag can be specified multiple times):
```
deepseek2.attention.q_lora_rank=int:1536
deepseek2.attention.kv_lora_rank=int:512
deepseek2.expert_shared_count=int:2
deepseek2.expert_feed_forward_length=int:1536
deepseek2.expert_weights_scale=float:16
deepseek2.leading_dense_block_count=int:1
deepseek2.rope.scaling.yarn_log_multiplier=float:0.0707
```
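For example, running an older quant in command-line chat mode with all overrides applied (the model file and context length are placeholders):
```
main \
  -m DeepSeek-V2-Chat.Q4_K_M.gguf \
  -c 4096 --color -i \
  --override-kv deepseek2.attention.q_lora_rank=int:1536 \
  --override-kv deepseek2.attention.kv_lora_rank=int:512 \
  --override-kv deepseek2.expert_shared_count=int:2 \
  --override-kv deepseek2.expert_feed_forward_length=int:1536 \
  --override-kv deepseek2.expert_weights_scale=float:16 \
  --override-kv deepseek2.leading_dense_block_count=int:1 \
  --override-kv deepseek2.rope.scaling.yarn_log_multiplier=float:0.0707
```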
Quants with "Updated" metadata already contain these parameters, so as long as you're running a supported build of llama.cpp, no `--override-kv` flags are required.
A precompiled Windows AVX2 build is available as `llama.cpp-039896407afd40e54321d47c5063c46a52da3e01.zip` in the root of this repo.
# License:
- DeepSeek license for model weights, which can be found in the `LICENSE` file in the root of this repo
- MIT license for any repo code
# Performance:
*~1.5 t/s* on a Ryzen 7 3700X with 96 GB of DDR4-3200 RAM `[Q2_K]`
# iMatrix:
Find `imatrix.dat` in the root of this repo. It was generated from the `Q2_K` quant using 62 chunks of calibration data (background: [https://github.com/ggerganov/llama.cpp/issues/5153#issuecomment-1913185693](https://github.com/ggerganov/llama.cpp/issues/5153#issuecomment-1913185693)).
The calibration file is `groups_merged.txt`, available here: [https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384](https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384)
# Censorship:
This model is somewhat censored; fine-tuning on a toxic DPO dataset might help.