---
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- code
- deepseek
- gguf
- bf16
metrics:
- accuracy
language:
- en
- zh
---
# DeepSeek-V2-Chat-GGUF
![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/6604e5b21eb292d6df393365/j_LWkNdegeMjQXuAOFZ1N.jpeg)
Quantized from [https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat](https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat)
Quantization was done with llama.cpp [b3026](https://github.com/ggerganov/llama.cpp/releases/tag/b3026). Given the rapid pace of llama.cpp releases, this will likely change over time.
**If you are using an older quant, please set the metadata KV overrides below.**
# Usage:
**Downloading the bf16:**
- Find the relevant directory
- Download all files
- Run merge.py (see the sketch below)
- The merged GGUF should appear
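For example, a minimal download-and-merge sketch using `huggingface-cli` (the repo ID and include pattern are placeholders, and the `merge.py` invocation assumes the script sits alongside the split files; check the actual file listing in this repo):
```
# Download only the bf16 split files (repo ID and path pattern are placeholders)
huggingface-cli download {repo_id} --include "*bf16*" --local-dir .

# Merge the splits back into a single GGUF using the provided script
# (pass arguments, if any, according to the script's usage)
python merge.py
```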
**Downloading the quantizations:**
- Find the relevant directory
- Download all files
- Point to the first split (most programs should now load all the splits automatically); see the example below
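A sketch of pulling one quant's splits with `huggingface-cli` (the include pattern is a placeholder; check the actual file names in the repo):
```
# Download every split file belonging to the chosen quant
huggingface-cli download {repo_id} --include "*Q4_K_M*" --local-dir .
```
Then pass the first split (e.g. `DeepSeek-V2-Chat.Q4_K_M-00001-of-000NN.gguf`, naming assumed) to `-m`; recent llama.cpp builds pick up the remaining splits automatically.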
**Running in llama.cpp:**
To start in command line chat mode (chat completion):
```
main -m DeepSeek-V2-Chat.{quant}.gguf -c {context_length} --color -i
```
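For example, with the placeholders filled in (the quant choice and context length here are arbitrary):
```
main -m DeepSeek-V2-Chat.Q4_K_M.gguf -c 4096 --color -i
```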
To use llama.cpp's OpenAI compatible server:
```
server \
-m DeepSeek-V2-Chat.{quant}.gguf \
-c {context_length} \
(--color [recommended: colored output in supported terminals]) \
(-i [note: interactive mode]) \
(--mlock [note: avoid using swap]) \
(--verbose) \
(--log-disable [note: disable logging to file, may be useful for prod]) \
(--metrics [note: prometheus compatible monitoring endpoint]) \
(--api-key [string]) \
(--port [int]) \
(--flash-attn [note: must be fully offloaded to supported GPU])
```
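Once the server is running (default port 8080 unless `--port` is set), it exposes an OpenAI-compatible endpoint; a minimal request sketch:
```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Write a function that merges two sorted lists."}
        ],
        "temperature": 0.7
      }'
```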
Making an importance matrix:
```
imatrix \
-m DeepSeek-V2-Chat.{quant}.gguf \
-f groups_merged.txt \
--verbosity [0, 1, 2] \
-ngl {GPU offloading; must build with CUDA} \
--ofreq {recommended: 1}
```
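A filled-in example (the source quant, `-ngl` value, and output path are placeholders; `-o` defaults to `imatrix.dat`):
```
imatrix \
  -m DeepSeek-V2-Chat.Q2_K.gguf \
  -f groups_merged.txt \
  --verbosity 1 \
  -ngl 10 \
  --ofreq 1 \
  -o imatrix.dat
```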
Making a quant:
```
quantize \
(--imatrix [file]) \
DeepSeek-V2-Chat.bf16.gguf \
DeepSeek-V2-Chat.{quant}.gguf \
{quant}
```
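For example, producing a weighted IQ3_XS quant from the merged bf16 with the imatrix from this repo (file names assumed to match the listing above):
```
quantize \
  --imatrix imatrix.dat \
  DeepSeek-V2-Chat.bf16.gguf \
  DeepSeek-V2-Chat.IQ3_XS.gguf \
  IQ3_XS
```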
Note: use iMatrix (weighted) quants only if you can fully offload the model to GPU; otherwise inference speed will suffer.
# Quants:
| Quant | Status | Size | Description | KV Metadata | Weighted | Notes |
|----------|-------------|-----------|--------------------------------------------|-------------|----------|-------|
| BF16 | Available | 439 GB | Lossless :) | Old | No | Q8_0 is sufficient for most cases |
| Q8_0 | Available | 233.27 GB | High quality *recommended* | Updated | Yes | |
| Q5_K_M   | Uploading   | 155 GB    | Medium-high quality                        | Updated     | Yes      | |
| Q4_K_M | Available | 132 GB | Medium quality *recommended* | Old | No | |
| Q3_K_M | Available | 104 GB | Medium-low quality | Updated | Yes | |
| IQ3_XS | Available | 89.6 GB | Better than Q3_K_M | Old | Yes | |
| Q2_K | Available | 80.0 GB | Low quality **not recommended** | Old | No | |
| IQ2_XXS | Available | 61.5 GB | Lower quality **not recommended** | Old | Yes | |
| IQ1_M | Uploading | 27.3 GB | Extremely low quality **not recommended** | Old | Yes | Testing purposes; use IQ2 at least |
# Planned Quants (weighted/iMatrix):
| Planned Quant | Notes |
|-------------------|---------|
| Q5_K_S | |
| Q4_K_S | |
| Q3_K_S | |
| Q6_K | |
| IQ4_XS | |
| IQ2_XS | |
| IQ2_S | |
| IQ2_M | |
Metadata KV overrides (pass each one with `--override-kv`; the flag can be specified multiple times):
```
deepseek2.attention.q_lora_rank=int:1536
deepseek2.attention.kv_lora_rank=int:512
deepseek2.expert_shared_count=int:2
deepseek2.expert_feed_forward_length=int:1536
deepseek2.expert_weights_scale=float:16
deepseek2.leading_dense_block_count=int:1
deepseek2.rope.scaling.yarn_log_multiplier=float:0.0707
```
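For example, running an older quant in command-line chat mode with all overrides applied (the model file and context length are placeholders):
```
main \
  -m DeepSeek-V2-Chat.Q4_K_M.gguf \
  -c 4096 --color -i \
  --override-kv deepseek2.attention.q_lora_rank=int:1536 \
  --override-kv deepseek2.attention.kv_lora_rank=int:512 \
  --override-kv deepseek2.expert_shared_count=int:2 \
  --override-kv deepseek2.expert_feed_forward_length=int:1536 \
  --override-kv deepseek2.expert_weights_scale=float:16 \
  --override-kv deepseek2.leading_dense_block_count=int:1 \
  --override-kv deepseek2.rope.scaling.yarn_log_multiplier=float:0.0707
```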
Quants with "Updated" metadata already contain these parameters, so as long as you're running a supported build of llama.cpp, no `--override-kv` flags are required.
A precompiled Windows AVX2 build is available as `llama.cpp-039896407afd40e54321d47c5063c46a52da3e01.zip` in the root of this repo.
# License:
- DeepSeek license for model weights, which can be found in the `LICENSE` file in the root of this repo
- MIT license for any repo code
# Performance:
*~1.5 t/s* on a Ryzen 7 3700X with 96 GB of DDR4-3200 RAM `[Q2_K]`
# iMatrix:
Find `imatrix.dat` in the root of this repo. It was generated from the `Q2_K` quant using 62 chunks of calibration data (background: [https://github.com/ggerganov/llama.cpp/issues/5153#issuecomment-1913185693](https://github.com/ggerganov/llama.cpp/issues/5153#issuecomment-1913185693)).
The calibration file is `groups_merged.txt`, available here: [https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384](https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384)
# Censorship:
This model is somewhat censored; fine-tuning on a toxic DPO dataset might help.