metadata

license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
  - code
  - deepseek
  - gguf
  - bf16
metrics:
  - accuracy
language:
  - en
  - zh

DeepSeek-V2-Chat-GGUF

Quantizised from https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat

Using llama.cpp b3026 for quantizisation. Given the rapid release of llama.cpp builds, this will likely change over time.

If you are using an older quant, please set the metadata KV overrides below.

Usage:

Downloading the bf16:

Find the relevant directory
Download all files
Run merge.py
Merged GGUF should appear

Downloading the quantizations:

Find the relevant directory
Download all files
Point to the first split (most programs should load all the splits automatically now)

Running in llama.cpp:

To start in command line chat mode (chat completion):

main -m DeepSeek-V2-Chat.{quant}.gguf -c {context length} --color -c (-i)

To use llama.cpp's OpenAI compatible server:

server \
  -m DeepSeek-V2-Chat.{quant}.gguf \
  -c {context_length} \
  (--color [recommended: colored output in supported terminals]) \
  (-i [note: interactive mode]) \
  (--mlock [note: avoid using swap]) \
  (--verbose) \
  (--log-disable [note: disable logging to file, may be useful for prod]) \
  (--metrics [note: prometheus compatible monitoring endpoint]) \
  (--api-key [string]) \
  (--port [int]) \
  (--flash-attn [note: must be fully offloaded to supported GPU])

Making an importance matrix:

imatrix \
  -m DeepSeek-V2-Chat.{quant}.gguf \
  -f groups_merged.txt \
  --verbosity [0, 1, 2] \
  -ngl {GPU offloading; must build with CUDA} \
  --ofreq {recommended: 1}

Making a quant:

quantize \
  DeepSeek-V2-Chat.bf16.gguf \
  DeepSeek-V2-Chat.{quant}.gguf \
  {quant} \
  (--imatrix [file])

Note: Use iMatrix quants only if you can fully offload to GPU, otherwise speed will be affected negatively.

Quants:

Quant	Status	Size	Description	KV Metadata	Weighted	Notes
BF16	Available	439 GB	Lossless :)	Old	No	Q8_0 is sufficient for most cases
Q8_0	Available	233.27 GB	High quality recommended	Updated	Yes
Q5_K_M	Uploading	155 GB	Medium-low quality	Updated	Yes
Q4_K_M	Available	132 GB	Medium quality recommended	Old	No
Q3_K_M	Available	104 GB	Medium-low quality	Updated	Yes
IQ3_XS	Available	89.6 GB	Better than Q3_K_M	Old	Yes
Q2_K	Available	80.0 GB	Low quality not recommended	Old	No
IQ2_XXS	Available	61.5 GB	Lower quality not recommended	Old	Yes
IQ1_M	Uploading	27.3 GB	Extremely low quality not recommended	Old	Yes	Testing purposes; use IQ2 at least

Planned Quants (weighted/iMatrix):

Planned Quant	Notes
Q5_K_S
Q4_K_S
Q3_K_S
Q6_K
IQ4_XS
IQ2_XS
IQ2_S
IQ2_M

Metadata KV overrides (pass them using --override-kv, can be specified multiple times):

deepseek2.attention.q_lora_rank=int:1536
deepseek2.attention.kv_lora_rank=int:512
deepseek2.expert_shared_count=int:2
deepseek2.expert_feed_forward_length=int:1536
deepseek2.expert_weights_scale=float:16
deepseek2.leading_dense_block_count=int:1
deepseek2.rope.scaling.yarn_log_multiplier=float:0.0707

Quants with "Updated" metadata contain these parameters, so as long as you're running a supported build of llama.cpp no --override-kv parameters are required.

A precompiled Windows AVX2 version is avaliable at llama.cpp-039896407afd40e54321d47c5063c46a52da3e01.zip in the root of this repo.

License:

DeepSeek license for model weights, which can be found in the LICENSE file in the root of this repo
MIT license for any repo code

Performance:

~1.5t/s with Ryzen 3 3700x (96gb 3200mhz) [Q2_K]

iMatrix:

Find imatrix.dat in the root of this repo, made with a Q2_K quant containing 62 chunks (see here for info: https://github.com/ggerganov/llama.cpp/issues/5153#issuecomment-1913185693)

Using groups_merged.txt, find it here: https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384

Censorship:

This model is a bit censored, finetuning on toxic DPO might help.