readme: edit formatting & add banner

# DeepSeek-V2-Chat-GGUF

![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/6604e5b21eb292d6df393365/j_LWkNdegeMjQXuAOFZ1N.jpeg)

Quantized from [https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat](https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat).

Using llama.cpp [b3026](https://github.com/ggerganov/llama.cpp/releases/tag/b3026) for quantization. Given the rapid pace of llama.cpp releases, this will likely change over time.

**If you are using an older quant, please set the metadata KV overrides below.**

# Usage:

**Downloading the bf16:**
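
One way to fetch the bf16 shards is with `huggingface-cli` (a sketch: the repo id, include pattern, and local directory below are placeholders to adjust):

```
# illustrative: download only the bf16 shards of this repo
huggingface-cli download <user>/DeepSeek-V2-Chat-GGUF \
  --include "*bf16*" \
  --local-dir ./DeepSeek-V2-Chat-GGUF
```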

...

```
quantize \
...
(--imatrix [file])
```
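
For reference, a complete quantize call generally has this shape (a sketch: the filenames and target type are illustrative):

```
# illustrative: bf16 -> Q4_K_M, optionally weighted by an importance matrix
./quantize --imatrix imatrix.dat \
  DeepSeek-V2-Chat.bf16.gguf DeepSeek-V2-Chat.Q4_K_M.gguf Q4_K_M
```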

Note: use iMatrix quants only if you can fully offload the model to GPU; otherwise speed will suffer noticeably.
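
Full offload means asking llama.cpp to place every layer on the GPU (a sketch: the model file is illustrative, and a quant this size needs correspondingly large VRAM):

```
# illustrative: -ngl 99 exceeds the layer count, so all layers land on GPU
./main -m DeepSeek-V2-Chat.IQ3_XS.gguf -ngl 99 -p "Hello"
```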

# Quants:

| Quant   | Status    | Size      | Description                                 | KV Metadata | Weighted | Notes                              |
|---------|-----------|-----------|---------------------------------------------|-------------|----------|------------------------------------|
| BF16    | Available | 439 GB    | Lossless :)                                 | Old         | No       | Q8_0 is sufficient for most cases  |
| Q8_0    | Uploading | 233.27 GB | High quality, *recommended*                 | Updated     | Yes      |                                    |
| Q4_K_M  | Available | 132 GB    | Medium quality, *recommended*               | Old         | No       |                                    |
| Q3_K_M  | Uploading | 92.6 GB   | Medium-low quality                          | Updated     | Yes      |                                    |
| IQ3_XS  | Available | 89.6 GB   | Better than Q3_K_M                          | Old         | Yes      |                                    |
| Q2_K    | Available | 80.0 GB   | Low quality, **not recommended**            | Old         | No       |                                    |
| IQ2_XXS | Available | 61.5 GB   | Lower quality, **not recommended**          | Old         | Yes      |                                    |
| IQ1_M   | Uploading | 27.3 GB   | Extremely low quality, **not recommended**  | Old         | Yes      | Testing purposes; use IQ2 at least |

# Planned Quants (weighted/iMatrix):

| Planned Quant | Notes |
|---------------|-------|
| Q5_K_M        |       |
| Q5_K_S        |       |
| Q6_K          |       |
| IQ4_XS        |       |
| IQ2_XS        |       |
| IQ2_S         |       |
| IQ2_M         |       |

Metadata KV overrides (pass them with `--override-kv`; the flag can be specified multiple times):

```
deepseek2.attention.q_lora_rank=int:1536
deepseek2.attention.kv_lora_rank=int:512
...
deepseek2.rope.scaling.yarn_log_multiplier=float:0.0707
```
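
For example, loading an older quant with the overrides applied looks like this (a sketch: the model path and prompt are illustrative, and the overrides elided above must be passed as well):

```
# illustrative: supply the missing metadata at load time
./main -m DeepSeek-V2-Chat.Q4_K_M.gguf \
  --override-kv deepseek2.attention.q_lora_rank=int:1536 \
  --override-kv deepseek2.attention.kv_lora_rank=int:512 \
  --override-kv deepseek2.rope.scaling.yarn_log_multiplier=float:0.0707 \
  -p "Hello"
```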

The `Q8_0` quant (and future uploads) already contains these parameters, so as long as you're running a supported build of llama.cpp, no `--override-kv` flags are required.

A precompiled AVX2 build is available as `llama.cpp-039896407afd40e54321d47c5063c46a52da3e01.zip` in the root of this repo.

...

- MIT license for any repo code

# Performance:

*~1.5 t/s* on a Ryzen 7 3700X (96 GB RAM at 3200 MHz) `[Q2_K]`

# iMatrix:

Find `imatrix.dat` in the root of this repo; it was made with a `Q2_K` quant (see here for background: [https://github.com/ggerganov/llama.cpp/issues/5153#issuecomment-1913185693](https://github.com/ggerganov/llama.cpp/issues/5153#issuecomment-1913185693)).

It was generated using `groups_merged.txt`; find it here: [https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384](https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384)
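
An imatrix like this is produced with llama.cpp's `imatrix` tool (a sketch: the paths are illustrative):

```
# illustrative: compute an importance matrix over the calibration text
./imatrix -m DeepSeek-V2-Chat.Q2_K.gguf -f groups_merged.txt -o imatrix.dat
```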

# Censorship:

This model is somewhat censored; fine-tuning on toxic DPO might help.
|