---
license: apache-2.0
base_model:
- deepseek-ai/DeepSeek-R1
---

# Q4_K Quant of DeepSeek-R1 for the MLA fork pull request

## Requires this custom build of llama.cpp:

https://github.com/ggerganov/llama.cpp/pull/11446
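
If you want to build the PR yourself, a minimal sketch follows. It assumes GitHub's `pull/<id>/head` refs on the upstream repo; the local branch name `mla-pr-11446` is just an example, and the build flags mirror the ones used in the script later in this README.

```bash
# Sketch only: fetch and build the MLA pull request (#11446) from upstream llama.cpp.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git fetch origin pull/11446/head:mla-pr-11446
git checkout mla-pr-11446
cmake -B build
cmake --build build --config Release
```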

**IMPORTANT NOTE**

If you try to load this with the `main` branch of llama.cpp you'll see an error like this:

```
load_tensors: loading model tensors, this can take a while... (mmap = true)
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 1147, got 1025
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/mount/checkpoints/DeepSeek-R1-11446-Q2_K-00001-of-00030.gguf'
srv load_model: failed to load model, '/mount/checkpoints/DeepSeek-R1-11446-Q2_K-00001-of-00030.gguf'
srv operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
terminate called without an active exception
Aborted (core dumped)
```
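
Before loading, it can help to confirm which build you are actually pointing at. A minimal check, assuming your build exposes the common `--version` flag and the build directory layout used in the script below:

```bash
# Print version/build info for the binary you are about to use; if it reports an
# upstream main-branch build instead of the PR build, expect the tensor-count error above.
./llama.cpp/build/bin/llama-server --version
```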

There's a Q3_K_M version here: [daydream-org/DeepSeek-R1-GGUF-11446](https://huggingface.co/daydream-org/DeepSeek-R1-GGUF-11446)

This quant was created using the script below, written by [evshiron](https://huggingface.co/evshiron):

```bash
export WORK_DIR=$(pwd)
python3 -m venv venv
source venv/bin/activate
pip3 install -U "huggingface_hub[cli]"

# the fp8 checkpoints are around 700GB
mkdir checkpoints
huggingface-cli download --resume-download --local-dir checkpoints/DeepSeek-R1 deepseek-ai/DeepSeek-R1

# evshiron's fork of llama.cpp includes PR #11446 plus changes that allow converting the fp8 HF
# checkpoints to a bf16 GGUF directly using triton(-cpu), without intermediate checkpoints
git clone https://github.com/evshiron/llama.cpp --recursive
pushd llama.cpp
pip3 install -r requirements/requirements-convert_hf_to_gguf.txt
cmake -B build
cmake --build build --config Release
popd

# install triton-cpu for CPU-only dequantization
git clone https://github.com/triton-lang/triton-cpu --recursive
pushd triton-cpu
pip3 install ninja cmake wheel pybind11
MAX_JOBS=32 pip3 install -e python
popd

# the conversion takes an hour or more depending on your hardware; the bf16 checkpoints are around 1.3TB
# the dequant process may take more than 64GB RAM, but should be doable within 360GB RAM
python3 llama.cpp/convert_hf_to_gguf.py --outtype bf16 --split-max-size 50G checkpoints/DeepSeek-R1

# removing the fp8 checkpoints frees up 700GB
mkdir checkpoints/DeepSeek-R1-BF16
mv checkpoints/DeepSeek-R1/*.gguf checkpoints/DeepSeek-R1-BF16
rm -r checkpoints/DeepSeek-R1

# then use llama-quantize to make the quants you want; Q4_K_M should be around 400GB
./llama.cpp/build/bin/llama-quantize --keep-split checkpoints/DeepSeek-R1-BF16/<THE_FIRST_OF_DeepSeek-R1-BF16_GGUF>.gguf Q4_K_M
```
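
Once quantized, the splits can be served with the PR build. A usage sketch, with the model path left as a placeholder since the exact split file names depend on your quantize output:

```bash
# Sketch only: point llama-server from the PR build at the first split of the quantized model;
# the remaining splits in the same directory are picked up automatically.
./llama.cpp/build/bin/llama-server \
  -m checkpoints/DeepSeek-R1-BF16/<THE_FIRST_OF_YOUR_Q4_K_M_SPLITS>.gguf \
  --ctx-size 4096
```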

The whole process took 16 hours on an EC2 instance, so I figured I'd share it.

Script Credit/Source: [daydream-org/DeepSeek-R1-GGUF-11446](https://huggingface.co/daydream-org/DeepSeek-R1-GGUF-11446/discussions/1#67a327570051a98a96ded9e6)