---
license: apache-2.0
base_model:
- deepseek-ai/DeepSeek-R1
---

# Q4_K Quant of DeepSeek-R1 for the MLA fork pull request

## Requires this custom build of llama.cpp:

https://github.com/ggerganov/llama.cpp/pull/11446
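
If you would rather build the upstream PR branch directly instead of using a fork, a minimal sketch is below (untested here; it assumes git and CMake are installed and uses GitHub's standard `pull/<id>/head` refspec to fetch the PR):

```bash
# Fetch and build the llama.cpp PR #11446 branch.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git fetch origin pull/11446/head:pr-11446
git checkout pr-11446
cmake -B build
cmake --build build --config Release
```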

**IMPORTANT NOTE**

If you try to load this with the `main` branch of llama.cpp, you'll see an error like this:

```
load_tensors: loading model tensors, this can take a while... (mmap = true)
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 1147, got 1025
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/mount/checkpoints/DeepSeek-R1-11446-Q2_K-00001-of-00030.gguf'
srv load_model: failed to load model, '/mount/checkpoints/DeepSeek-R1-11446-Q2_K-00001-of-00030.gguf'
srv operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
terminate called without an active exception
Aborted (core dumped)
```
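
A quick sanity check before loading is to confirm which build you are actually running, assuming your binary exposes the `--version` flag that current llama.cpp builds include (output format may vary between builds):

```bash
# Prints the build number and commit hash; verify it matches the PR build rather than main.
./llama.cpp/build/bin/llama-cli --version
```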

There's a Q3_K_M version here: [daydream-org/DeepSeek-R1-GGUF-11446](https://huggingface.co/daydream-org/DeepSeek-R1-GGUF-11446)

Created using the script below by [evshiron](https://huggingface.co/evshiron):

```bash
export WORK_DIR=$(pwd)
python3 -m venv venv
source venv/bin/activate
pip3 install -U "huggingface_hub[cli]"

# the fp8 checkpoints are around 700GB
mkdir checkpoints
huggingface-cli download --resume-download --local-dir checkpoints/DeepSeek-R1 deepseek-ai/DeepSeek-R1

# my fork of llama.cpp, including PR #11446 and changes to allow converting fp8 HF to bf16 GGUF directly using triton(-cpu), without the need for intermediate checkpoints
git clone https://github.com/evshiron/llama.cpp --recursive
pushd llama.cpp
pip3 install -r requirements/requirements-convert_hf_to_gguf.txt
cmake -B build
cmake --build build --config Release
popd

# install triton-cpu for CPU-only dequant
git clone https://github.com/triton-lang/triton-cpu --recursive
pushd triton-cpu
pip3 install ninja cmake wheel pybind11
MAX_JOBS=32 pip3 install -e python
popd

# hopefully this should work; it takes an hour or more depending on your hardware, and the bf16 checkpoints are around 1.3TB
# the dequant process may take more than 64GB RAM, but should be doable within 360GB RAM
python3 llama.cpp/convert_hf_to_gguf.py --outtype bf16 --split-max-size 50G checkpoints/DeepSeek-R1

# removing the fp8 checkpoints gives us 700GB back
mkdir checkpoints/DeepSeek-R1-BF16
mv checkpoints/DeepSeek-R1/*.gguf checkpoints/DeepSeek-R1-BF16
rm -r checkpoints/DeepSeek-R1

# then use llama-quantize to make the quants you want; Q4_K_M should be around 400GB?
./llama.cpp/build/bin/llama-quantize --keep-split checkpoints/DeepSeek-R1-BF16/<THE_FIRST_OF_DeepSeek-R1-BF16_GGUF>.gguf Q4_K_M
```
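
Once quantized, the split GGUF can be served with the PR build. A minimal sketch follows; the model path is a placeholder for whatever filename `llama-quantize` produces for the first split, and the context size is just an example:

```bash
# Point llama-server at the first split; the remaining shards are picked up automatically.
./llama.cpp/build/bin/llama-server \
  -m checkpoints/DeepSeek-R1-BF16/<FIRST_Q4_K_M_SPLIT>.gguf \
  --ctx-size 8192
```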

It took 16 hours on an EC2 instance, so I figured I'd share it.

Script Credit/Source: [daydream-org/DeepSeek-R1-GGUF-11446](https://huggingface.co/daydream-org/DeepSeek-R1-GGUF-11446/discussions/1#67a327570051a98a96ded9e6)