Huh?

#1 opened by Nycnt

What even is this?

So many downloads and it's only been up a few days.
What the heck?

I see what it is: some Chinese folks are using ktransformers to load the small unsloth R1 quants for supposedly faster inference than llama.cpp. I was trying to do it myself and realized ktransformers is strange and requires a bunch of extra config files, so this person went ahead and packaged them up.

https://github.com/kvcache-ai/ktransformers/issues/186#issuecomment-2655656578

They copy-pasted the unsloth/DeepSeek-R1 UD-IQ1_S quant and added the few config files needed to get ktransformers going, e.g.:

python -m ktransformers.local_chat --model_name is210379/DeepSeek-R1-UD-IQ1_S --gguf_path /media/intel/ollamaModel/DeepSeek-R1-UD-IQ1_S --cpu_infer 60 --max_new_tokens 1000 --force_think true

https://kvcache-ai.github.io/ktransformers/en/DeepseekR1_V3_tutorial.html

Folks using transformers are hitting the same issue over in the real repo here: https://huggingface.co/unsloth/DeepSeek-R1-GGUF/discussions/32

YES!!!

I got it working, and it seems faster than llama.cpp in very early tests, though I'm unsure about context length etc. It also seems like flash attention might be working?? (that isn't in llama.cpp yet??)

tl;dr:
Put all the json and py files from this repo into the directory with your unsloth GGUF files (there's a sketch below the listing for pulling just those files), e.g.:

ls /mnt/raid/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/
config.json                                 DeepSeek-R1-UD-Q2_K_XL-00002-of-00005.gguf  DeepSeek-R1-UD-Q2_K_XL-00005-of-00005.gguf  tokenizer.json
configuration_deepseek.py                   DeepSeek-R1-UD-Q2_K_XL-00003-of-00005.gguf  generation_config.json
DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf  DeepSeek-R1-UD-Q2_K_XL-00004-of-00005.gguf  tokenizer_config.json
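
If you'd rather script pulling just those extra json/py files into your GGUF directory, something like this should work (a rough sketch using huggingface_hub's snapshot_download; the repo id is taken from the command above and the local path from the listing, so treat both as placeholders and adjust for your setup):

# Sketch: download only the ktransformers config/tokenizer files from the repo,
# skipping the GGUF shards you already have from unsloth.
# The repo_id and local_dir here are assumptions -- change them to match your setup.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="is210379/DeepSeek-R1-UD-IQ1_S",
    allow_patterns=["*.json", "*.py"],   # only the config/tokenizer files, no GGUF shards
    local_dir="/mnt/raid/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/",
)

Then point local_chat at that directory: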
python ./ktransformers/local_chat.py \
    --gguf_path "/mnt/raid/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/" \
    --model_path "/mnt/raid/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/" \
    --prompt_file ./p.txt \
    --cpu_infer 24 \
    --max_new_tokens 1024 \
    --force_think true \
    --port 8000 \
    --web True
prompt eval count:    9 token(s)
prompt eval duration: 0.8497006893157959s
prompt eval rate:     10.591965045064358 tokens/s
eval count:           300 token(s)
eval duration:        26.647690773010254s
eval rate:            11.25801115584285 tokens/s
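
(For reference, those rates are just token count divided by wall time: 300 tokens / 26.65 s ≈ 11.26 tokens/s for generation, and 9 tokens / 0.85 s ≈ 10.59 tokens/s for prompt eval.)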

@Nycnt

I had to patch a bug but got it kind of working; I put together a quick guide here:

https://github.com/ubergarm/r1-ktransformers-guide

It does seem faster than llama.cpp while using much less VRAM right now... though I haven't spent much time with the custom llama.cpp branch that selectively offloads experts.

It seems to get stuck in a loop while responding, and there are some errors if the chat continues, so I'm not sure it's really working quite right yet.. lol
