YaRN block required?
I noticed that the config.json here has a 128k context size (as you would have with the YaRN settings enabled for Qwen 2.5 models) but no YaRN-specific config like:
"rope_scaling": {
"factor": 4.0,
"original_max_position_embeddings": 32768,
"type": "yarn"
}
I imagine we should add these, since you did not in fact change the original max_position_embeddings, right?
Good question!
Also, please tell me: after quantization to GGUF, will the maximum context size be 32k or the one specified in the max_position_embeddings parameter?
So, I know that you would not be able to access the full 128k context on a llama.cpp runtime without the settings I provided, unless they revised the architecture here.
I made the pull request that gave llama.cpp the ability to run the full 128k YaRN context with the Qwen2ForCausalLM model family (or really, I just reused code from elsewhere to enable it). That's why I was asking: I know that llama.cpp will not apply the YaRN approach just from "max_position_embeddings": 131072 plus some equivalent of their example --max-model-len 32768 to infer the YaRN scaling. I am actually pretty surprised that vLLM does that; it tightly couples those settings to YaRN by default (and the traditional type, not, for example, the Phi type) when scaling.
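For reference, a minimal sketch of requesting the full window explicitly at runtime with llama.cpp's standard YaRN flags (the model filename and exact values here are placeholders based on this thread, not taken from the repo):

```sh
# Placeholder model name; enable YaRN explicitly, since llama.cpp will not
# infer it from max_position_embeddings alone.
./llama-cli -m qwen2.5-q4_k_m.gguf \
    -c 131072 \
    --rope-scaling yarn \
    --yarn-orig-ctx 32768
```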