YaRN block required?
I noticed that the config.json here has a 128k context size (as you would have with the YaRN settings enabled for Qwen 2.5 models) but no YaRN-specific config like:
"rope_scaling": {
"factor": 4.0,
"original_max_position_embeddings": 32768,
"type": "yarn"
}
I imagine we should add these, since you did not in fact change the original max_position_embeddings, right?
Good question!
Also, please tell me: after quantization to GGUF, will the maximum context size be 32k or the one specified in the max_position_embeddings parameter?
So, I know that you would not be able to access the full 128k context on a llama.cpp runtime without the settings I provided, unless they revised the architecture here.
I made the pull request that gave llama.cpp the ability to run the full 128k YaRN context with the Qwen2ForCausalLM model family (or really, I just reused code from elsewhere to enable it). That's why I was asking: I know that llama.cpp will not apply the YaRN approach just from "max_position_embeddings": 131072 plus some equivalent of their example --max-model-len 32768 to infer the YaRN scaling. I am actually pretty surprised that vLLM does that; it tightly couples those settings to YaRN by default (and the traditional type, not, for example, the Phi type) when scaling.
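For reference, a minimal sketch of requesting the full window explicitly at runtime with llama.cpp's standard YaRN flags (the model filename and exact values here are placeholders based on this thread, not taken from the repo):

```sh
# Placeholder model name; enable YaRN explicitly, since llama.cpp will not
# infer it from max_position_embeddings alone.
./llama-cli -m qwen2.5-q4_k_m.gguf \
    -c 131072 \
    --rope-scaling yarn \
    --yarn-orig-ctx 32768
```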