Context length 32k tokens?
Why does this GGUF's description state a 32k context length? The original model card at https://huggingface.co/Qwen/Qwen2.5-72B-Instruct states a 131k context length. What happened?
Does llama.cpp support YaRN yet? If it does, enabling YaRN as described in the original model's model card should extend the context length.
Should it? I have never heard of YaRN. I searched the llama.cpp GitHub repo for issues and found nothing, neither open nor closed. If it is supported, my original question stands: why does the description still say 32k context length?
A 128K context length needs YaRN (that is the configuration we have tested). No YaRN, no 128K.
Other methods of extending the context length may also work, but we don't really know.
llama.cpp had some form of YaRN support merged before Nov 4, 2023: https://github.com/ggerganov/llama.cpp/discussions/2963#discussioncomment-7475016
I suggest directing queries to the llama.cpp discussions or issues pages on GitHub.
I also found some discussion here: https://github.com/ggerganov/llama.cpp/discussions/7416
Awesome. So there is no reason to state 32k in the description if llama.cpp has supported YaRN since 11/2023 and 128K by default.
Even if it is supported, you need to enable it. It is not on by default. For example:
--rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
Argument | Explanation |
---|---|
--rope-scaling {none,linear,yarn} | RoPE frequency scaling method, defaults to linear unless specified by the model (env: LLAMA_ARG_ROPE_SCALING_TYPE) |
--rope-scale N | RoPE context scaling factor, expands context by a factor of N (env: LLAMA_ARG_ROPE_SCALE) |
--rope-freq-base N | RoPE base frequency, used by NTK-aware scaling (default: loaded from model) (env: LLAMA_ARG_ROPE_FREQ_BASE) |
--rope-freq-scale N | RoPE frequency scaling factor, expands context by a factor of 1/N (env: LLAMA_ARG_ROPE_FREQ_SCALE) |
--yarn-orig-ctx N | YaRN: original context size of model (default: 0 = model training context size) (env: LLAMA_ARG_YARN_ORIG_CTX) |
--yarn-ext-factor N | YaRN: extrapolation mix factor (default: -1.0, 0.0 = full interpolation) (env: LLAMA_ARG_YARN_EXT_FACTOR) |
--yarn-attn-factor N | YaRN: scale sqrt(t) or attention magnitude (default: 1.0) (env: LLAMA_ARG_YARN_ATTN_FACTOR) |
--yarn-beta-slow N | YaRN: high correction dim or alpha (default: 1.0) (env: LLAMA_ARG_YARN_BETA_SLOW) |
--yarn-beta-fast N | YaRN: low correction dim or beta (default: 32.0) (env: LLAMA_ARG_YARN_BETA_FAST) |
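
For anyone wondering where the `4` in `--rope-scale 4` comes from: it appears to be just the target context divided by the original one (131072 / 32768). Here is a rough Python sketch that derives the flags from those two numbers. The helper name `yarn_flags` is made up for illustration, and `--ctx-size` is not mentioned in this thread; I am assuming you also have to request the larger context window explicitly, so check the help output of your llama.cpp build.

```python
def yarn_flags(target_ctx: int, orig_ctx: int = 32768) -> list[str]:
    """Build llama.cpp-style YaRN flags for a target context length.

    yarn_flags is a hypothetical helper, not part of llama.cpp.
    orig_ctx defaults to the 32768 used in the thread above.
    """
    if target_ctx <= 0 or orig_ctx <= 0:
        raise ValueError("context sizes must be positive")

    # --rope-scale expands the context by a factor of N,
    # so N is the ratio of target to original context.
    scale = target_ctx / orig_ctx

    return [
        # Assumption: --ctx-size is not from this thread; verify it
        # against your llama.cpp build before relying on it.
        "--ctx-size", str(target_ctx),
        "--rope-scaling", "yarn",
        "--rope-scale", f"{scale:g}",  # 4.0 is rendered as "4"
        "--yarn-orig-ctx", str(orig_ctx),
    ]


if __name__ == "__main__":
    # prints: --ctx-size 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
    print(" ".join(yarn_flags(131072)))
```

The resulting list can be appended to whatever llama.cpp command you already run. Only the 4x factor is what was reported as tested here; other ratios are untested guesses.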