Unable to enable flash_attn when deploying with llama.cpp
#5 · opened by gimling
The log shows:
llama_new_context_with_model: flash_attn requires n_embd_head_k == n_embd_head_v - forcing off
llm_load_print_meta: n_embd_head_k = 96
llm_load_print_meta: n_embd_head_v = 64
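For illustration, here is a minimal sketch (not the actual llama.cpp source) of the kind of guard that produces the "forcing off" message above; the struct and function names below are hypothetical, while the real check lives in llama_new_context_with_model.

```cpp
// Sketch of a guard that downgrades a flash-attention request when the
// per-head K and V sizes differ. Names here are hypothetical.
#include <cstdio>

struct model_hparams {
    int n_embd_head_k;  // per-head K dimension (96 for MiniCPM3 per the log)
    int n_embd_head_v;  // per-head V dimension (64 for MiniCPM3 per the log)
};

static bool resolve_flash_attn(const model_hparams & hp, bool requested) {
    if (!requested) {
        return false;
    }
    if (hp.n_embd_head_k != hp.n_embd_head_v) {
        // The fused flash-attention path assumes equal K/V head sizes,
        // so the request is silently downgraded to the regular path.
        fprintf(stderr,
                "flash_attn requires n_embd_head_k == n_embd_head_v - forcing off\n");
        return false;
    }
    return true;
}

int main() {
    const model_hparams minicpm3 = { 96, 64 };
    const bool flash_attn = resolve_flash_attn(minicpm3, /*requested=*/true);
    printf("flash_attn enabled: %s\n", flash_attn ? "yes" : "no");
    return 0;
}
```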
MiniCPM3 uses MLA (multi-head latent attention) instead of standard multi-head attention or grouped-query attention. The current flash-attn implementation in llama.cpp doesn't support it. We will create a pull request in the official repo soon.
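As a side note, unequal K/V head sizes are mathematically fine for attention itself, which is why a kernel-level fix is possible: the attention scores only depend on the K dimension, and the output inherits its width from V. The sketch below walks through one simplified head with the dimensions from the log (96 for K, 64 for V); the single-query, single-head setup is an assumption for brevity, not how the model is actually laid out.

```cpp
// Plain single-head attention with d_k != d_v: the scores use d_k,
// the output has width d_v. Dimensions follow the log above.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int n_kv = 4;   // number of cached key/value positions
    const int d_k  = 96;  // per-head K (and Q) dimension
    const int d_v  = 64;  // per-head V dimension

    std::vector<float> q(d_k, 0.01f);         // one query vector
    std::vector<float> k(n_kv * d_k, 0.02f);  // keys,   n_kv x d_k
    std::vector<float> v(n_kv * d_v, 0.03f);  // values, n_kv x d_v

    // scores[i] = (q . k_i) / sqrt(d_k), then softmax over the positions
    std::vector<float> scores(n_kv, 0.0f);
    float max_s = -1e30f;
    for (int i = 0; i < n_kv; ++i) {
        float s = 0.0f;
        for (int j = 0; j < d_k; ++j) s += q[j] * k[i * d_k + j];
        scores[i] = s / std::sqrt((float) d_k);
        if (scores[i] > max_s) max_s = scores[i];
    }
    float sum = 0.0f;
    for (int i = 0; i < n_kv; ++i) { scores[i] = std::exp(scores[i] - max_s); sum += scores[i]; }
    for (int i = 0; i < n_kv; ++i) scores[i] /= sum;

    // out = sum_i scores[i] * v_i  -> has width d_v, not d_k
    std::vector<float> out(d_v, 0.0f);
    for (int i = 0; i < n_kv; ++i)
        for (int j = 0; j < d_v; ++j) out[j] += scores[i] * v[i * d_v + j];

    printf("output head size: %zu (matches n_embd_head_v)\n", out.size());
    return 0;
}
```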
neoz changed discussion status to closed