Cannot enable flash_attn when deploying with llama.cpp

#5
by gimling - opened

The log shows: llama_new_context_with_model: flash_attn requires n_embd_head_k == n_embd_head_v - forcing off

llm_load_print_meta: n_embd_head_k = 96
llm_load_print_meta: n_embd_head_v = 64

OpenBMB org

MiniCPM3 uses MLA (multi-head latent attention) instead of standard multi-head attention or grouped-query attention, so its per-head key dimension (96) differs from its per-head value dimension (64). The current flash-attn implementation in llama.cpp doesn't support this case, which is why it is forced off. We will create a pull request in the official repo soon.
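
For reference, here is a minimal C++ sketch of the kind of check that produces the warning in the log above. It is not llama.cpp's actual source; the struct and function names are hypothetical, and only the two head-dimension values are taken from the log.

```cpp
// Sketch of the guard that disables flash attention when the per-head
// K and V dimensions differ, as happens with MLA models like MiniCPM3.
#include <cstdint>
#include <cstdio>

struct hparams_sketch {
    uint32_t n_embd_head_k;  // per-head key dimension (96 in the log above)
    uint32_t n_embd_head_v;  // per-head value dimension (64 in the log above)
};

// Returns whether flash attention stays enabled after the dimension check.
static bool resolve_flash_attn(const hparams_sketch & hp, bool flash_attn_requested) {
    if (flash_attn_requested && hp.n_embd_head_k != hp.n_embd_head_v) {
        std::fprintf(stderr,
            "flash_attn requires n_embd_head_k == n_embd_head_v - forcing off\n");
        return false;  // fall back to the regular attention path
    }
    return flash_attn_requested;
}

int main() {
    const hparams_sketch minicpm3 = { 96, 64 };          // values reported in the log
    const bool fa = resolve_flash_attn(minicpm3, true);  // user requested flash attention
    std::printf("flash_attn enabled: %s\n", fa ? "yes" : "no");
    return 0;
}
```

In other words, the fallback is expected behavior for this model until llama.cpp's flash-attn path handles mismatched K/V head dimensions; generation still works through the regular attention path.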

neoz changed discussion status to closed
