Experimenting with dynamic quantization

#1
by Lunzima - opened

https://gist.github.com/lunzima/dbca4281acf7c6bb0100e26a0a51de06

This patch modifies the llama_tensor_get_type function to optimize the quantization strategy for different
tensor types in the model. The main changes are:

  1. FFN Layer Quantization:

    • Added layer-position-dependent quantization types for the ffn_down, ffn_gate, and ffn_up tensors:
      the first few layers use higher precision, while the remaining layers use lower precision (see the
      sketch after this list).
    • Introduced dedicated quantization types for the shared-expert tensors (ffn_down_shexp and
      ffn_gate_shexp) so they are quantized more efficiently.
  2. Attention Layer Quantization:

    • Improved quantization type allocation for attn_v.weight based on model architecture and parameters.
    • Specified quantization types for MLA projection matrices (attn_kv_a_mqa.weight, attn_kv_b.weight,
      attn_q_a.weight, and attn_q_b.weight).
  3. Model Architecture and Parameter Configuration:

    • Adjusted quantization types for attn_output.weight based on model architecture and parameters.
    • Included a check for is_one_bit to differentiate quantization strategies for different model types.
  4. Code Structure:

    • Added further else-if branches and helper calls to make the quantization logic clearer and easier
      to maintain.
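
For a feel of what this branching looks like, here is a minimal, self-contained sketch of that kind of
per-tensor type selection. It is not the patch itself: the real change lives inside llama.cpp's
llama_tensor_get_type, which works with ggml_type values and the quantization state. The qtype enum, the
model_info struct, the pick_tensor_type helper, the has_mla flag, and all thresholds and type choices below
are illustrative assumptions, not the gist's actual values.

```cpp
#include <algorithm>
#include <string>

enum class qtype { Q3_K, Q4_K, Q5_K, Q6_K, Q8_0 };

struct model_info {
    int  n_layer    = 0;     // total number of transformer layers
    bool is_one_bit = false; // assumed flag mirroring the is_one_bit check mentioned above
    bool has_mla    = false; // assumed flag: true for MLA-style attention (DeepSeek-like models)
};

// Pick a quantization type for a tensor from its name and layer index.
static qtype pick_tensor_type(const std::string & name, int i_layer, const model_info & m) {
    // "First few layers" threshold -- purely illustrative.
    const bool early_layer = i_layer < std::max(1, m.n_layer / 8);

    // 1. FFN tensors: shared-expert tensors get their own types; the rest
    //    depend on layer position (more precision early, less later).
    if (name.find("ffn_down_shexp") != std::string::npos ||
        name.find("ffn_gate_shexp") != std::string::npos) {
        return qtype::Q5_K;
    }
    if (name.find("ffn_down") != std::string::npos) {
        return early_layer ? qtype::Q6_K : qtype::Q4_K;
    }
    if (name.find("ffn_gate") != std::string::npos ||
        name.find("ffn_up")   != std::string::npos) {
        return early_layer ? qtype::Q5_K : qtype::Q3_K;
    }

    // 2. Attention tensors: attn_v and the MLA projection matrices are treated
    //    separately from the rest of the attention block.
    if (name.find("attn_v.weight") != std::string::npos) {
        return m.is_one_bit ? qtype::Q4_K : qtype::Q6_K;
    }
    if (name.find("attn_kv_a_mqa.weight") != std::string::npos ||
        name.find("attn_kv_b.weight")     != std::string::npos ||
        name.find("attn_q_a.weight")      != std::string::npos ||
        name.find("attn_q_b.weight")      != std::string::npos) {
        return m.has_mla ? qtype::Q6_K : qtype::Q4_K;
    }

    // 3. attn_output depends on architecture/low-bit mode; everything else
    //    falls back to a default type.
    if (name.find("attn_output.weight") != std::string::npos) {
        return m.is_one_bit ? qtype::Q4_K : qtype::Q5_K;
    }
    return qtype::Q4_K;
}
```

Note that the shared-expert check has to come before the plain ffn_down/ffn_gate checks, because those
substrings also occur inside the shexp tensor names.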

Overall, the patch refines the quantization process to improve model quality and efficiency, especially for
mixture-of-experts and MLA-based architectures.
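
Since all of the changes described above live inside llama_tensor_get_type, trying the patch should just be a
matter of applying the gist to a llama.cpp checkout, rebuilding, and quantizing as usual, e.g. with something
like ./llama-quantize model-f16.gguf model-quant.gguf Q4_K_M (the tool is called quantize on older builds); no
new command-line options should be needed.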
