Experimenting with dynamic quantization
#1 · pinned · opened by Lunzima
https://gist.github.com/lunzima/dbca4281acf7c6bb0100e26a0a51de06
This patch modifies the `llama_tensor_get_type` function to optimize the quantization strategy for different tensor types in the model. The main changes are:
FFN Layer Quantization:
- Added specific quantization types for `ffn_down`, `ffn_gate`, and `ffn_up` tensors depending on their layer position: the first few layers use higher precision, while the others use lower precision (see the sketch after this list).
- Introduced quantization types for the shared expert layers (`ffn_down_shexp` and `ffn_gate_shexp`) with higher efficiency.
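For a rough feel of the layer-position idea, here is a minimal standalone sketch. The enum values, the `n_layer / 8` cutoff, and the helper name are illustrative assumptions, not the thresholds or types actually chosen in the gist.

```cpp
#include <string>

// Stand-in for a few ggml quantization types; illustrative only.
enum class QType { Q4_K, Q5_K, Q6_K, Q8_0 };

// Pick a type for an FFN tensor from its name and layer position.
// "Early" layers (here: the first eighth) get more bits.
static QType ffn_tensor_type(const std::string & name, int i_layer, int n_layer) {
    const bool early_layer = i_layer < n_layer / 8;
    // Check the shared-expert tensors first, since their names also contain
    // the plain "ffn_down"/"ffn_gate" substrings.
    if (name.find("ffn_down_shexp") != std::string::npos ||
        name.find("ffn_gate_shexp") != std::string::npos) {
        return QType::Q5_K;
    }
    if (name.find("ffn_down") != std::string::npos) {
        return early_layer ? QType::Q6_K : QType::Q4_K;
    }
    if (name.find("ffn_gate") != std::string::npos ||
        name.find("ffn_up")   != std::string::npos) {
        return early_layer ? QType::Q5_K : QType::Q4_K;
    }
    return QType::Q4_K; // default for anything else
}
```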
Attention Layer Quantization:
- Improved quantization type allocation for `attn_v.weight` based on model architecture and parameters.
- Specified quantization types for the MLA projection matrices (`attn_kv_a_mqa.weight`, `attn_kv_b.weight`, `attn_q_a.weight`, and `attn_q_b.weight`); a sketch follows this list.
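The sketch below shows one way such routing could look. Which types the gist actually assigns, and how it detects MLA models, is defined in the patch; the `has_mla` flag, the GQA heuristic, and the chosen types here are assumptions for illustration.

```cpp
#include <string>

// Stand-in for a few ggml quantization types; illustrative only.
enum class QType { Q4_K, Q5_K, Q6_K, Q8_0 };

static QType attn_tensor_type(const std::string & name, bool has_mla, int n_gqa) {
    // MLA projection matrices (DeepSeek-style attention): they are small
    // relative to the whole model, so keeping them at high precision is cheap.
    if (has_mla && (name.find("attn_kv_a_mqa") != std::string::npos ||
                    name.find("attn_kv_b")     != std::string::npos ||
                    name.find("attn_q_a")      != std::string::npos ||
                    name.find("attn_q_b")      != std::string::npos)) {
        return QType::Q8_0;
    }
    // attn_v.weight: give it more bits when many query heads share one KV head,
    // since quantization error in V then affects several heads at once.
    if (name.find("attn_v") != std::string::npos) {
        return n_gqa >= 4 ? QType::Q6_K : QType::Q5_K;
    }
    return QType::Q4_K; // other attention tensors keep the default
}
```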
Model Architecture and Parameter Configuration:
- Adjusted quantization types for `attn_output.weight` based on model architecture and parameters.
- Included a check for `is_one_bit` to differentiate quantization strategies for different model types (see the sketch after this list).
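How `is_one_bit` is defined and used is specified by the gist; the sketch below only illustrates the general shape of such a check, with placeholder ftype names and placeholder fallback types.

```cpp
// Stand-ins for the target file type and per-tensor quantization type;
// the names mirror common llama.cpp types but are placeholders here.
enum class FType { IQ1_S, IQ1_M, Q2_K, Q4_K_M };
enum class QType { IQ2_XXS, IQ3_XXS, Q4_K, Q5_K };

// Assumed meaning: is the overall quantization target in the ~1-bit range?
static bool is_one_bit(FType ftype) {
    return ftype == FType::IQ1_S || ftype == FType::IQ1_M;
}

// Sketch of splitting the attn_output.weight strategy on that check.
static QType attn_output_type(FType ftype, bool is_moe) {
    if (is_one_bit(ftype)) {
        // At extremely low bit budgets, attn_output is comparatively sensitive,
        // so spend a few extra bits there rather than on the much larger FFN.
        return is_moe ? QType::IQ3_XXS : QType::IQ2_XXS;
    }
    return is_moe ? QType::Q5_K : QType::Q4_K;
}
```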
Code Structure:
- Added more `else if` conditions and function calls to clarify the quantization logic and improve maintainability (the dispatch pattern is illustrated below).
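Putting the pieces together, the overall shape is an `else if` chain keyed on the tensor name that delegates to small helpers like the hypothetical ones sketched above. The version below is a self-contained, simplified stand-in that inlines those decisions; it is not the patched `llama_tensor_get_type` itself.

```cpp
#include <string>

enum class QType { Q4_K, Q5_K, Q6_K }; // illustrative stand-in types

// Simplified dispatch: match on the tensor name, then refine by layer position.
static QType get_type(const std::string & name, int i_layer, int n_layer) {
    QType t = QType::Q4_K; // default for tensors not handled below
    if (name.find("attn_output") != std::string::npos) {
        t = QType::Q5_K;
    } else if (name.find("attn_v") != std::string::npos) {
        t = QType::Q6_K;
    } else if (name.find("ffn_down") != std::string::npos) {
        t = (i_layer < n_layer / 8) ? QType::Q6_K : QType::Q4_K;
    } else if (name.find("ffn_gate") != std::string::npos ||
               name.find("ffn_up")   != std::string::npos) {
        t = QType::Q5_K;
    }
    return t;
}
```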
Overall, the patch refines the quantization process to enhance model performance and efficiency, especially in
complex architectures.
Lunzima pinned discussion