Nexesenex committed
Commit 468b171 · verified · 1 Parent(s): d3edbf8

Update README.md

Files changed (1)
README.md +6 -5
README.md CHANGED
@@ -1,11 +1,11 @@
- E60 means embeddings in ggml_type q6_0, thus only compatible with IK_Llama.CPP and Croco.CPP until this q6_0 quant eventually reaches mainline.
- E50 or no mention means embeddings in mainline ggml_types (iq4_xs or Q5_0, usually)
+ E60/EQ60 means embeddings in ggml_type q6_0, thus only compatible with IK_Llama.CPP and Croco.CPP until this q6_0 quant eventually reaches mainline.
+ E50/EQ50 or no mention means embeddings in mainline ggml_types (iq4_xs or Q5_0, usually)
  
- For the IQ3 145L quant, the performances are close to IQ4_XS, with a few gigabytes shaved off.
+ For the IQ3 145L quantized model, the performances are close to IQ4_XS, with a few gigabytes shaved off.
  Best suited for 24-24-16GB GPU configs at 24-32k context with KV cache q6_0/q5_0
  Or for more, of course.
  
- For the IQ3_XSSL quant, the performances are probably akin to IQ3_XS.
+ For the IQ3_XSSL quantized model, the performances are probably akin to IQ3_XS.
  
  These quants are made for my own use, and I decided to share them.
  Nothing special about them, except that they suit my needs.
@@ -13,4 +13,5 @@ Nothing special about them, except that they suit my needs.
  Basically, my quant strategies obey a few rules diverging from mainline:
  - I often drop attn_q by one degree of quant, like mainline does for iq3_xxs in iq2_s, as well as attn_output.
  - I often up attn_k and attn_v by one degree of quant; mainline usually neglects those tensors too much in the GQA era.
- - I bump the embeddings, because they do not offload on the GPU (aside from BitNet and maybe Gemma).
+ - I bump the embeddings, because they do not offload on the GPU (aside from BitNet and maybe Gemma).
+ - I sometimes bump a whole FFN_down by one degree of quant, or drop some layers of FFN_up and FFN_gate by one degree of quant.
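
As a rough illustration of the per-tensor rules listed above, here is a minimal Python sketch. It is not llama.cpp's or ik_llama.cpp's actual quantization logic; the tensor names follow standard GGUF conventions, the iq3_xxs base type and the iq4_xs target for attn_k/attn_v are assumptions chosen only to make the "one degree up / one degree down" idea concrete.

```python
# Illustrative only: map GGUF tensor names to per-tensor quant types,
# mirroring the strategy described in this README (assumed type choices).

BASE_TYPE = "iq3_xxs"  # assumed nominal quant level for the model

OVERRIDES = {
    "token_embd.weight":  "q6_0",    # E60: embeddings bumped, they stay on CPU anyway
    "attn_q.weight":      "iq2_s",   # attn_q dropped one degree below the base
    "attn_output.weight": "iq2_s",   # attn_output dropped likewise
    "attn_k.weight":      "iq4_xs",  # attn_k bumped one degree (assumed target)
    "attn_v.weight":      "iq4_xs",  # attn_v bumped one degree (assumed target)
}

def pick_type(tensor_name: str) -> str:
    """Return the quant type for a tensor name, falling back to the base type."""
    for suffix, qtype in OVERRIDES.items():
        if tensor_name.endswith(suffix):
            return qtype
    return BASE_TYPE

# Tensor names as they typically appear in a GGUF file:
print(pick_type("blk.0.attn_q.weight"))    # -> iq2_s
print(pick_type("blk.0.ffn_down.weight"))  # -> iq3_xxs (base)
print(pick_type("token_embd.weight"))      # -> q6_0
```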