---
license: gemma
---

EXL2 quants of [gemma-2-27b-it](https://huggingface.co/google/gemma-2-27b-it)

My quants are meant to be a tight fit in 24 GB VRAM. The VRAM usage numbers below assume **8k context**.

bpw|head|4 bit cache|16 bit cache|Notes
--:|--:|--:|--:|:--
[**5.8**](https://huggingface.co/mo137/gemma-2-27b-it-exl2/tree/5.8_h8)|**8 bit**|21.85 GB|**23.69 GB**|16 bit cache, but lower BPW
[**6.5**](https://huggingface.co/mo137/gemma-2-27b-it-exl2/tree/6.5_h8)|**8 bit**|**23.81 GB**|25.65 GB|my recommendation
[**6.6**](https://huggingface.co/mo137/gemma-2-27b-it-exl2/tree/6.6_h6)|6 bit|**23.86 GB**|25.70 GB|slightly higher BPW, but less precise head

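Each quant lives on its own branch of the repo, named after the bpw and head bits as in the links above (for example `6.5_h8`). A minimal sketch of fetching one branch, assuming the `huggingface_hub` package is installed; the helper names `branch_name` and `download_quant` are made up for illustration:

```python
REPO_ID = "mo137/gemma-2-27b-it-exl2"


def branch_name(bpw: str, head_bits: int) -> str:
    """Build the branch name used in the table links, e.g. '6.5_h8'."""
    return f"{bpw}_h{head_bits}"


def download_quant(bpw: str, head_bits: int, local_dir: str) -> str:
    """Download one quant branch and return the local path (hypothetical helper)."""
    # Imported lazily so branch_name() works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        revision=branch_name(bpw, head_bits),  # branch to fetch
        local_dir=local_dir,
    )


# Example (needs network access):
# path = download_quant("6.5", 8, "models/gemma-2-27b-it-exl2-6.5_h8")
```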
For this model, the difference between a 6 bit and an 8 bit head is ~300 MB, which is not huge. It could be exchanged for about 0.1 bpw in the body.

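The head vs. body trade-off can be sanity-checked with some quick arithmetic. This is only a rough sketch: the vocabulary size, hidden size, and body weight count below are assumptions about gemma-2-27b-it, not numbers from this README, and actual EXL2 file sizes include some overhead.

```python
# Assumed architecture values (not stated in this README):
VOCAB_SIZE = 256_000   # assumed vocabulary size
HIDDEN_SIZE = 4_608    # assumed hidden dimension
BODY_PARAMS = 26e9     # assumed rough number of weights outside the head


def head_bytes(bits_per_weight: float) -> float:
    """Approximate size of the output head at a given bit width."""
    return VOCAB_SIZE * HIDDEN_SIZE * bits_per_weight / 8


def body_bytes(bits_per_weight: float) -> float:
    """Approximate size of the body at a given average bpw."""
    return BODY_PARAMS * bits_per_weight / 8


head_saving = head_bytes(8) - head_bytes(6)  # dropping the head from 8 to 6 bit
body_cost = body_bytes(0.1)                  # adding 0.1 bpw to the body

print(f"8 bit vs 6 bit head: {head_saving / 1e6:.0f} MB")
print(f"0.1 bpw in the body: {body_cost / 1e6:.0f} MB")
```

Under these assumptions both numbers come out close to 300 MB, which is consistent with the trade-off described above.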
---

Check out turboderp's quants & measurement.json:

[3.00 bits per weight](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/3.0bpw)

[3.50 bits per weight](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/3.5bpw)

[4.00 bits per weight](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/4.0bpw)

[4.50 bits per weight](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/4.5bpw)

[5.00 bits per weight](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/5.0bpw)

[6.00 bits per weight](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/6.0bpw)

[8.00 bits per weight](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/8.0bpw)

[measurement.json](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/blob/main/measurement.json)