---
license: gemma
---
EXL2 quants of [gemma-2-27b-it](https://huggingface.co/google/gemma-2-27b-it)
My quants are meant to fit tightly in 24 GB of VRAM. The following VRAM usage numbers assume **8k context**.

bpw|head bits|VRAM, 4-bit cache|VRAM, 16-bit cache
--:|--:|--:|--:
πŸ‘‰ [**5.8**](https://huggingface.co/mo137/gemma-2-27b-it-exl2/tree/5.8bpw_h8)|8 bit|21.85 GB|23.69 GB
πŸ‘‰ [**6.5**](https://huggingface.co/mo137/gemma-2-27b-it-exl2/tree/6.5bpw_h8)|8 bit|23.81 GB|25.65 GB
For this model, the difference between a 6-bit and an 8-bit head is only ~300 MB, which is not huge; it could be traded for about 0.1 bpw in the body.
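
To load one of these quants, here is a minimal sketch, assuming the `exllamav2` and `huggingface_hub` Python packages are installed; the `revision` name matches the branch links in the table above:

```python
from huggingface_hub import snapshot_download
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer

# Download the 5.8 bpw branch (see the table above for VRAM estimates).
model_dir = snapshot_download("mo137/gemma-2-27b-it-exl2", revision="5.8bpw_h8")

config = ExLlamaV2Config()
config.model_dir = model_dir
config.prepare()
config.max_seq_len = 8192  # the 8k context assumed by the table

model = ExLlamaV2(config)
# 4-bit cache, as in the first VRAM column; use ExLlamaV2Cache for 16-bit.
cache = ExLlamaV2Cache_Q4(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)
```
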
---
Check out turboderp's quants & `measurement.json`:

- [3.00 bits per weight](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/3.0bpw)
- [3.50 bits per weight](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/3.5bpw)
- [4.00 bits per weight](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/4.0bpw)
- [4.50 bits per weight](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/4.5bpw)
- [5.00 bits per weight](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/5.0bpw)
- [6.00 bits per weight](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/6.0bpw)
- [8.00 bits per weight](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/8.0bpw)
- [measurement.json](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/blob/main/measurement.json)
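
The `measurement.json` lets you skip the slow measurement pass when making your own EXL2 quants. A minimal sketch, assuming a local exllamav2 checkout and its `convert.py` script; the paths and the 5.8 bpw / 8-bit head values are illustrative:

```python
import subprocess

# Re-quantize using turboderp's measurement.json instead of re-measuring.
subprocess.run([
    "python", "exllamav2/convert.py",
    "-i", "gemma-2-27b-it",          # original FP16 model directory (assumed path)
    "-o", "work",                    # scratch/working directory
    "-cf", "gemma-2-27b-it-5.8bpw",  # output directory for the quantized model
    "-m", "measurement.json",        # reuse the downloaded measurement
    "-b", "5.8",                     # bits per weight (body)
    "-hb", "8",                      # head bits
], check=True)
```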