EXL2 quants of gemma-2-27b-it

My quants are meant to be a tight fit in 24 GB VRAM. The following VRAM usage numbers assume 8k context.

| bpw | head | 4-bit cache | 16-bit cache | Notes |
|-----|------|-------------|--------------|-------|
| 5.8 | 8 bit | 21.85 GB | 23.69 GB | fits with 16-bit cache, but lower bpw |
| 👉 6.5 | 8 bit | 23.81 GB | 25.65 GB | 👈 my recommendation |
| 6.6 | 6 bit | 23.86 GB | 25.70 GB | slightly higher bpw, but less precise head |

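The "4-bit cache" column assumes ExLlamaV2's quantized Q4 KV cache. Below is a minimal loading sketch in Python, assuming a recent exllamav2 release and a hypothetical local path for the 6.5 bpw download; swap `ExLlamaV2Cache_Q4` for `ExLlamaV2Cache` to reproduce the 16-bit cache numbers instead.

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Hypothetical local path to the downloaded 6.5 bpw quant
config = ExLlamaV2Config("/models/gemma-2-27b-it-exl2-6.5bpw")
config.max_seq_len = 8192  # 8k context, matching the table above

model = ExLlamaV2(config)
# Q4 KV cache: this is what keeps 6.5 bpw under 24 GB at 8k context.
# Use ExLlamaV2Cache here instead for a full 16-bit cache.
cache = ExLlamaV2Cache_Q4(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(prompt="Why is the sky blue?", max_new_tokens=128))
```
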
For this model, the difference between a 6-bit and an 8-bit head is only ~300 MB, which is not huge. That VRAM can instead be exchanged for about 0.1 bpw in the body.
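
If you want to tune that trade-off yourself, exllamav2's convert.py exposes both knobs: `-b` sets the body bits per weight and `-hb` the head bits (6 or 8). A sketch with hypothetical paths, reusing the measurement.json linked below to skip the measurement pass:

```sh
# -i: original FP16 weights (hypothetical path), -o: scratch/working dir,
# -cf: compiled output dir, -b: body bpw, -hb: head bits (6 or 8),
# -m: reuse an existing measurement.json to skip the measurement pass
python convert.py -i /models/gemma-2-27b-it -o /tmp/exl2-work \
    -cf /models/gemma-2-27b-it-exl2-6.5bpw -b 6.5 -hb 8 -m measurement.json
```

Dropping to `-hb 6` frees roughly the ~300 MB mentioned above, which you could spend as `-b 6.6`.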


Check out turboderp's quants & measurement.json:
- 3.00 bits per weight
- 3.50 bits per weight
- 4.00 bits per weight
- 4.50 bits per weight
- 5.00 bits per weight
- 6.00 bits per weight
- 8.00 bits per weight
- measurement.json
