# EXL2 quants of gemma-2-27b-it
My quants are meant to be a tight fit in 24 GB of VRAM. The following VRAM usage numbers assume 8k context.
| bpw | head | VRAM (4-bit cache) | VRAM (16-bit cache) | Notes |
|---|---|---|---|---|
| 5.8 | 8 bit | 21.85 GB | 23.69 GB | fits with 16-bit cache, but lower BPW |
| 👉 6.5 | 8 bit | 23.81 GB | 25.65 GB | 👉 my recommendation |
| 6.6 | 6 bit | 23.86 GB | 25.70 GB | slightly higher BPW, but less precise head |
For this model the difference between a 6-bit and an 8-bit head is only ~300 MB, which is not huge. It could be traded for about 0.1 bpw in the body.
Check out turboderp's quants & measurement.json (a download sketch follows this list):

- 3.00 bits per weight
- 3.50 bits per weight
- 4.00 bits per weight
- 4.50 bits per weight
- 5.00 bits per weight
- 6.00 bits per weight
- 8.00 bits per weight