Update README.md

---
license: gemma
---
EXL2 quants of [gemma-2-27b-it](https://huggingface.co/google/gemma-2-27b-it)

My quants are meant to be a tight fit in 24 GB VRAM.

- [**5.8** bpw & **8** bpw head](https://huggingface.co/mo137/gemma-2-27b-it-exl2/tree/5.8bpw_h8)
  should use **21.85 GB VRAM** with 4 bit cache or **23.69 GB** with 16 bit cache
- [**6.5** bpw & **8** bpw head](https://huggingface.co/mo137/gemma-2-27b-it-exl2/tree/6.5bpw_h8)
  should use **23.81 GB VRAM** with 4 bit cache

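For reference, a minimal loading sketch (not part of this repo) showing where the 4 bit cache numbers come from, assuming a recent exllamav2 Python API and `huggingface_hub`; adjust names and versions to your setup:

```python
# Rough sketch, assuming exllamav2 and huggingface_hub are installed.
from huggingface_hub import snapshot_download
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Each quant lives on its own branch, so pass the branch name as the revision.
model_dir = snapshot_download("mo137/gemma-2-27b-it-exl2", revision="5.8bpw_h8")

config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)

# Q4 = 4 bit KV cache (the 21.85 GB figure); use plain ExLlamaV2Cache for 16 bit.
cache = ExLlamaV2Cache_Q4(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(prompt="Why fit a 27B model in 24 GB?", max_new_tokens=64))
```

Front ends built on exllamav2 usually expose the same choice as a cache-quantization option.
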
The difference between a 6 bit and an 8 bit head is ~300 MB, which is not huge. It could be exchanged for about 0.1 bpw in the body, so 6.6bpw_h6 should use about the same VRAM as 6.5bpw_h8.

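A rough back-of-the-envelope check of that trade-off, using approximate gemma-2-27b shapes (vocab ≈ 256k, hidden size 4608, ~27.2B total parameters; these are my assumptions, not figures from this card):

```python
# Sanity check of the ~300 MB / ~0.1 bpw trade-off (approximate, not file sizes).
head_params = 256_000 * 4608                  # output head: vocab x hidden ≈ 1.18e9
head_delta = head_params * (8 - 6) / 8 / 1e6  # 8 bit -> 6 bit head: ≈ 295 MB saved
body_params = 27.2e9 - head_params            # everything except the head
body_delta = body_params * 0.1 / 8 / 1e6      # +0.1 bpw on the body: ≈ 325 MB added
print(f"head 8->6 bit saves ≈ {head_delta:.0f} MB, +0.1 bpw costs ≈ {body_delta:.0f} MB")
```
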
---
Check out turboderp's quants & `measurement.json`:

- [3.00 bits per weight](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/3.0bpw)
- [3.50 bits per weight](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/3.5bpw)
- [4.00 bits per weight](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/4.0bpw)
- [4.50 bits per weight](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/4.5bpw)
- [5.00 bits per weight](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/5.0bpw)
- [6.00 bits per weight](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/6.0bpw)
- [8.00 bits per weight](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/8.0bpw)

[measurement.json](https://huggingface.co/turboderp/gemma-2-27b-it-exl2/blob/main/measurement.json)
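
If you want a size in between the branches above, the `measurement.json` can be reused to skip exllamav2's measurement pass. A hypothetical sketch, run from a checkout of the exllamav2 repo; the paths are placeholders and the `convert.py` flags are my reading of that repo, so double-check them:

```python
# Hypothetical: build your own EXL2 quant while reusing the published measurement.
import subprocess
from huggingface_hub import hf_hub_download

measurement = hf_hub_download("turboderp/gemma-2-27b-it-exl2", "measurement.json")

subprocess.run(
    [
        "python", "convert.py",
        "-i", "/models/gemma-2-27b-it",             # unquantized HF model (placeholder path)
        "-o", "/tmp/exl2-work",                     # scratch directory
        "-cf", "/models/gemma-2-27b-it-6.6bpw_h6",  # finished quant goes here
        "-b", "6.6",                                # target bits per weight (body)
        "-hb", "6",                                 # head bits
        "-m", measurement,                          # reuse the measurement pass
    ],
    check=True,
)
```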