Request for another exl2 quant.

#1
by Clevyby - opened

Hello, I would like to try to use this on 15 GB of VRAM, and I estimate the necessary quant would be around 2-2.3 bpw, or equivalently a safetensors size of about 10.3 GB, which matches a 4 bpw quant of a 20B model. Do you think that's feasible quality-wise?
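(Back-of-the-envelope, assuming weight file size ≈ parameter count × bpw / 8 and ignoring the higher-precision output head and embeddings, so real files come out a bit larger:)

# Rough size estimate for an exl2 quant: params × bpw / 8 bytes
def est_size_gb(params_billion: float, bpw: float) -> float:
    return params_billion * 1e9 * bpw / 8 / 1e9

print(est_size_gb(34, 2.3))  # ~9.8 GB for a 34B at 2.3 bpw
print(est_size_gb(20, 4.0))  # ~10.0 GB for a 20B at 4.0 bpw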

Have you tried loading the 3.0bpw model? Here are the raw file sizes of the various bpw quants:

$ du -sh Smaug-34B-v0.1-*
13G     Smaug-34B-v0.1-3.0bpw-h6-exl2
17G     Smaug-34B-v0.1-4.0bpw-h6-exl2
20G     Smaug-34B-v0.1-4.65bpw-h6-exl2
21G     Smaug-34B-v0.1-5.0bpw-h6-exl2
25G     Smaug-34B-v0.1-6.0bpw-h6-exl2
33G     Smaug-34B-v0.1-8.0bpw-h8-exl2

2.4 bpw for 70B models is still coherent, but you do lose abilities. You want to get as many bits as possible, especially for smaller models. I would reduce the context to 2048, test the quality of the 3.0bpw model, and see how good it is for your use case.
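Something along these lines should work for a quick quality check; it's a sketch based on the exllamav2 example scripts (class names may differ slightly between versions), with the model path as a placeholder:

# Load the 3.0bpw quant with a reduced 2048-token context and generate a sample
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "Smaug-34B-v0.1-3.0bpw-h6-exl2"  # placeholder path
config.prepare()
config.max_seq_len = 2048  # shrink the context to save VRAM

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

print(generator.generate_simple("Write a short scene between two rivals:", settings, 200))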

I'll add 2.4bpw to the queue.

Thanks, I'll try it later. When I ran a 20B at 4 bpw with 6k context and the 8-bit cache, I reached a usage of 14.7/15 GB. I tried a 4x7B at 4 bpw once, but I could only run it at 3 bpw, which is why I specified the safetensors size. I prefer to run with 6k context. Honestly, I just want to test feasibility, i.e. whether I could run this on free Colab with exl2.

I can definitely run a 20B at 4 bpw with the 8-bit cache, albeit with strained generation speed (4-7 tokens per second). I did try a 20B at 3 bpw and found it sufficient, though I'm afraid tokens per second might tank to undesirable levels here. I'm doing this for RP purposes, so do tell me if tokens per second would drop considerably running this at 2.4 bpw on free Colab; if so, it might not be worth it to begin with. Also, on Hugging Face the 3bpw version appears as 13.8 GB, and I'm not sure why, but keep in mind that the Hugging Face listing of a 20B 4 bpw appears as 10.33 GB.
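Roughly the kind of setup I mean (8-bit K/V cache at 6k context, then timing tokens per second), sketched against the exllamav2 example API; exact class names may vary by version and the path and prompt are placeholders:

# 6k context with the 8-bit cache, plus a crude tokens/sec measurement
import time
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "Smaug-34B-v0.1-2.4bpw-h6-exl2"  # placeholder path
config.prepare()
config.max_seq_len = 6144  # 6k context

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_8bit(model, lazy=True)  # 8-bit cache halves K/V memory vs FP16
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()

new_tokens = 200
start = time.time()
generator.generate_simple("A placeholder RP prompt goes here.", settings, new_tokens)
print(f"{new_tokens / (time.time() - start):.1f} tokens/sec")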

2.4 uploading. Size:
11G Smaug-34B-v0.1-2.4bpw-h6-exl2

I'm not sure about the speed of the free Colab GPUs. exl2 should be fast with an 11 GB model on 20/30/40-series NVIDIA GPUs.

Wow, I didn't expect you to upload that fast. Thanks very much. I ran the 2.4 bpw recently with 6k context and the 8-bit cache, and surprisingly VRAM usage peaked at 13.1 GB, while generation speed was good and sufficient for RP purposes (3-5 tokens per second). That means I could probably run a 3bpw at 6k context, but I'll test that later. Maybe a 2.65bpw would be optimal; I'll let you know later if need be. Again, thanks for obliging my request.

Alright, so 3bpw didn't work; it has a base starting VRAM of 14.5 GB, and usually about 1 GB extra is used during generation. So could you perhaps make a 2.65 bpw, if that's okay with you? I'm aiming for a base starting VRAM of 14 GB.

2.65bpw up, 12 GB size.

Thanks, it worked! Peak usage this time reached around 14.1 GB. Alright, for a possible second-to-last request: can you make a 2.7bpw of Kyllene 34B?

Alright, thanks for the recent quant; it reached a VRAM usage peak of 14.3 GB. For the last request, can you make 2.75 and 2.8bpw of Kyllene? I think that's the optimal bpw range for 15 GB of VRAM.

Thanks for the recent 2.8 bpw quant; it reached a peak usage of 14.7 GB of VRAM. I think I could push to 2.85 bpw, which would give me a peak usage of around 14.9/15 GB, but it might give me an out-of-memory error.

Clevyby changed discussion status to closed
