Request for another exl2 quant.

#1
by Clevyby - opened

Hello, I would like to try to use this on 15 GB of VRAM, and I estimate the necessary quant would be around 2-2.3 bpw, or equivalently a safetensors size of about 10.3 GB, which matches a 4 bpw quant of a 20B model. Do you think that's feasible quality-wise?
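(Back-of-the-envelope, assuming weight file size ≈ parameter count × bpw / 8 and ignoring the higher-precision output head and embeddings, so real files come out a bit larger:)

# Rough size estimate for an exl2 quant: params × bpw / 8 bytes
def est_size_gb(params_billion: float, bpw: float) -> float:
    return params_billion * 1e9 * bpw / 8 / 1e9

print(est_size_gb(34, 2.3))  # ~9.8 GB for a 34B at 2.3 bpw
print(est_size_gb(20, 4.0))  # ~10.0 GB for a 20B at 4.0 bpw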

Have you tried loading the 3.0bpw model? Here are the raw file sizes of the various bpw quants:

$ du -sh Smaug-34B-v0.1-*
13G     Smaug-34B-v0.1-3.0bpw-h6-exl2
17G     Smaug-34B-v0.1-4.0bpw-h6-exl2
20G     Smaug-34B-v0.1-4.65bpw-h6-exl2
21G     Smaug-34B-v0.1-5.0bpw-h6-exl2
25G     Smaug-34B-v0.1-6.0bpw-h6-exl2
33G     Smaug-34B-v0.1-8.0bpw-h8-exl2

2.4 bpw for 70B models is still coherent, but you do lose abilities. You want to get as many bits as possible, especially for smaller models. I would reduce the context to 2048, test the quality of the 3.0bpw model, and see how good it is for your use case.
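Something along these lines should work for a quick quality check; it's a sketch based on the exllamav2 example scripts (class names may differ slightly between versions), with the model path as a placeholder:

# Load the 3.0bpw quant with a reduced 2048-token context and generate a sample
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "Smaug-34B-v0.1-3.0bpw-h6-exl2"  # placeholder path
config.prepare()
config.max_seq_len = 2048  # shrink the context to save VRAM

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

print(generator.generate_simple("Write a short scene between two rivals:", settings, 200))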

I'll add 2.4bpw to the queue.

Thanks, I'll try it later. When I ran a 20B at 4 bpw with 6k context and the 8-bit cache, I reached a usage of 14.7/15 GB. I tried a 4x7B at 4 bpw once, but I could only run it at 3 bpw, which is why I specified the safetensors size. I prefer to run with 6k context. Honestly, I just want to test feasibility, i.e. whether I could run this on free Colab with exl2.

I can definitely run a 20B at 4 bpw with the 8-bit cache, albeit with strained generation speed (4-7 tokens per second). I did try a 20B at 3 bpw and found it sufficient, though I'm afraid tokens per second might tank to undesirable levels here. I'm doing this for RP purposes, so do tell me if tokens per second would drop considerably running this at 2.4 bpw on free Colab; if so, it might not be worth it to begin with. Also, on Hugging Face the 3bpw version appears as 13.8 GB, and I'm not sure why, but keep in mind that the Hugging Face listing of a 20B 4 bpw appears as 10.33 GB.
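Roughly the kind of setup I mean (8-bit K/V cache at 6k context, then timing tokens per second), sketched against the exllamav2 example API; exact class names may vary by version and the path and prompt are placeholders:

# 6k context with the 8-bit cache, plus a crude tokens/sec measurement
import time
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "Smaug-34B-v0.1-2.4bpw-h6-exl2"  # placeholder path
config.prepare()
config.max_seq_len = 6144  # 6k context

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_8bit(model, lazy=True)  # 8-bit cache halves K/V memory vs FP16
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()

new_tokens = 200
start = time.time()
generator.generate_simple("A placeholder RP prompt goes here.", settings, new_tokens)
print(f"{new_tokens / (time.time() - start):.1f} tokens/sec")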

2.4 uploading. Size:
11G Smaug-34B-v0.1-2.4bpw-h6-exl2

I'm not sure about the speed of the free Colab GPUs. exl2 should be fast with an 11 GB model on 20/30/40-series NVIDIA GPUs.

Wow, I didn't expect you to upload that fast. Thanks very much. I ran the 2.4 bpw recently with 6k context and the 8-bit cache, and surprisingly VRAM usage peaked at 13.1 GB, while generation speed was good and sufficient for RP purposes (3-5 tokens per second). That means I could probably run a 3bpw at 6k context, but I'll test that later. Maybe a 2.65bpw would be optimal; I'll let you know later if need be. Again, thanks for obliging my request.

Alright, so 3bpw didn't work; it has a base starting VRAM of 14.5 GB, and usually about 1 GB extra is used during generation. So could you perhaps make a 2.65 bpw, if that's okay with you? I'm aiming for a base starting VRAM of 14 GB.

2.65bpw up, 12 GB size.

Thanks, it worked! Peak usage this time reached around 14.1 GB. Alright, for a possible second-to-last request: can you make a 2.7bpw of Kyllene 34B?

Alright, thanks for the recent quant; it reached a VRAM usage peak of 14.3 GB. For the last request, can you make 2.75 and 2.8bpw of Kyllene? I think that's the optimal bpw range for 15 GB of VRAM.

Thanks for the recent 2.8 bpw quant; it reached a peak usage of 14.7 GB of VRAM. I think I could push to 2.85 bpw, which would give me a peak usage of around 14.9/15 GB, but it might give me an out-of-memory error.

Clevyby changed discussion status to closed
