4bpw

opened by Nephilim

Hi, would it be possible for you to make an exact 4bpw quant?

Sure, what are you trying to fit it on where 4bpw would fit better?

An RTX 4060 Ti 16GB.
I'm currently using a 4bpw version of the base Buttercup model and it fits perfectly on my card with the max context (32k).

Hmm, that seems surprising; by my math, 32k context at 4bpw should take ~16.7 GB, but I'll make it and check whether I'm calculating wrong.
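For reference, here's a back-of-envelope sketch of where that ~16.7 GB figure comes from, assuming a Mixtral-style 4x7B (~24B total parameters, 32 layers, 8 KV heads of dim 128) with an FP16 KV cache; the shapes are assumptions for illustration, not pulled from the model config:

```python
# Rough VRAM estimate for a 4bpw EXL2 quant of a Mixtral-style 4x7B.
# All shapes below are assumptions for illustration.
total_params = 24.2e9  # ~24B total parameters for a 4x7B MoE
bpw = 4.0              # bits per weight of the quant
n_layers = 32
n_kv_heads = 8         # grouped-query attention
head_dim = 128
context = 32768
cache_bytes = 2        # FP16 KV cache; halves with the 8-bit cache option

weights_gb = total_params * bpw / 8 / 1e9
# K and V each store context * n_kv_heads * head_dim elements per layer
kv_cache_gb = 2 * n_layers * context * n_kv_heads * head_dim * cache_bytes / 1e9

print(f"weights:  {weights_gb:.1f} GB")                # ~12.1 GB
print(f"KV cache: {kv_cache_gb:.1f} GB")               # ~4.3 GB
print(f"total:    {weights_gb + kv_cache_gb:.1f} GB")  # ~16.4 GB + overhead
```

With activation buffers on top, that lands right around 16.7 GB, past what a 16 GB card can hold.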

@Nephilim 4.0 is up: https://huggingface.co/bartowski/Buttercup-4x7B-V2-laser-exl2/tree/4_0

Let me know if it works and what your final usage looks like; if it makes more sense for a 16GB card, I'll add it for future quants of this size.
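In case it helps anyone else: the EXL2 quants in this repo live one bpw per branch, so you have to request the revision explicitly when downloading. A minimal sketch with huggingface_hub (the local directory name is just an example):

```python
# Download the 4.0bpw branch of the EXL2 repo; local_dir is an example path.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="bartowski/Buttercup-4x7B-V2-laser-exl2",
    revision="4_0",  # each bpw lives on its own branch
    local_dir="Buttercup-4x7B-V2-laser-exl2-4_0",
)
```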

Oh, thanks, I will test it.

Worked very well here, thanks again.

@Nephilim Are you sure that you're not overflowing onto system RAM? When loading the 4bpw model with 32k context I hit 16.8GB usage

100% sure; I've disabled system memory fallback, and it runs at ~10 tokens/s here.

Fascinating... what's your setup? I wonder if TGWUI adds some overhead that I don't realize.

I'm using the latest version of oobabooga, with the 8-bit cache option enabled.

Ahhhh, the 8-bit cache explains it!
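That closes the gap with the earlier estimate: the 8-bit cache stores one byte per KV element instead of two, halving the cache term. Redoing the arithmetic under the same assumed shapes:

```python
# Same assumptions as the earlier sketch, but with the 8-bit KV cache.
weights_gb = 12.1                  # ~24B params at 4 bits per weight
fp16_cache_gb = 4.3                # FP16 KV cache at 32k context
int8_cache_gb = fp16_cache_gb / 2  # 1 byte per element instead of 2

print(f"total: {weights_gb + int8_cache_gb:.1f} GB")  # ~14.2 GB, fits in 16 GB
```

That explains how it fits on a 16 GB card even at the full 32k context.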
