Thank you for posting memory usage
Memory usage tests

2.65bpw
context 16k, cache 16-bit: 46.9GiB (fits in 2x 3090)
context 32k, cache 8-bit: 47.0GiB (fits in 2x 3090)

3bpw
context 8k, cache 16-bit: 47.4GiB (fits in 2x 3090)
context 16k, cache 8-bit: 47.4GiB (fits in 2x 3090)

4.35bpw
context 16k, cache 16-bit: 70.1GiB (fits in 3x 3090)
context 32k, cache 8-bit: 70.3GiB (fits in 3x 3090)
context 32k, cache 16-bit: 78.7GiB (fits in A100 80GB)
Just wanted to say, thank you SO MUCH for posting detailed memory usage for the various quants.
This has been missing from many people's model cards, and for those of us running local LLMs it's a key factor in choosing a quant.
Please keep it up, you are a model citizen.
Thanks! I had calculated the exact bpw needed to hit these memory targets in advance; the tests were just to verify that it worked :)
You can do it for your own quants fairly easily:
Start with a quant of any bpw for the model size you want (e.g. any exl2 70B, or 120B, or whatever).
Load the model on an idle A100 80GB with zero context. The value you get from nvidia-smi is your base memory size; it scales pretty much linearly with the bpw of the quant.
Then load it again at each of the useful context lengths and cache bit widths you care about (e.g. 16k/32k x 8/16-bit).
The amount this uses above the base memory size is how much space the context takes up at that configuration; it stays the same regardless of bpw.
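If you want to script the readings, here's a minimal sketch (assuming a standard nvidia-smi install; it just sums memory.used across GPUs):

```python
# Sketch: read current GPU memory usage via nvidia-smi while the model is
# loaded. Run once with zero context (base memory), then once per context
# configuration; the difference is the context cost for that configuration.
import subprocess

def gpu_mem_used_gib():
    """Total memory.used across all GPUs, in GiB, as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    # One line per GPU, value in MiB.
    return sum(int(line) for line in out.split()) / 1024

print(f"{gpu_mem_used_gib():.1f} GiB in use")
```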
Then you can easily put it together: max bpw = (target mem - context mem) / base mem * base bpw.
Leave about 1GiB of slack to account for allocation weirdness, only being able to split between GPUs at layer boundaries, and people running desktop environments.
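And here's a minimal sketch of that final calculation; the numbers and names are placeholders you'd swap for your own measurements:

```python
# Sketch of the max-bpw calculation described above.
# All numbers here are illustrative placeholders -- substitute your own
# nvidia-smi measurements for the quant/model you actually tested.

# Measured once, with zero context, on an otherwise idle GPU:
base_bpw = 4.0        # bpw of the reference quant you loaded
base_mem_gib = 35.0   # memory reading for that quant at zero context

# Measured per context configuration (context length x cache bit width);
# context memory is roughly independent of bpw, so measure it once each:
context_mem_gib = {
    ("16k", 16): 5.0,
    ("32k", 8): 5.2,
}

def max_bpw(target_mem_gib, context_cfg, slack_gib=1.0):
    """Largest bpw whose weights plus context fit in target_mem_gib.

    Weight memory scales roughly linearly with bpw, so:
        max bpw = (target - context - slack) / base mem * base bpw
    """
    ctx = context_mem_gib[context_cfg]
    return (target_mem_gib - ctx - slack_gib) / base_mem_gib * base_bpw

# e.g. two 24GiB 3090s at 16k context with a 16-bit cache:
print(round(max_bpw(48.0, ("16k", 16)), 2))
```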