How is the performance of the model with only 2 bits?
#1 opened by DrNicefellow
Has anyone tested it?
You can find numbers for the base model and a comparison with bitsandbytes below:
Wikitext2 PPL/Memory: HQQ vs bitsandbytes (BNB)
#8-bit (group_size=128)
Mixtral-8x7B-v0.1 / BNB : 3.64 | (54.5 GB)
Mixtral-8x7B-v0.1 / HQQ : 3.63 | (47 GB)
#4-bit (group_size=64)
Mixtral-8x7B-v0.1 / BNB : 3.97 | (27 GB)
Mixtral-8x7B-v0.1 / HQQ : 3.79 | (26 GB)
#3-bit (group_size=128)
Mixtral-8x7B-v0.1 / HQQ : 4.76 | (21.8 GB)
#2-bit (group_size=16 | scale_g128/zero=8-bit):
Mixtral-8x7B-v0.1 / HQQ : 5.90 | (18 GB)
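
For reference, here is a minimal sketch of how the 2-bit settings in the table map onto HQQ's quantization config. It assumes the `hqq` package's Python API (`BaseQuantizeConfig`, `HQQModelForCausalLM`); treat it as indicative rather than the exact script we used:

```python
from hqq.core.quantize import BaseQuantizeConfig
from hqq.engine.hf import HQQModelForCausalLM

# 2-bit row from the table: group_size=16, with the per-group scales and
# zero-points themselves quantized to 8-bit (the "scale_g128/zero=8-bit"
# note; the scale group size of 128 is assumed to be the library default).
quant_config = BaseQuantizeConfig(
    nbits=2,
    group_size=16,
    quant_scale=True,  # quantize the per-group scales to 8-bit
    quant_zero=True,   # quantize the per-group zero-points to 8-bit
)

# Load the fp16 base model, then quantize it in place.
model = HQQModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
model.quantize_model(quant_config=quant_config)
```

The other rows correspond to the same call with nbits=8 / group_size=128, nbits=4 / group_size=64, and nbits=3 / group_size=128, respectively.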
I wouldn't recommend using the 2-bit model in production; use the 4-bit version instead. But we wanted to provide the community with a model that can run on a single 24 GB card, so people can play with it and see how it feels compared to other models. I have personally played with it, and the instruct model works surprisingly well with these 2-bit settings.
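
For anyone who wants to check the perplexity column themselves, below is a minimal sketch of a standard Wikitext2 PPL measurement over non-overlapping 2048-token windows. This is an assumed setup, not necessarily the exact script behind the numbers above:

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

@torch.no_grad()
def wikitext2_ppl(model, tokenizer, seq_len=2048, device="cuda"):
    # Concatenate the Wikitext2 test split into one long token stream.
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

    # Score fixed-length, non-overlapping chunks; any trailing partial
    # chunk is dropped.
    nlls, n_chunks = [], enc.shape[1] // seq_len
    for i in range(n_chunks):
        batch = enc[:, i * seq_len:(i + 1) * seq_len].to(device)
        loss = model(batch, labels=batch).loss  # mean NLL over the chunk
        nlls.append(loss.float())

    # Perplexity = exp(mean negative log-likelihood).
    return torch.exp(torch.stack(nlls).mean()).item()
```

Tokenization and windowing details shift PPL slightly, so expect small deviations from the table.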