Missing the RMSNorm and activation_quant?
According to the original paper, those two components were also used.
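For reference, here is a minimal sketch of the two components, assuming the formulation from the paper's follow-up training-tips note (the `BitLinear` wrapper and names here are my own framing, and `nn.RMSNorm` requires PyTorch >= 2.4):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def activation_quant(x: torch.Tensor) -> torch.Tensor:
    # Per-token absmax quantization of the activations to 8 bits.
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    return (x * scale).round().clamp(-128, 127) / scale

class BitLinear(nn.Linear):
    """Activation path of the paper's BitLinear: RMSNorm, then 8-bit quant."""
    def __init__(self, in_features: int, out_features: int, bias: bool = False):
        super().__init__(in_features, out_features, bias=bias)
        self.norm = nn.RMSNorm(in_features)  # norm applied before quantization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_n = self.norm(x)
        # Straight-through estimator: quantized values on the forward pass,
        # full-precision gradients on the backward pass.
        x_q = x_n + (activation_quant(x_n) - x_n).detach()
        return F.linear(x_q, self.weight, self.bias)  # weight quant omitted here
```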
That's true; they are intentionally omitted in the current version, since we want to test the performance module by module. I will update the README to make sure this is clearly noted.
Another reason is that, if I understand correctly, the activation quant on the input and the additional norm would actually make the model slower than the original model. You don't want to run quantization functions during inference. Based on my tests, the model runs 2-3x faster during inference once the weight quant is removed, which means on-the-fly quantization significantly hurts inference efficiency. I would expect a similar slowdown from applying their activation quant, so the model would very likely run faster without it. A sketch of the difference follows.
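To make the speed argument concrete, here is a rough sketch (my own class names, not this repo's actual modules) contrasting quantize-on-every-forward with quantize-once-offline, using the absmean ternary weight quant from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def weight_quant(w: torch.Tensor) -> torch.Tensor:
    # Ternary (1.58-bit) weight quantization: absmean scaling, round, clip.
    scale = 1.0 / w.abs().mean().clamp(min=1e-5)
    return (w * scale).round().clamp(-1, 1) / scale

class BitLinearTrain(nn.Linear):
    """Quantizes the weights on every forward pass (needed for QAT, slow at inference)."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        w_q = w + (weight_quant(w) - w).detach()  # straight-through estimator
        return F.linear(x, w_q, self.bias)

class BitLinearInfer(nn.Linear):
    """Weights frozen to their quantized values once; forward is a plain matmul."""
    @torch.no_grad()
    def freeze(self) -> None:
        self.weight.copy_(weight_quant(self.weight))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.weight, self.bias)
```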
https://github.com/microsoft/BitBLAS
So based on the benchmarks that were reported, there is a significant speed-up when using INT8xINT8 (packing four 2-bit params into one INT8) for BitLinear. I'm running some tests with this to verify. The biggest concern imo is the information loss from quantizing both the inputs and the weights in a small model.
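For what it's worth, here is a plain-PyTorch illustration of the four-2-bit-codes-per-INT8 packing. This is just the storage idea, not BitBLAS's actual API; a real kernel would consume the packed buffer directly rather than unpacking it:

```python
import torch

def pack_ternary(w: torch.Tensor) -> torch.Tensor:
    """Pack ternary weights {-1, 0, +1} into int8, four 2-bit codes per byte.

    Assumes w.numel() is a multiple of 4.
    """
    codes = (w + 1).to(torch.uint8).flatten()  # map {-1, 0, +1} -> {0, 1, 2}
    codes = codes.reshape(-1, 4)
    packed = (codes[:, 0]
              | (codes[:, 1] << 2)
              | (codes[:, 2] << 4)
              | (codes[:, 3] << 6))
    return packed.view(torch.int8)  # reinterpret bits: 4x smaller than one byte per weight

def unpack_ternary(packed: torch.Tensor, shape: torch.Size) -> torch.Tensor:
    """Recover the ternary weights from the packed buffer (for checking)."""
    b = packed.view(torch.uint8)
    codes = torch.stack([(b >> s) & 0b11 for s in (0, 2, 4, 6)], dim=1)
    return (codes.flatten().to(torch.int8) - 1).reshape(shape)

# Round-trip check on a random ternary weight matrix.
w = torch.randint(-1, 2, (128, 256)).to(torch.int8)
assert torch.equal(unpack_ternary(pack_ternary(w), w.shape), w)
```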