eaddario posted an update 5 days ago
Squeezing out tensor bits?

I have been tinkering with quantization and pruning to reduce model sizes. So far, I've had modest success in producing, on average, 8% smaller versions with negligible loss of quality, and I think further reductions in the 10-15% range are realistic, but I've come across a behaviour I wasn't expecting!

Part of the process I'm following consists of quantizing the embedding and output layers aggressively. Since the embedding layer is more about lookup than complex computation, the relative distances between embedding vectors are usually preserved well enough to make this layer fairly robust to quantization. So far, so good.
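
For illustration, here's a rough way to sanity-check that intuition. The snippet below is a simplified stand-in for block-wise k-quants (per-block absmax scaling onto a tiny integer grid), not the actual Q2_K scheme, and the embedding table is random toy data:

```python
import numpy as np

def fake_block_quantize(x: np.ndarray, bits: int = 2, block: int = 32) -> np.ndarray:
    """Crude stand-in for block-wise k-quants: per-block absmax scaling
    onto a small signed-integer grid. NOT the real Q2_K scheme, just
    enough to show the kind of distortion it introduces."""
    flat = x.reshape(-1, block)
    levels = max(2 ** (bits - 1) - 1, 1)             # grid points per side
    scale = np.abs(flat).max(axis=1, keepdims=True) / levels
    scale[scale == 0] = 1.0                          # avoid division by zero
    q = np.clip(np.round(flat / scale), -levels, levels)
    return (q * scale).reshape(x.shape)

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 128)).astype(np.float32)   # toy embedding table
emb_q = fake_block_quantize(emb)

# How well does the relative geometry survive? Compare each original row
# with its quantized counterpart via cosine similarity.
num = (emb * emb_q).sum(axis=1)
den = np.linalg.norm(emb, axis=1) * np.linalg.norm(emb_q, axis=1)
print(f"mean per-row cosine similarity: {(num / den).mean():.3f}")
```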

The output layer, on the other hand, maps the final hidden state to the vocabulary logits, so even small changes in those logits could shift the probability distribution over the vocabulary and produce incorrect word predictions. Or so I thought.
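
To make the worry concrete, here's a toy sketch (random logits, with Gaussian noise standing in for quantization error; the 0.05 noise scale is an arbitrary assumption) of the two ways a perturbed output layer can hurt: flipping the top-1 token, and shifting the whole distribution:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
logits = rng.normal(size=32_000)                     # toy vocab-sized logits
noise = rng.normal(scale=0.05, size=logits.shape)    # quantization-like error

p, q = softmax(logits), softmax(logits + noise)

# Two ways the perturbation can matter: a different top-1 token, and a
# shifted distribution (measured here with KL divergence).
print("top-1 token changed:", logits.argmax() != (logits + noise).argmax())
print(f"KL(p || q) = {np.sum(p * np.log(p / q)):.6f}")
```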

Surprisingly, I'm finding that even at Q2_K the loss of overall capability is minimal. Was this to be expected, or am I missing something?

I have published a version with all the test results if you want to give it a try: eaddario/DeepSeek-R1-Distill-Qwen-7B-GGUF

I'll upload other models as time allows.

Any ideas / clarifications / suggestions are very much welcome!

Thank you for your research. How exactly did you test that the model quantized at Q2_K loses so little capability?


In this case, the Q2_K refers to the quantization of the embedding layer applied to each version of the model, rather than the overall quantization used. For example, the DeepSeek-R1-Distill-Qwen-7B-Q4_K_M model would have its embedding layer quantized at Q2_K instead of the usual Q4_K.
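
If anyone wants to reproduce the setup: recent llama.cpp builds let llama-quantize override individual tensor types, via flags like --token-embedding-type and --output-tensor-type, if I recall the names correctly. To double-check what a given file actually ended up with, here's a quick sketch using the gguf Python package, assuming its GGUFReader API and llama.cpp's usual tensor names (the file path is hypothetical):

```python
# Quick check of which quantization each tensor actually ended up with,
# using the `gguf` Python package maintained alongside llama.cpp.
from gguf import GGUFReader

reader = GGUFReader("DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf")  # hypothetical local path
for tensor in reader.tensors:
    # token_embd.weight is the embedding table, output.weight the output layer
    if tensor.name in ("token_embd.weight", "output.weight"):
        print(tensor.name, tensor.tensor_type.name, tuple(tensor.shape))
```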

Once all the quantized versions are generated, I then produce Perplexity, KL Divergence, ARC, HellaSwag, MMLU, TruthfulQA and WinoGrande scores for each version, using the test datasets documented in the model card.
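
In practice I rely on llama.cpp's tooling for the Perplexity and KL Divergence numbers, but for clarity, here is a minimal sketch of what those two metrics compute, using random logits as stand-ins for the fp16 baseline and a quantized variant:

```python
import numpy as np

def log_softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def perplexity(logits: np.ndarray, targets: np.ndarray) -> float:
    """exp(mean negative log-likelihood) of the reference tokens."""
    logp = log_softmax(logits)
    nll = -logp[np.arange(len(targets)), targets]
    return float(np.exp(nll.mean()))

def mean_kl(base_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """Mean per-token KL(base || quant): how far the quantized model's
    token distributions drift from the baseline's."""
    lp, lq = log_softmax(base_logits), log_softmax(quant_logits)
    return float((np.exp(lp) * (lp - lq)).sum(axis=-1).mean())

# Toy stand-ins: (num_tokens, vocab) logits for the fp16 baseline and a
# quantized variant, plus the reference next tokens.
rng = np.random.default_rng(2)
base = rng.normal(size=(512, 1000)).astype(np.float32)
quant = (base + rng.normal(scale=0.05, size=base.shape)).astype(np.float32)
targets = rng.integers(0, 1000, size=512)

print(f"baseline ppl:  {perplexity(base, targets):.2f}")
print(f"quantized ppl: {perplexity(quant, targets):.2f}")
print(f"mean KL:       {mean_kl(base, quant):.5f}")
```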