EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
Abstract
Large language models (LLMs) have proven far superior to conventional methods on a wide range of tasks. However, their heavy computational cost and high memory requirements make them prohibitively expensive to deploy. Model quantization is an effective way to reduce this overhead. The problem is that in most previous works, the quantized model is calibrated on a few samples from the training data, which may hurt the generalization of the quantized LLM to unseen cases and tasks. Hence, in this work we explore an important question: can we design a data-independent quantization method for LLMs that guarantees their generalization performance? We propose EasyQuant, a training-free and data-independent weight-only quantization algorithm for LLMs. Our observations indicate that two factors, outliers in the weights and the quantization ranges, are essential for reducing the quantization error. In EasyQuant, we therefore leave the outliers (less than 1% of the weights) unchanged and optimize the quantization range to reduce the reconstruction error. With these techniques, we surprisingly find that EasyQuant achieves performance comparable to the original model. Since EasyQuant does not depend on any training data, the generalization performance of the quantized LLMs is safely guaranteed. Moreover, EasyQuant can be implemented in parallel, so the quantized model can be obtained within a few minutes even for LLMs with over 100B parameters. To the best of our knowledge, this is the first work to achieve almost lossless quantization performance for LLMs in a data-independent setting, and our algorithm runs over 10 times faster than data-dependent methods.
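Below is a minimal, hypothetical sketch of the kind of data-free weight-only scheme the abstract describes: the largest-magnitude weights (under 1%) are kept in full precision, and a per-channel quantization range (scale) is tuned by gradient descent to minimize the weight reconstruction error. The function name, hyperparameters, and the straight-through rounding are illustrative assumptions, not the authors' implementation.

```python
import torch


def easyquant_style_quantize(weight: torch.Tensor, bits: int = 4,
                             outlier_frac: float = 0.01,
                             iters: int = 100, lr: float = 1e-3):
    """Sketch of data-free weight-only quantization for a 2-D linear-layer
    weight: isolate the largest-magnitude weights, then optimize a
    per-output-channel scale to minimize the reconstruction error."""
    qmax = 2 ** (bits - 1) - 1

    # 1) Keep the top `outlier_frac` weights (by magnitude) in full precision.
    k = max(1, int(outlier_frac * weight.numel()))
    thresh = weight.abs().flatten().topk(k).values.min()
    outlier_mask = weight.abs() >= thresh
    dense = weight * (~outlier_mask)      # part that gets quantized
    outliers = weight * outlier_mask      # stored separately, left unchanged

    # 2) Initialize the per-channel scale from the min-max range.
    scale = (dense.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
    scale = scale.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([scale], lr=lr)

    def ste_round(x):
        # Straight-through estimator: round in forward, identity gradient in backward.
        return x + (torch.round(x) - x).detach()

    # 3) Tune the quantization range against the weight reconstruction error.
    for _ in range(iters):
        q = torch.clamp(ste_round(dense / scale), -qmax - 1, qmax)
        loss = ((q * scale - dense) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        q = torch.clamp(torch.round(dense / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale.detach(), outliers
```

Because nothing in this procedure touches activations or calibration data, each weight matrix can be processed independently, which is what makes the parallel, minutes-scale quantization claimed in the abstract plausible.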
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More (2024)
- LQER: Low-Rank Quantization Error Reconstruction for LLMs (2024)
- OneBit: Towards Extremely Low-bit Large Language Models (2024)
- L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ (2024)
- EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge (2024)
Interesting paper! Any plans on releasing the code implementation?
SqueezeLLM: Dense-and-Sparse Quantization
https://huggingface.co/papers/2306.07629
This paper looks similar to SqueezeLLM, which keeps a small fraction (0.4 to 1%) of salient weights in full precision and applies non-uniform quantization to better represent the non-outliers.
As you may have noticed, SqueezeLLM selects the important weights using Hessian-based sensitivity, so it is not perfectly data-free in the sense used in this paper. However, if you check out the SqueezeLLM repository on GitHub, you will see that the actual implementation also allows a magnitude threshold as the selection criterion.
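For reference, that magnitude-threshold variant amounts to a simple dense-and-sparse split; the snippet below is only an illustrative sketch (not SqueezeLLM's or EasyQuant's actual code), with the fraction and function name chosen arbitrarily:

```python
import torch


def magnitude_dense_sparse_split(weight: torch.Tensor, sparse_frac: float = 0.005):
    """Illustrative split: keep the `sparse_frac` largest-magnitude weights in a
    sparse full-precision matrix and return the rest for quantization. This is
    the data-free (magnitude) criterion, not Hessian-based sensitivity."""
    k = max(1, int(sparse_frac * weight.numel()))
    thresh = weight.abs().flatten().topk(k).values.min()
    mask = weight.abs() >= thresh
    sparse_part = (weight * mask).to_sparse_csr()   # salient weights, kept in full precision
    dense_part = weight * (~mask)                   # remainder, passed to the quantizer
    return dense_part, sparse_part
```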