- XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference
  Paper • 2404.15420 • Published • 7
- OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework
  Paper • 2404.14619 • Published • 124
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
  Paper • 2404.14219 • Published • 253
- How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study
  Paper • 2404.14047 • Published • 44
Collections including paper arxiv:2306.12929
- Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
  Paper • 2306.12929 • Published • 12
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
  Paper • 2309.02784 • Published • 1
- QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models
  Paper • 2310.08041 • Published • 1
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
  Paper • 2402.17764 • Published • 602

- Replacing softmax with ReLU in Vision Transformers
  Paper • 2309.08586 • Published • 17
- Softmax Bias Correction for Quantized Generative Models
  Paper • 2309.01729 • Published • 1
- The Closeness of In-Context Learning and Weight Shifting for Softmax Regression
  Paper • 2304.13276 • Published • 1
- Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
  Paper • 2306.12929 • Published • 12

- Efficient Memory Management for Large Language Model Serving with PagedAttention
  Paper • 2309.06180 • Published • 25
- LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models
  Paper • 2308.16137 • Published • 39
- Scaling Transformer to 1M tokens and beyond with RMT
  Paper • 2304.11062 • Published • 2
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
  Paper • 2309.14509 • Published • 17

- LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
  Paper • 2310.08659 • Published • 22
- QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
  Paper • 2309.14717 • Published • 44
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
  Paper • 2309.02784 • Published • 1
- ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers
  Paper • 2309.16119 • Published • 1