From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens
Abstract
Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management, and repetitive generation. To address these issues, we introduce TOKENSWIFT, a novel framework designed to substantially accelerate the generation of ultra-long sequences while maintaining the target model's inherent quality. Experimental results demonstrate that TOKENSWIFT achieves more than a 3× speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths. Code can be found at https://github.com/bigai-nlco/TokenSwift.
Community
TokenSwift is a novel framework designed to substantially accelerate the generation process of ultra-long sequences, up to 100K tokens, while maintaining the target model's inherent quality.
| Highlights | Description |
|---|---|
| Speed | 3× faster than vanilla Transformers |
| Lossless | Matches original model's output quality |
| Scalability | Linear time complexity for 100K+ sequences |
| Plug & Play | Works with most HuggingFace models |
Code: https://github.com/bigai-nlco/TokenSwift
Paper: https://arxiv.org/abs/2502.18890
Model: https://huggingface.co/TokenSwift
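For readers unfamiliar with the lossless draft-and-verify idea that speculative-style methods (which TokenSwift builds on and extends) rely on, the sketch below shows a minimal greedy variant using the HuggingFace transformers API. The model names, draft length, and acceptance rule here are illustrative assumptions for a generic draft-and-verify loop, not TokenSwift's actual implementation; see the repository above for the real system.

```python
# Minimal greedy draft-and-verify sketch (generic speculative-style decoding,
# NOT TokenSwift's actual algorithm). Model names and the draft length k are
# illustrative assumptions; both models are assumed to fit on one GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DEVICE = "cuda"
TARGET_NAME = "meta-llama/Llama-3.1-8B"   # target model (assumption)
DRAFT_NAME = "meta-llama/Llama-3.2-1B"    # small draft model (assumption)

tok = AutoTokenizer.from_pretrained(TARGET_NAME)
target = AutoModelForCausalLM.from_pretrained(TARGET_NAME, torch_dtype=torch.bfloat16).to(DEVICE)
draft = AutoModelForCausalLM.from_pretrained(DRAFT_NAME, torch_dtype=torch.bfloat16).to(DEVICE)


@torch.no_grad()
def draft_and_verify(prompt: str, max_new_tokens: int = 256, k: int = 8) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids.to(DEVICE)
    start = ids.shape[1]
    while ids.shape[1] - start < max_new_tokens:
        # 1) The small model drafts up to k tokens greedily.
        drafted = draft.generate(ids, max_new_tokens=k, do_sample=False,
                                 pad_token_id=tok.eos_token_id)
        proposal = drafted[:, ids.shape[1]:]                    # (1, k') drafted tokens
        k_prop = proposal.shape[1]

        # 2) The target model scores prompt + draft in a single forward pass.
        logits = target(drafted).logits                         # (1, L + k', vocab)
        # Greedy target predictions for each drafted position plus one bonus token.
        preds = logits[:, ids.shape[1] - 1:, :].argmax(dim=-1)  # (1, k' + 1)

        # 3) Accept the longest prefix where draft and target agree (lossless
        #    under greedy decoding), then append the target's own next token.
        agree = (preds[:, :k_prop] == proposal).long().cumprod(dim=-1)
        n_accept = int(agree.sum().item())
        next_tok = preds[:, n_accept:n_accept + 1]
        ids = torch.cat([ids, proposal[:, :n_accept], next_tok], dim=-1)

        if next_tok.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0, start:], skip_special_tokens=True)


print(draft_and_verify("Write a long story about a lighthouse keeper.", max_new_tokens=128))
```

Because every emitted token is either verified against or produced by the target model, the output matches what the target model would generate on its own under greedy decoding; the speedup comes from checking an entire draft in one target forward pass. As the abstract notes, naively extending this scheme to ultra-long outputs is not enough, which is why TokenSwift additionally tackles model reloading, KV-cache management, and repetitive generation.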
The following papers, recommended by the Semantic Scholar API, are similar to this one:
- DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting (2025)
- Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding (2025)
- GRIFFIN: Effective Token Alignment for Faster Speculative Decoding (2025)
- QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache (2025)
- CodeSwift: Accelerating LLM Inference for Efficient Code Generation (2025)
- Long-Context Inference with Retrieval-Augmented Speculative Decoding (2025)
- APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs (2025)