E^2-LLM: Efficient and Extreme Length Extension of Large Language Models
Abstract
Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. Existing long-context extension methods usually need additional training procedures to support the corresponding long context windows, which requires long-context training data (e.g., 32k) and incurs high GPU training costs. To address these issues, we propose an Efficient and Extreme length extension method for Large Language Models, called E^2-LLM, which requires only one training procedure, dramatically reduces the computation cost, and removes the need to collect long-context data. Concretely, first, the training data for E^2-LLM only requires a short length (e.g., 4k), which greatly reduces the tuning cost. Second, the training procedure on the short context window is performed only once, and different evaluation context windows can be supported at inference. Third, in E^2-LLM, based on RoPE position embeddings, we introduce two different augmentation methods on the scale and position index parameters for different samples in training, which makes the model more robust to different relative distances when directly interpolating to arbitrary context lengths at inference. Comprehensive experimental results on multiple benchmark datasets demonstrate the effectiveness of E^2-LLM on challenging long-context tasks.
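As a rough illustration of the per-sample augmentation described above, here is a minimal, hypothetical sketch: each short (4k) training sample draws a RoPE scale factor g and a position-index offset before building the cos/sin tables. The function name, the sampling distributions, and the specific value ranges are assumptions for illustration, not the paper's exact procedure.

```python
import random
import torch

def sample_rope_cos_sin(train_len=4096, max_scale=16, rope_dim=128, base=10000.0):
    # Hypothetical per-sample augmentation: draw a scale factor g and a
    # position-index offset so that short 4k samples cover many different
    # relative-position ranges and interpolation densities during training.
    g = random.choice([1, 2, 4, 8, max_scale])                # scale augmentation (assumed choices)
    offset = random.randint(0, (max_scale - g) * train_len)   # position-index augmentation (assumed range)

    # Standard RoPE inverse frequencies; positions are shifted by the sampled
    # offset and divided by g, as in positional interpolation.
    inv_freq = 1.0 / (base ** (torch.arange(0, rope_dim, 2).float() / rope_dim))
    positions = (torch.arange(train_len).float() + offset) / g
    angles = torch.outer(positions, inv_freq)                 # (train_len, rope_dim // 2)
    return angles.cos(), angles.sin()                         # consumed by the usual RoPE rotation
```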
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Extending LLMs' Context Window with 100 Samples (2024)
- LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning (2024)
- Flexibly Scaling Large Language Models Contexts Through Extensible Tokenization (2024)
- Extending Context Window of Large Language Models via Semantic Compression (2023)
- Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention (2023)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
Hi folks, after performing the sampled scale + shift fine-tuning, do you see the resulting model improve extrapolation at long sequences (beyond the previously trained context window) without scaling up g, i.e., for free?
A common suspicion many people have is that the self-attention overfits the (admittedly very sparse) integer relative positions (e.g., 0..2048 or 0..4096), coupled with some approximation-theoretic failures. This could be why extrapolation fails so catastrophically: the attention doesn't learn the necessary representations to use the rotary encoding (e.g., the rotational invariance) and instead overfits an approximation (maybe a polynomial) that fails catastrophically at the training boundary.
The scheme presented in $E^2$-LLM seems to resolve the sparsity issue, and if the suspicion is correct, you should also see a corresponding improvement in extrapolation without positional interpolation during inference (as long as self-attention finds a way to learn the proper representation for the rotary encoding).
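One way to check this hypothesis is to evaluate the fine-tuned model at a long sequence length both with and without positional interpolation, simply by changing the RoPE scale used at inference. The sketch below is a hypothetical setup: `evaluate_perplexity`, `model`, and `long_eval_set` are placeholders, not code from the paper.

```python
import torch

def rope_cos_sin(seq_len, scale=1.0, rope_dim=128, base=10000.0):
    # scale=1.0 -> plain extrapolation (no positional interpolation, PI);
    # scale=g>1 -> PI: positions are compressed back into the trained range.
    inv_freq = 1.0 / (base ** (torch.arange(0, rope_dim, 2).float() / rope_dim))
    positions = torch.arange(seq_len).float() / scale
    angles = torch.outer(positions, inv_freq)
    return angles.cos(), angles.sin()

# If the overfitting suspicion is right, a model fine-tuned with the sampled
# scale/shift augmentation should degrade far less at 16k with scale=1.0 (no PI)
# than the base model does.
for scale in (1.0, 4.0):
    cos, sin = rope_cos_sin(seq_len=16384, scale=scale)
    # ppl = evaluate_perplexity(model, long_eval_set, cos, sin)  # placeholder evaluation loop
    # print(f"scale={scale}: ppl={ppl:.2f}")
```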
Extending AI's Memory: E2-LLM Breakthrough in Large Language Models
Links:
- Subscribe: https://www.youtube.com/@Arxflix
- Twitter: https://x.com/arxflix
- LMNT (Partner): https://lmnt.com/