Lee's RoPE Tricks / Context Extension Reads
Set of long-context papers (RoPE or otherwise) I'm collecting off of HF
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
Paper • 2402.13753 • Published • 111
Note 2/22/24 Type: PI, using an hparam search for the scale factor per dimension.
1. Observe the high/low-frequency inequality between the early/late hidden dimensions. Approach via progressive scaling over the dimension index (e.g. no scaling for dim 0 of 64; by dim 40 it's scaled down by ~30%).
2. Avoid scaling the first N tokens (avoids the attention sink).
3. Search for two parameters - (scale_i, window), with i being the hidden dimension #; the scale factor must monotonically increase as the dimension # increases.
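A minimal numpy sketch of that recipe (illustration only: the linear 1x-to-4x scale schedule, head_dim=64, and window=32 are placeholder assumptions, not the paper's searched values):

```python
import numpy as np

def rescaled_rope_angles(positions, head_dim=64, base=10000.0,
                         scales=None, window=32):
    """Per-dimension RoPE interpolation with an unscaled leading window.

    scales[i] >= 1 divides the position seen by frequency pair i; the
    schedule is monotonically non-decreasing in i, so late (low-frequency)
    dimensions get interpolated more. The first `window` tokens keep their
    original, unscaled positions (protecting the attention sink).
    """
    half = head_dim // 2
    if scales is None:
        # placeholder monotone schedule: 1x for the earliest pair, 4x for the last
        scales = np.linspace(1.0, 4.0, half)
    scales = np.asarray(scales, dtype=np.float64)
    inv_freq = base ** (-np.arange(half) / half)              # standard RoPE frequencies
    pos = np.asarray(positions, dtype=np.float64)[:, None]    # (T, 1)
    eff_pos = np.where(pos < window, pos, pos / scales[None, :])
    return eff_pos * inv_freq[None, :]                        # rotation angles, shape (T, half)

angles = rescaled_rope_angles(np.arange(8192))  # cos/sin of these feed the usual rotary attention
```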
Data Engineering for Scaling Language Models to 128K Context
Paper • 2402.10171 • Published • 21
Note 2/22/24 From the author: I tend to view the contribution as data and data alone - not only the data composition but also the data scale. When comparing this work with https://arxiv.org/abs/2309.16039, note a fundamental difference: we hypothesize that the long-context capability is already within the base model, and one only needs very lightweight continued pretraining to unlock it, i.e. only ~5B tokens of data. This is good news for research and open source.
LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration
Paper • 2402.11550 • Published • 15
Note In backlog
The What, Why, and How of Context Length Extension Techniques in Large Language Models -- A Detailed Survey
Paper • 2401.07872 • Published • 2
Note 2/5/24 Type: Survey. Seems to mostly copy (sometimes verbatim) from the surveyed work, and does not include every RoPE trick. PEs covered: ALiBi, RoPE, Random PE (missing NoPE, T5, APE). RoPE tricks covered: Linear PE, YaRN, PoSE
A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts
Paper • 2402.09727 • Published • 35
Note In backlog
In Search of Needles in a 10M Haystack: Recurrent Memory Finds What LLMs Miss
Paper • 2402.10790 • Published • 40
Note In backlog
LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
Paper • 2401.01325 • Published • 26
Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon
Paper • 2401.03462 • Published • 26
YaRN: Efficient Context Window Extension of Large Language Models
Paper • 2309.00071 • Published • 65
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
Paper • 2401.02669 • Published • 14
Extending Context Window of Large Language Models via Semantic Compression
Paper • 2312.09571 • Published • 12
Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention
Paper • 2312.08618 • Published • 11
E^2-LLM: Efficient and Extreme Length Extension of Large Language Models
Paper • 2401.06951 • Published • 24
Extending LLMs' Context Window with 100 Samples
Paper • 2401.07004 • Published • 14
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
Paper • 2401.18079 • Published • 7
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Paper • 2402.02750 • Published • 3
CONFLATOR: Incorporating Switching Point based Rotatory Positional Encodings for Code-Mixed Language Modeling
Paper • 2309.05270 • Published • 1
Structured Packing in LLM Training Improves Long Context Utilization
Paper • 2312.17296 • Published • 2
Note 2/23/24 From the paper: "In this work, we take a step towards better context utilization in LCLMs. We focus on training data, keeping other components, such as the architecture and training objectives, unchanged. The broad question is how to organize training data to enhance long context capabilities?" I think 1.5 uses a flavor of this technique to some extent, in particular to disambiguate groups of articles packed together - likely custom separators and per-group attention masking for each packed group of articles.
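A rough sketch of what that packing-plus-masking guess could look like (hypothetical: SEP_ID, the group structure, and the block-diagonal mask are my assumptions, not something either paper specifies):

```python
import numpy as np

SEP_ID = 2  # hypothetical separator token id

def pack_groups(groups, max_len=8192):
    """Pack several groups of related articles into one training sequence.

    Articles inside a group are concatenated (separated by SEP_ID) and may
    attend to each other; the causal mask is blocked across groups, so
    unrelated groups sharing the sequence stay disambiguated.
    """
    tokens, group_ids = [], []
    for g, group in enumerate(groups):
        for article in group:
            tokens.extend(article + [SEP_ID])
            group_ids.extend([g] * (len(article) + 1))
    tokens, group_ids = tokens[:max_len], group_ids[:max_len]
    gid = np.array(group_ids)
    causal = np.tril(np.ones((len(tokens), len(tokens)), dtype=bool))
    mask = causal & (gid[:, None] == gid[None, :])   # block-diagonal by group
    return np.array(tokens), mask

# two groups of related articles packed into a single sequence
toks, mask = pack_groups([[[5, 6, 7], [8, 9]], [[10, 11], [12, 13, 14]]])
```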
Lost in the Middle: How Language Models Use Long Contexts
Paper • 2307.03172 • Published • 36
Note 2/23/24 U-shaped performance: models are better at using information at the beginning (primacy bias) or end (recency bias) of the context. Performance drops when the relevant information is in the middle, indicating a limitation in handling long contexts. They likened this trend to the serial-position effect found in psychology.
In-Context Pretraining: Language Modeling Beyond Document Boundaries
Paper • 2310.10638 • Published • 28
Note 2/23/24 By simply reordering the pretraining data, this method offers a scalable way to significantly enhance the contextual reasoning abilities of language models. Instead of random documents, the models are trained on sequences of related documents. This simple change encourages the models to reason over longer contexts and learn relationships between documents, boosting performance on tasks requiring contextual understanding.
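A toy version of the reordering idea, as I understand it (greedy nearest-neighbour chaining over document embeddings; the paper's actual retrieval and ordering procedure is more involved):

```python
import numpy as np

def related_document_order(embs):
    """Greedy nearest-neighbour chain over document embeddings, so each
    document in the pretraining stream is followed by a similar one.
    embs: (N, d) array of document embeddings."""
    X = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sim = X @ X.T
    np.fill_diagonal(sim, -np.inf)
    order, visited = [0], {0}
    while len(order) < len(X):
        last = order[-1]
        nxt = max((i for i in range(len(X)) if i not in visited),
                  key=lambda i: sim[last, i])   # most similar unvisited doc
        order.append(nxt)
        visited.add(nxt)
    return order  # concatenate documents in this order before chunking into sequences

order = related_document_order(np.random.randn(100, 384))
```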
Do Transformers Need Deep Long-Range Memory?
Paper • 2007.03356 • Published • 1
ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
Paper • 2402.15220 • Published • 19
Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer
Paper • 2310.12442 • Published • 1
FIT: Far-reaching Interleaved Transformers
Paper • 2305.12689 • Published • 1
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
Paper • 2309.14509 • Published • 17
World Model on Million-Length Video And Language With RingAttention
Paper • 2402.08268 • Published • 36
Ring Attention with Blockwise Transformers for Near-Infinite Context
Paper • 2310.01889 • Published • 10
Scaling Laws of RoPE-based Extrapolation
Paper • 2310.05209 • Published • 6
InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory
Paper • 2402.04617 • Published • 4
LongHeads: Multi-Head Attention is Secretly a Long Context Processor
Paper • 2402.10685 • Published • 1
The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry
Paper • 2402.04347 • Published • 13
Simple linear attention language models balance the recall-throughput tradeoff
Paper • 2402.18668 • Published • 18
Functional Interpolation for Relative Positions Improves Long Context Transformers
Paper • 2310.04418 • Published • 4
Resonance RoPE: Improving Context Length Generalization of Large Language Models
Paper • 2403.00071 • Published • 22
Sequence Parallelism: Long Sequence Training from System Perspective
Paper • 2105.13120 • Published • 5
Yi: Open Foundation Models by 01.AI
Paper • 2403.04652 • Published • 62
Note
1. Ring self-attention for sequence parallelism
2. Long-context data engineering (incl. generalization AND long-range retrieval acc)
3. ABF (adjusted RoPE base frequency) = 10M
Very similar to LWM!
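The ABF point in a few lines (a sketch of adjusting the RoPE base frequency; the 10M base comes from the note, the 128-dim head is illustrative):

```python
import numpy as np

def rope_inv_freq(head_dim=128, base=10_000.0):
    """Standard RoPE inverse frequencies: base ** (-2i / head_dim)."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

short = rope_inv_freq(base=10_000.0)          # conventional pretraining base
long_ctx = rope_inv_freq(base=10_000_000.0)   # ABF: raise the base to 10M before long-context continue-training
# Every rotation slows down, the late (low-frequency) dims most of all,
# so distant positions remain distinguishable at much longer ranges.
print((long_ctx / short)[[0, 31, 63]])
```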
Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey
Paper • 2311.12351 • Published • 3
Transformer Language Models without Positional Encodings Still Learn Positional Information
Paper • 2203.16634 • Published • 5
Flexibly Scaling Large Language Models Contexts Through Extensible Tokenization
Paper • 2401.07793 • Published • 3
LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models
Paper • 2308.16137 • Published • 39
Randomized Positional Encodings Boost Length Generalization of Transformers
Paper • 2305.16843 • Published • 2
Empower Your Model with Longer and Better Context Comprehension
Paper • 2307.13365 • Published • 1
Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
Paper • 2403.09636 • Published • 2