- Wide Residual Networks
  Paper • 1605.07146 • Published • 2
- Characterizing signal propagation to close the performance gap in unnormalized ResNets
  Paper • 2101.08692 • Published • 2
- Pareto-Optimal Quantized ResNet Is Mostly 4-bit
  Paper • 2105.03536 • Published • 2
- When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations
  Paper • 2106.01548 • Published • 2
Collections including paper arxiv:2404.07129
- Resonance RoPE: Improving Context Length Generalization of Large Language Models
  Paper • 2403.00071 • Published • 22
- Scaling Laws of RoPE-based Extrapolation
  Paper • 2310.05209 • Published • 6
- Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
  Paper • 2404.12387 • Published • 38
- OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework
  Paper • 2404.14619 • Published • 124
- Linear Transformers with Learnable Kernel Functions are Better In-Context Models
  Paper • 2402.10644 • Published • 78
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
  Paper • 2305.13245 • Published • 5
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
  Paper • 2402.15220 • Published • 19
- Sequence Parallelism: Long Sequence Training from System Perspective
  Paper • 2105.13120 • Published • 5
- Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics
  Paper • 2410.21272 • Published • 1
- Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders
  Paper • 2410.20526 • Published • 1
- Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
  Paper • 2410.15999 • Published • 17
- Decomposing The Dark Matter of Sparse Autoencoders
  Paper • 2410.14670 • Published • 1