kaizuberbuehler's Collections
LM Architectures
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
Paper • 2404.08801 • Published • 64
RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
Paper • 2404.07839 • Published • 43
Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence
Paper • 2404.05892 • Published • 33
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Paper • 2312.00752 • Published • 139
Multi-Head Mixture-of-Experts
Paper • 2404.15045 • Published • 59
Jamba: A Hybrid Transformer-Mamba Language Model
Paper • 2403.19887 • Published • 104
KAN: Kolmogorov-Arnold Networks
Paper • 2404.19756 • Published • 108
Better & Faster Large Language Models via Multi-token Prediction
Paper • 2404.19737 • Published • 73
Contextual Position Encoding: Learning to Count What's Important
Paper • 2405.18719 • Published • 5
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Paper • 2405.21060 • Published • 64
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Paper • 2406.09415 • Published • 50
Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models
Paper • 2406.09416 • Published • 27
Transformers meet Neural Algorithmic Reasoners
Paper • 2406.09308 • Published • 43
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
Paper • 2406.07522 • Published • 37
Explore the Limits of Omni-modal Pretraining at Scale
Paper • 2406.09412 • Published • 10
Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B
Paper • 2406.07394 • Published • 26
VideoLLM-online: Online Video Large Language Model for Streaming Video
Paper • 2406.11816 • Published • 22
Mixture of A Million Experts
Paper • 2407.04153 • Published • 5
Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
Paper • 2407.12854 • Published • 29
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
Paper • 2407.21770 • Published • 22
Transformer Explainer: Interactive Learning of Text-Generative Models
Paper • 2408.04619 • Published • 156
Jamba-1.5: Hybrid Transformer-Mamba Models at Scale
Paper • 2408.12570 • Published • 31
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Paper • 2408.12528 • Published • 51
LLMs + Persona-Plug = Personalized LLMs
Paper • 2409.11901 • Published • 32
MonoFormer: One Transformer for Both Diffusion and Autoregression
Paper • 2409.16280 • Published • 18
Differential Transformer
Paper • 2410.05258 • Published • 169
Byte Latent Transformer: Patches Scale Better Than Tokens
Paper • 2412.09871 • Published • 86
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
Paper • 2412.01169 • Published • 12
Monet: Mixture of Monosemantic Experts for Transformers
Paper • 2412.04139 • Published • 12
MH-MoE: Multi-Head Mixture-of-Experts
Paper • 2411.16205 • Published • 24
Hymba: A Hybrid-head Architecture for Small Language Models
Paper • 2411.13676 • Published • 40
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration
Paper • 2411.10958 • Published • 52
BitNet a4.8: 4-bit Activations for 1-bit LLMs
Paper • 2411.04965 • Published • 64
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
Paper • 2411.04996 • Published • 50
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Paper • 2501.04519 • Published • 190