Adafactor: Adaptive Learning Rates with Sublinear Memory Cost • Paper • 1804.04235 • Published Apr 11, 2018
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining • Paper • 2305.10429 • Published May 17, 2023
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints • Paper • 2305.13245 • Published May 22, 2023