Model Merging
Model merging is a popular technique for LLMs. Here is a roughly chronological list of papers in the space to help you get started.
Qualitatively characterizing neural network optimization problems
Paper • 1412.6544 • Published • 4
Note: Analyzes the optimization landscape of neural network training using linear interpolation experiments.
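A linear interpolation experiment can be sketched as follows. This is a minimal illustration, not the paper's code: the parameter vectors and the quadratic `loss` are hypothetical stand-ins for real network weights and a real training objective.

```python
def interpolate(theta_a, theta_b, alpha):
    """Return (1 - alpha) * theta_a + alpha * theta_b, element-wise."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def loss(theta):
    # Toy quadratic loss (assumption): stands in for a real training objective.
    return sum((t - 1.0) ** 2 for t in theta)


theta_init = [0.0, 0.0]   # hypothetical initial parameters
theta_final = [1.0, 1.0]  # hypothetical trained parameters

# Evaluate the loss at evenly spaced points on the line between the two.
alphas = [i / 4 for i in range(5)]
losses = [loss(interpolate(theta_init, theta_final, a)) for a in alphas]
```

Plotting `losses` against `alphas` gives the one-dimensional slice of the loss surface that such experiments inspect.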
Convergent Learning: Do different neural networks learn the same representations?
Paper • 1511.07543 • Published • 2
Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models
Paper • 1909.11299 • Published • 2
Note: Mixout is a regularization technique that improves the stability and performance of LLMs on downstream tasks by stochastically mixing the parameters of two models. Mixout acts as an L2 regularizer and prevents catastrophic forgetting and divergence.
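The stochastic mixing can be sketched like this, assuming flat lists of floats for the current weights `w` and the pretrained weights `u`. The function name, the argument layout, and the rescaling step (chosen so the result equals `w` in expectation) are assumptions of this sketch rather than code taken from the paper.

```python
import random


def mixout(w, u, p, rng):
    """Swap each current weight for its pretrained value with probability p."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    out = []
    for w_i, u_i in zip(w, u):
        # With probability p, use the pretrained value instead of the current one.
        mixed = u_i if rng.random() < p else w_i
        # Shift and rescale so the expected output equals the current weight.
        out.append((mixed - p * u_i) / (1.0 - p))
    return out


current = [0.5, -1.0, 2.0]     # hypothetical fine-tuned weights
pretrained = [0.0, 0.0, 1.0]   # hypothetical pretrained weights
mixed = mixout(current, pretrained, p=0.5, rng=random.Random(0))
```

With `p=0` the function returns the current weights unchanged, which matches ordinary fine-tuning without the regularizer.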
Model Fusion via Optimal Transport
Paper • 1910.05653 • Published • 1
Note: A layer-wise fusion algorithm. It allows one-shot knowledge transfer without retraining and outperforms plain averaging. It also enables fusing models of different sizes, which facilitates compression and federated learning.
Federated Learning with Matched Averaging
Paper • 2002.06440 • Published • 2
Note: FedMA is a layer-wise federated learning algorithm for CNNs and LSTMs that averages hidden elements with similar feature extraction signatures.
Merging Models with Fisher-Weighted Averaging
Paper • 2111.09832 • Published • 1
Note: Fisher merging is a weighted averaging method for combining neural networks. It outperforms standard (unweighted) parameter averaging in model ensembling and is a cheaper alternative to traditional transfer learning methods.
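Weighted averaging of this kind can be sketched as below, assuming each model comes with a per-parameter importance estimate (a diagonal Fisher approximation). The function name, the flat-list representation, and the `eps` guard are assumptions of this sketch.

```python
def fisher_merge(params, fishers, eps=1e-8):
    """Average parameters element-wise, weighting each by its Fisher estimate.

    params:  list of parameter vectors, one per model
    fishers: list of matching non-negative importance vectors
    """
    merged = []
    for j in range(len(params[0])):
        numerator = sum(f[j] * p[j] for p, f in zip(params, fishers))
        denominator = sum(f[j] for f in fishers)
        # eps avoids division by zero when no model considers a parameter important.
        merged.append(numerator / (denominator + eps))
    return merged


# Hypothetical values: model B is three times as confident about parameter 0.
model_a, fisher_a = [1.0, 0.0], [1.0, 1.0]
model_b, fisher_b = [3.0, 4.0], [3.0, 1.0]
merged = fisher_merge([model_a, model_b], [fisher_a, fisher_b])
```

When all importance vectors are equal, the result reduces to the plain unweighted average.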
On Cross-Layer Alignment for Model Fusion of Heterogeneous Neural Networks
Paper • 2110.15538 • Published • 1
Note: CLAFusion fuses neural networks efficiently through cross-layer alignment and layer balancing. It works with networks of different depths.
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time
Paper • 2203.05482 • Published • 6
Note: Averages the weights of multiple fine-tuned models for computer vision. This yields better accuracy at no extra inference cost and improves robustness to distribution shift.
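Uniform weight averaging can be sketched as below. The dict-of-lists layout is a hypothetical stand-in for a framework state dict, and the sketch covers only plain uniform averaging of models that share one architecture.

```python
def uniform_soup(state_dicts):
    """Average several models' parameters, key by key and element by element."""
    n = len(state_dicts)
    soup = {}
    for key in state_dicts[0]:
        # zip(*...) groups the same position across all models.
        columns = zip(*(sd[key] for sd in state_dicts))
        soup[key] = [sum(values) / n for values in columns]
    return soup


# Two hypothetical fine-tuned models with identical parameter names and shapes.
model_1 = {"w": [1.0, 2.0], "b": [0.0]}
model_2 = {"w": [3.0, 4.0], "b": [1.0]}
soup = uniform_soup([model_1, model_2])
```

The result has the same shape as a single model, which is why inference cost does not grow with the number of ingredients.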
Fusing finetuned models for better pretraining
Paper • 2204.03044 • Published • 5
Note: Fuses multiple fine-tuned models by averaging their weights. The fused model is a better base for future target tasks than the original pretrained model. This was published at almost the same time as model soups, but here the goal is a generalizable base model that is then fine-tuned on different target tasks.
Diverse Weight Averaging for Out-of-Distribution Generalization
Paper • 2205.09739 • Published • 1
Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
Paper • 2208.03306 • Published • 2
Note: BTM is an algorithm to independently train expert LMs on different textual domains.
Git Re-Basin: Merging Models modulo Permutation Symmetries
Paper • 2209.04836 • Published • 1
Note: Proposes algorithms to align the weights of independently trained models by permuting units.
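The permutation symmetry behind this alignment can be sketched on a tiny two-layer ReLU network. The brute-force search below is a stand-in chosen for brevity and is only feasible for a handful of units; all function names and weights are hypothetical.

```python
from itertools import permutations


def forward(W1, W2, x):
    """Two-layer MLP with ReLU: W2 @ relu(W1 @ x), using nested lists."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]


def permute_units(W1, W2, perm):
    """Reorder hidden units: rows of W1 and the matching columns of W2."""
    return [W1[i] for i in perm], [[row[i] for i in perm] for row in W2]


def best_permutation(W1_a, W1_b):
    """Brute-force the reordering of B's units that best matches A's first layer."""
    def cost(perm):
        return sum((a - b) ** 2
                   for i, p in enumerate(perm)
                   for a, b in zip(W1_a[i], W1_b[p]))
    return min(permutations(range(len(W1_a))), key=cost)


W1_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2_a = [[1.0, 2.0, 3.0]]
# Model B computes the same function as A with its hidden units shuffled.
W1_b, W2_b = permute_units(W1_a, W2_a, (2, 0, 1))

perm = best_permutation(W1_a, W1_b)
W1_aligned, W2_aligned = permute_units(W1_b, W2_b, perm)
```

After alignment, corresponding units sit at the same index in both models, so element-wise averaging combines like with like.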
lo-fi: distributed fine-tuning without communication
Paper • 2210.11948 • Published • 1
Note: Lo-fi achieves similar or better accuracy than standard distributed training with communication when fine-tuning vision transformers on image classification and language models on text, without requiring any communication between nodes during training.
ColD Fusion: Collaborative Descent for Distributed Multitask Finetuning
Paper • 2212.01378 • Published • 1
Note: An iterative method to improve models by fusing fine-tuned models without sharing datasets. First, a base model is picked. Contributors download the base model and fine-tune it on their own datasets. The fine-tunes are then fused (averaged) to improve the base model. Repeating this process yields a stronger base model with better performance and gains in few-shot learning.
Model Ratatouille: Recycling Diverse Models for Out-of-Distribution Generalization
Paper • 2212.10445 • Published • 2
Note: A fine-tuning strategy that reuses fine-tunes of the same base model as initializations for parallel fine-tunings on the same target task. This leads to strong out-of-distribution generalization by leveraging diversity across auxiliary tasks. The technique is more robust to the choice of auxiliary tasks than other reuse strategies.
Backward Compatibility During Data Updates by Weight Interpolation
Paper • 2301.10546 • Published • 2
Note: Improves the backward compatibility of models when re-fine-tuning on a bigger dataset.
ZipIt! Merging Models from Different Tasks without Training
Paper • 2305.03053 • Published • 2
Resolving Interference When Merging Models
Paper • 2306.01708 • Published • 13
Note: TIES-Merging (Trim, Elect Sign & Merge) is a method that merges multiple models into a single multitask model. It addresses interference caused by redundant and conflicting parameters across models. Incorrect signs on the top parameters can lead to huge performance drops, so the sign resolution step helps with this.
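The three steps named in the method can be sketched on flat parameter lists as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the `keep_frac` and `lam` parameters, and the tie handling in `trim` are choices made here.

```python
def trim(tau, keep_frac):
    """Keep the largest-magnitude entries of a task vector, zero the rest."""
    k = max(1, int(round(keep_frac * len(tau))))
    threshold = sorted((abs(v) for v in tau), reverse=True)[k - 1]
    # Entries tied with the threshold are all kept (assumption of this sketch).
    return [v if abs(v) >= threshold else 0.0 for v in tau]


def ties_merge(base, finetuned, keep_frac=0.2, lam=1.0):
    """Trim task vectors, elect a sign per parameter, average agreeing values."""
    # Task vector = fine-tuned weights minus base weights, then trimmed.
    taus = [trim([f - b for f, b in zip(ft, base)], keep_frac) for ft in finetuned]
    merged = []
    for j, b in enumerate(base):
        column = [tau[j] for tau in taus]
        total = sum(column)
        sign = (total > 0) - (total < 0)  # elected sign: +1, -1, or 0
        agreeing = [v for v in column if v * sign > 0]
        delta = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + lam * delta)
    return merged


base = [0.0, 0.0, 0.0, 0.0]
model_1 = [4.0, 1.0, -2.0, 0.0]   # hypothetical fine-tuned weights
model_2 = [-3.0, 5.0, -4.0, 0.2]
merged = ties_merge(base, [model_1, model_2], keep_frac=0.75)
```

In this example the two models disagree on the sign of the first parameter. The positive direction wins the election, so only `model_1`'s value contributes there instead of the two changes cancelling out.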
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards
Paper • 2306.04488 • Published • 2
Note: A technique to align LLMs with human preferences by interpolating weights fine-tuned on different proxy rewards.
LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition
Paper • 2307.13269 • Published • 32
Unified Model for Image, Video, Audio and Language Tasks
Paper • 2307.16184 • Published • 15
Note: UnIVAL proposes model merging across different modalities (image, video, and audio-text tasks).
Model Merging by Uncertainty-Based Gradient Matching
Paper • 2310.12808 • Published • 6
Note: Explains why model merging works, when it can fail, and how it can be improved by unifying many existing merging schemes.
Averaging Weights Leads to Wider Optima and Better Generalization
Paper • 1803.05407 • Published • 2
WARM: On the Benefits of Weight Averaged Reward Models
Paper • 2401.12187 • Published • 18
Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging
Paper • 2209.14981 • Published
Early Weight Averaging meets High Learning Rates for LLM Pre-training
Paper • 2306.03241 • Published • 2
Arcee's MergeKit: A Toolkit for Merging Large Language Models
Paper • 2403.13257 • Published • 20
Evolutionary Optimization of Model Merging Recipes
Paper • 2403.13187 • Published • 51
Editing Models with Task Arithmetic
Paper • 2212.04089 • Published • 6
Merging Improves Self-Critique Against Jailbreak Attacks
Paper • 2406.07188 • Published • 3