Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models

If our project helps you, please give us a star ⭐ on GitHub and cite our paper!

📰 News

  • [2024.05.31] 🔥 Our code is released!
  • [2024.05.25] 🔥 Our checkpoints are available now!
  • [2024.05.23] 🔥 Our paper is released!

😎 What's Interesting?

Dynamic Mixture of Experts (DynMoE) incorporates (1) a novel gating method that enables each token to automatically determine the number of experts to activate, and (2) an adaptive process that automatically adjusts the number of experts during training.

Top-Any Gating
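
As a rough illustration of how a token could choose its own number of experts, the sketch below activates every expert whose similarity to the token exceeds a per-expert threshold. The cosine-similarity scoring, the threshold values, the fallback to the single best expert, and the names `top_any_gate`, `expert_embeddings`, and `thresholds` are all assumptions made for illustration. They are not taken from the DynMoE code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_any_gate(token, expert_embeddings, thresholds):
    """Return the indices of the experts activated for one token.

    An expert is activated when its similarity to the token exceeds that
    expert's threshold, so the number of active experts varies per token.
    """
    scores = [cosine(token, e) for e in expert_embeddings]
    active = [i for i, (s, t) in enumerate(zip(scores, thresholds)) if s > t]
    if not active:
        # Assumed fallback: route to the single highest-scoring expert.
        active = [max(range(len(scores)), key=scores.__getitem__)]
    return active


# Hypothetical 2-D embeddings for three experts, one threshold each.
experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
thresholds = [0.5, 0.5, 0.5]

print(top_any_gate([1.0, 0.0], experts, thresholds))   # [0, 2]
print(top_any_gate([1.0, 1.0], experts, thresholds))   # [0, 1, 2]
print(top_any_gate([-1.0, 0.0], experts, thresholds))  # [1] (fallback)
```

Unlike fixed top-k routing, nothing here fixes k in advance: the three example tokens activate two, three, and one expert respectively.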

Adaptive Training Process
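
One plausible way to adjust the number of experts during training is to track routing statistics over a window of steps, drop experts that no token activated, and add an expert when some tokens were left without any. The sketch below encodes that rule. The rule itself, the function name `adapt_experts`, and its arguments are assumptions for illustration and should not be read as the project's actual procedure.

```python
def adapt_experts(activation_counts, unrouted_tokens):
    """Decide how to resize the expert pool after a window of training steps.

    activation_counts: tokens routed to each expert during the window.
    unrouted_tokens:   tokens that activated no expert during the window.

    Returns (keep, add_new): the indices of experts to keep, and whether
    a new expert should be added.
    """
    # Assumed rule 1: drop experts that were never activated.
    keep = [i for i, count in enumerate(activation_counts) if count > 0]
    if not keep and activation_counts:
        # Never remove every expert; keep the first one as a fallback.
        keep = [0]
    # Assumed rule 2: add an expert if some tokens matched none.
    add_new = unrouted_tokens > 0
    return keep, add_new


print(adapt_experts([10, 0, 3], unrouted_tokens=5))  # ([0, 2], True)
print(adapt_experts([4, 1], unrouted_tokens=0))      # ([0, 1], False)
```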

💡 Model Details

  • 🤔 DynMoE-Phi-2 is an MoE model with dynamic top-k gating, fine-tuned from LanguageBind/MoE-LLaVA-Phi2-Stage2.
  • 🚀 Our DynMoE-Phi-2-2.7B has 5.3B parameters in total, but only 3.4B are activated! (average top-k = 1.68)
  • ⌛ With the DynMoE tuning stage, we can complete training on 8 A100 GPUs within 2 days.
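
The parameter figures above can be related with a little arithmetic. The sketch below computes the share of parameters active per token from the two totals quoted in the list, and shows how an average top-k can be computed from per-token expert counts. The helper names and the per-token counts are hypothetical; only the 5.3B and 3.4B figures come from this card.

```python
def active_fraction(active_params_b, total_params_b):
    """Share of parameters used per token (both arguments in billions)."""
    return active_params_b / total_params_b


def average_top_k(experts_per_token):
    """Mean number of experts activated per token."""
    return sum(experts_per_token) / len(experts_per_token)


# Figures quoted above: 3.4B activated out of 5.3B total.
print(f"{active_fraction(3.4, 5.3):.1%}")  # 64.2%

# Hypothetical per-token expert counts, for illustration only.
print(average_top_k([1, 2, 2, 1, 3]))  # 1.8
```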

👍 Acknowledgement

We are grateful for the following awesome projects:

🔒 License

This project is released under the Apache-2.0 license as found in the LICENSE file.

✏️ Citation

@misc{guo2024dynamic,
      title={Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models}, 
      author={Yongxin Guo and Zhenglin Cheng and Xiaoying Tang and Tao Lin},
      year={2024},
      eprint={2405.14297},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}