---
license: apache-2.0
---
|
# Wtf is a MoEification?! |
|
Turns out, you can slice the individual MLP layers of a dense language model along their intermediate dimension into equal-sized splits of experts.
|
|
|
What I did here (there's a rough code sketch after this list):
|
- Split the MLP projections (gate, up, and down) into the number of total experts you want (in this case, I just went with 8 experts).
|
- Multiply the parameters of each expert's down-projection by the total number of experts. With the router zeroed out (next step), every expert gets an equal routing weight of 1/8, so this scaling makes the linearly averaged expert outputs match the magnitude of the original dense MLP's output.
|
- Initialize the router layers with zeros, so expert usage is perfectly even by default and there are no unintentional biases as a consequence of random initialization being done the normal way.
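
Concretely, the conversion looks something like this minimal PyTorch sketch. It assumes a LLaMA-style MLP with bias-free `gate_proj`/`up_proj`/`down_proj` linears; the function name, shapes, and module wrapping are illustrative rather than the exact code used for this model:

```python
import torch
import torch.nn as nn


def moefy_mlp(gate_proj: nn.Linear, up_proj: nn.Linear, down_proj: nn.Linear,
              num_experts: int = 8):
    """Slice one dense (LLaMA-style) MLP into `num_experts` equal experts
    along the intermediate dimension and build a zero-initialized router."""
    hidden = gate_proj.in_features          # model hidden size
    intermediate = gate_proj.out_features   # dense MLP intermediate size
    assert intermediate % num_experts == 0, "intermediate size must split evenly"
    chunk = intermediate // num_experts

    experts = nn.ModuleList()
    for i in range(num_experts):
        rows = slice(i * chunk, (i + 1) * chunk)
        gate = nn.Linear(hidden, chunk, bias=False)
        up = nn.Linear(hidden, chunk, bias=False)
        down = nn.Linear(chunk, hidden, bias=False)
        with torch.no_grad():
            # gate/up: copy this expert's slice of the intermediate rows
            gate.weight.copy_(gate_proj.weight[rows, :])
            up.weight.copy_(up_proj.weight[rows, :])
            # down: copy the matching columns, scaled by num_experts so the
            # uniformly-weighted (1/num_experts) sum of expert outputs matches
            # the original dense MLP's output
            down.weight.copy_(down_proj.weight[:, rows] * num_experts)
        experts.append(nn.ModuleDict({"gate_proj": gate, "up_proj": up, "down_proj": down}))

    # Zero-initialized router: softmax over all-zero logits is uniform,
    # so every expert starts out with exactly equal usage
    router = nn.Linear(hidden, num_experts, bias=False)
    nn.init.zeros_(router.weight)
    return experts, router
```

The resulting experts and router can then be dropped into a Mixtral-style sparse MoE block in place of the original dense MLP.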
|
|
|
As a result, the model behaves completely coherently when all 8 experts are activated (i.e., `experts_per_tok` is set to 8).
|
|
|
 |
|
|
|
With 4 experts activated, it's... far less coherent. |
|
|
|
 |
|
|
|
# Ok but why? |
|
|
|
I am interested in continuing to train this in a way that lets it naturally handle variable expert counts and learn to balance the features across experts.
|
If this works, we could potentially teach the model to spend less computation on tokens that are trivial to predict, and more when it's necessary.
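
Purely as an illustration of what variable expert counts could look like mechanically (this is not something the current model does), one option would be a threshold-style router that keeps only as many experts as it needs to cover most of the routing probability mass:

```python
import torch


def threshold_routing(router_logits: torch.Tensor, mass: float = 0.8) -> torch.Tensor:
    """Illustrative variable-compute routing rule (not implemented in this repo):
    per token, keep the smallest set of experts whose softmax probability mass
    reaches `mass`, so easy tokens can get by with fewer experts.

    router_logits: [num_tokens, num_experts]
    returns: renormalized routing weights, zero for the unused experts.
    """
    probs = torch.softmax(router_logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # keep an expert if the mass accumulated *before* it is still below the target;
    # this always keeps at least the top-1 expert
    keep = (cumulative - sorted_probs) < mass
    mask = torch.zeros_like(probs).scatter(-1, sorted_idx, keep.to(probs.dtype))
    weights = probs * mask
    return weights / weights.sum(dim=-1, keepdim=True)
```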
|
|
|
# Also, thanks to StefanGliga for giving me the idea while we were discussing [this paper](https://arxiv.org/abs/2303.01610) :3