---
datasets:
- crumb/Wizard-EvolInstruct70k-k4
language:
- en
tags:
- switch_transformers
- llama
- MoE
---
This is the first test switchllama model from MoLora2. It starts from OpenLlama-3b-v2 and adds 4 experts to the MLP blocks of the model. The experts were trained with QLoRA, with adapters on gate_proj, up_proj, and down_proj trained individually and then merged (in 4-bit). The 4 expert models were trained on clusters of crumb/Wizard-EvolInstruct70k-k4; their trained MLP weights were then transplanted into a model initialized from OpenLlama-3b with 4 switch-transformer-style experts. The routers are not trained in this version of the model and are randomly initialized.
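To illustrate the layout described above, here is a minimal sketch (not the actual modeling code, which is withheld below) of a switch-transformer-style MoE MLP block: 4 LLaMA-style MLP experts (gate_proj / up_proj / down_proj) behind a randomly initialized top-1 router. All class names and dimensions are illustrative assumptions.

```python
# Hypothetical sketch of the switch-style MoE MLP block; not the repo's modeling code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LlamaStyleMLP(nn.Module):
    """One expert: a standard LLaMA MLP with SiLU gating."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class SwitchMoEMLP(nn.Module):
    """Switch-transformer-style top-1 routing over a set of MLP experts.

    In the transplant described above, each expert's weights would be loaded
    from a separately QLoRA-trained-and-merged MLP, while the router is left
    at its random initialization.
    """

    def __init__(self, hidden_size: int, intermediate_size: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts, bias=False)  # random init, untrained
        self.experts = nn.ModuleList(
            LlamaStyleMLP(hidden_size, intermediate_size) for _ in range(num_experts)
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        batch, seq_len, hidden = hidden_states.shape
        flat = hidden_states.reshape(-1, hidden)          # (tokens, hidden)
        probs = F.softmax(self.router(flat), dim=-1)      # (tokens, num_experts)
        top_prob, top_idx = probs.max(dim=-1)             # top-1 expert per token

        out = torch.zeros_like(flat)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                # Scale each token's expert output by its router probability,
                # as in Switch Transformers.
                out[mask] = expert(flat[mask]) * top_prob[mask].unsqueeze(-1)
        return out.reshape(batch, seq_len, hidden)
```

A block like this would replace each dense MLP in the base model, e.g. `SwitchMoEMLP(hidden_size, intermediate_size, num_experts=4)` with the dimensions taken from the OpenLlama-3b config.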
Modeling code will not be included until this proof-of-concept is fully trained.