---
datasets:
  - crumb/Wizard-EvolInstruct70k-k4
language:
  - en
tags:
  - switch_transformers
  - llama
  - MoE
---

This is the first test SwitchLlama model from MoLora2. It starts from OpenLlama-3b-v2 and adds 4 experts to the MLP blocks of the model. The experts were trained with QLoRA, with adapters trained individually on gate_proj, up_proj, and down_proj and then merged (in 4-bit). Each of the 4 expert models was trained on a cluster from crumb/Wizard-EvolInstruct70k-k4; their trained MLP weights were then transplanted into a model initialized from OpenLlama-3b with 4 Switch Transformer-style experts. The routers are not trained in this version of the model and are randomly initialized.

Modeling code will not be included until this proof-of-concept is fully trained.
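
For orientation only, below is a minimal sketch of the kind of switch-style MLP block described above. This is not the actual modeling code for this checkpoint; the module names, shapes, LLaMA-style gated MLP, and top-1 routing are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LlamaStyleMLP(nn.Module):
    """One expert: a LLaMA-style gated MLP (gate_proj, up_proj, down_proj)."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class SwitchMLP(nn.Module):
    """Top-1 (switch) routing over a small set of expert MLPs.

    The router is randomly initialized and untrained, matching the
    description of this checkpoint above.
    """

    def __init__(self, hidden_size: int, intermediate_size: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts, bias=False)  # random init, untrained
        self.experts = nn.ModuleList(
            LlamaStyleMLP(hidden_size, intermediate_size) for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (batch, seq_len, hidden_size)
        logits = self.router(x)              # (batch, seq_len, num_experts)
        probs = logits.softmax(dim=-1)       # router probabilities
        top1 = probs.argmax(dim=-1)          # top-1 expert index per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i                 # tokens routed to expert i
            if mask.any():
                # Switch-Transformer-style scaling by the selected expert's probability
                scale = probs[..., i][mask].unsqueeze(-1)
                out[mask] = scale * expert(x[mask])
        return out
```

In the scheme described above, the per-expert MLP weights would come from the merged QLoRA adapters, while the `router` weights are left at their random initialization.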