Crystalcareai committed on
Commit
b78d608
1 Parent(s): ce5bb18

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -19,7 +19,7 @@ Coming from a background as an acting teacher and coach, I saw parallels between
 
  I curated a dataset, named Mixture of Data (MoD), from various sources, including Bagel, OpenHermes, and many more, totaling about 780,000 distinct ShareGPT conversations. This dataset aims to encourage MoE models to develop their own distinct experts.
 
- After training Qwen1.5-7b on 100k random samples from MoD over three epochs and merging the fine-tuned model 8x, I used an approach utilizing a random gate, without specialized fine-tuning done to any of the 8 experts. The result was a model that initially made no sense, lacking a base model and clear guidance on expert usage.
+ After training Qwen1.5-7b on 100k random samples from MoD over four epochs and merging the fine-tuned model 8x, I used an approach utilizing a random gate, without specialized fine-tuning done to any of the 8 experts. The result was a model that initially made no sense, lacking a base model and clear guidance on expert usage.
 
  Despite challenges, such as training interruptions via cuda errors with Runpod , the model showed promising adaptability to the rest of the MoD dataset, even with limited training (0.45/4 planned epochs were completed before my compute budget ran out). While I haven't been able to benchmark it fully (I will when I can get this runpod situation sorted) it appears to perform comparably to Mixtral in (admittedly naive) preliminary reasoning tests.
 