Doge 160M checkpoint

Doge uses wsd_scheduler as its learning rate scheduler, which splits training into three stages: warmup, stable, and decay. Because the learning rate is held constant during the stable stage, training can continue on any new dataset from any stable-stage checkpoint without spikes in the training loss.

Here are the initial learning rates required to continue training from each checkpoint:

Doge-20M: 8e-3
Doge-60M: 6e-3
Doge-160M: 4e-3
Doge-320M: 2e-3

Model	Learning Rate	Schedule	Warmup Steps	Stable Steps
Doge-20M	8e-3	wsd_scheduler	800	6400
Doge-60M	6e-3	wsd_scheduler	1600	12800
Doge-160M	4e-3	wsd_scheduler	2400	19200
Doge-320M	2e-3	wsd_scheduler	3200	25600
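
To make the three stages concrete, here is a minimal sketch of a warmup-stable-decay schedule in plain Python. It is not the actual wsd_scheduler implementation: the function name wsd_lr, the linear warmup and linear decay shapes, and the decay_steps value are all assumptions for illustration. Only the peak learning rate, warmup steps, and stable steps come from the Doge-160M row of the table.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps):
    """Learning rate at `step` under an illustrative warmup-stable-decay schedule."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the peak learning rate (assumed shape).
        return peak_lr * step / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable: hold the peak learning rate constant.
        return peak_lr
    # Decay: ramp linearly down to 0 (assumed shape; the source does not specify it).
    done = step - warmup_steps - stable_steps
    return peak_lr * max(0.0, 1.0 - done / decay_steps)


# peak_lr, warmup_steps, stable_steps: Doge-160M row of the table.
# decay_steps: hypothetical placeholder, not given in the source.
DOGE_160M = dict(peak_lr=4e-3, warmup_steps=2400, stable_steps=19200, decay_steps=2400)

for step in (0, 1200, 2400, 12000, 21600, 22800, 24000):
    print(step, wsd_lr(step, **DOGE_160M))
```

Under these assumptions, every step from 2400 up to (but not including) 21600 returns the same rate of 4e-3, which is why a checkpoint saved anywhere in that range can be resumed at the listed learning rate.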

