Doge 160M checkpoint

NOTE: This model is still training. The real-time training logs are available on wandb.


wsd_scheduler

Doge uses wsd_scheduler as its learning rate scheduler. It divides training into three stages: warmup, stable, and decay. Because the learning rate is held constant during the stable stage, training can be continued on any new dataset from any stable-stage checkpoint without loss spikes.

Here are the initial learning rates required to continue training from each checkpoint, together with the schedule and step counts used:

| Model | Learning Rate | Schedule | Warmup Steps | Stable Steps |
|---|---|---|---|---|
| Doge-20M | 8e-3 | wsd_scheduler | 800 | 6400 |
| Doge-60M | 6e-3 | wsd_scheduler | 1600 | 12800 |
| Doge-160M | 4e-3 | wsd_scheduler | 2400 | 19200 |
| Doge-320M | 2e-3 | wsd_scheduler | 3200 | 25600 |
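For scripting a continued-training run, the values listed above can be kept in a small lookup. The sketch below is only a convenience wrapper: the names `DOGE_WSD_SETTINGS` and `continuation_lr` are hypothetical, and it assumes that resuming from a stable-stage checkpoint means starting at the listed learning rate.

```python
# Values copied from the listing above; the dict layout is an illustrative choice.
DOGE_WSD_SETTINGS = {
    "Doge-20M": {"learning_rate": 8e-3, "warmup_steps": 800, "stable_steps": 6400},
    "Doge-60M": {"learning_rate": 6e-3, "warmup_steps": 1600, "stable_steps": 12800},
    "Doge-160M": {"learning_rate": 4e-3, "warmup_steps": 2400, "stable_steps": 19200},
    "Doge-320M": {"learning_rate": 2e-3, "warmup_steps": 3200, "stable_steps": 25600},
}


def continuation_lr(model_name: str) -> float:
    """Return the listed initial learning rate for continuing from a checkpoint."""
    return DOGE_WSD_SETTINGS[model_name]["learning_rate"]


for name, cfg in DOGE_WSD_SETTINGS.items():
    # In every listed row, the stable stage is 8x as long as the warmup.
    ratio = cfg["stable_steps"] // cfg["warmup_steps"]
    print(name, continuation_lr(name), ratio)
```

In every listed row the stable stage is eight times as long as the warmup, and the learning rate decreases as model size grows.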
Model size: 153M params (Safetensors, tensor type F32)