Ilyas Chahed
committed on
Update README.md
README.md CHANGED
@@ -153,8 +153,7 @@ Falcon-Mamba-7B was trained on 256 H100 80GB GPUs for the majority of the training
 | Optimizer | AdamW | |
 | Max learning rate | 6.4e-4 | Following a WSD (warmup-stable-decay) learning rate schedule |
 | Weight decay | 1e-1 | |
-
-| Batch size | 2048-4096 | |
+| Batch size | 2048 | |
The model was trained with the AdamW optimizer, a WSD (warmup-stable-decay) learning rate schedule, and a batch size rampup from \\(b_{\mathrm{min}}=128\\) to \\(b_{\mathrm{max}}=2048\\) during the first 50 GT of training. In the stable phase we used a maximal learning rate \\(\eta_{\mathrm{max}}=6.4 \times 10^{-4}\\), then decayed it to the minimal value \\(\eta_{\mathrm{min}}=\frac{\eta_{\mathrm{max}}}{256}\\) with an exponential schedule over 500 GT. During the rampup we also applied *BatchScaling*, rescaling the learning rate \\(\eta\\) so that the Adam noise temperature \\(T_{\mathrm{noise}}\equiv\frac{\eta}{\sqrt{b}}\\) is kept constant.
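
To make the schedule concrete, below is a minimal Python sketch (not the actual training code) of a WSD learning-rate schedule combined with a batch-size rampup and BatchScaling as described above. The linear rampup shape, the point where the decay starts, and the names `batch_size`, `learning_rate`, and `decay_start_gt` are illustrative assumptions; only the numeric hyperparameters come from the text.

```python
# Sketch of the WSD schedule with batch-size rampup and "BatchScaling":
# the learning rate is rescaled so that eta / sqrt(batch_size) stays constant
# during the rampup. Rampup shape and decay start point are assumptions.
import math

ETA_MAX = 6.4e-4          # maximal learning rate (stable phase)
ETA_MIN = ETA_MAX / 256   # minimal learning rate at the end of the decay
B_MIN, B_MAX = 128, 2048  # batch-size rampup bounds
RAMPUP_GT = 50            # rampup length, in gigatokens (GT)
DECAY_GT = 500            # decay length, in gigatokens (GT)


def batch_size(gt: float) -> int:
    """Batch size at `gt` gigatokens (a linear rampup is assumed here)."""
    if gt >= RAMPUP_GT:
        return B_MAX
    return int(B_MIN + (B_MAX - B_MIN) * gt / RAMPUP_GT)


def learning_rate(gt: float, decay_start_gt: float) -> float:
    """Learning rate at `gt` gigatokens under the WSD schedule."""
    if gt < RAMPUP_GT:
        # BatchScaling: keep eta / sqrt(b) constant, i.e. scale eta by
        # sqrt(b / B_MAX) relative to ETA_MAX while the batch size ramps up.
        return ETA_MAX * math.sqrt(batch_size(gt) / B_MAX)
    if gt < decay_start_gt:
        return ETA_MAX  # stable phase
    # Exponential decay from ETA_MAX down to ETA_MIN over DECAY_GT gigatokens.
    progress = min((gt - decay_start_gt) / DECAY_GT, 1.0)
    return ETA_MAX * (ETA_MIN / ETA_MAX) ** progress


if __name__ == "__main__":
    # decay_start_gt=5000 is a placeholder; the excerpt does not state when the decay begins.
    for gt in (0, 25, 50, 1000, 5000, 5250, 5500):
        print(gt, batch_size(gt), learning_rate(gt, decay_start_gt=5000))
```

In this sketch the LR warmup of the WSD schedule falls out of BatchScaling itself (the rate starts at \\(\eta_{\mathrm{max}}\sqrt{128/2048}=\eta_{\mathrm{max}}/4\\) and rises with the batch size); whether the actual training used an additional explicit warmup is not stated in the excerpt.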