Ilyas Chahed
committed on
Update README.md
README.md CHANGED
@@ -153,8 +153,7 @@ Falcon-Mamba-7B was trained on 256 H100 80GB GPUs for the majority of the training
 | Optimizer | AdamW | |
 | Max learning rate | 6.4e-4 | Following a WSD (warmup-stable-decay) learning rate schedule |
 | Weight decay | 1e-1 | |
-
-| Batch size | 2048-4096 | |
+| Batch size | 2048 | |
The model was trained with the AdamW optimizer, a WSD (warmup-stable-decay) learning rate schedule, and a batch size rampup from \\(b_{\mathrm{min}}=128\\) to \\(b_{\mathrm{max}}=2048\\) during the first 50 GT of training. In the stable phase we used a maximal learning rate \\(\eta_{\mathrm{max}}=6.4 \times 10^{-4}\\), then decayed it to the minimal value \\(\eta_{\mathrm{min}}=\frac{\eta_{\mathrm{max}}}{256}\\) with an exponential schedule over 500 GT. During the rampup we also applied *BatchScaling*, rescaling the learning rate \\(\eta\\) so that the Adam noise temperature \\(T_{\mathrm{noise}}\equiv\frac{\eta}{\sqrt{b}}\\) is kept constant.
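
To make the schedule concrete, below is a minimal Python sketch (not the actual training code) of a WSD learning-rate schedule combined with a batch-size rampup and BatchScaling as described above. The linear rampup shape, the point where the decay starts, and the names `batch_size`, `learning_rate`, and `decay_start_gt` are illustrative assumptions; only the numeric hyperparameters come from the text.

```python
# Sketch of the WSD schedule with batch-size rampup and "BatchScaling":
# the learning rate is rescaled so that eta / sqrt(batch_size) stays constant
# during the rampup. Rampup shape and decay start point are assumptions.
import math

ETA_MAX = 6.4e-4          # maximal learning rate (stable phase)
ETA_MIN = ETA_MAX / 256   # minimal learning rate at the end of the decay
B_MIN, B_MAX = 128, 2048  # batch-size rampup bounds
RAMPUP_GT = 50            # rampup length, in gigatokens (GT)
DECAY_GT = 500            # decay length, in gigatokens (GT)


def batch_size(gt: float) -> int:
    """Batch size at `gt` gigatokens (a linear rampup is assumed here)."""
    if gt >= RAMPUP_GT:
        return B_MAX
    return int(B_MIN + (B_MAX - B_MIN) * gt / RAMPUP_GT)


def learning_rate(gt: float, decay_start_gt: float) -> float:
    """Learning rate at `gt` gigatokens under the WSD schedule."""
    if gt < RAMPUP_GT:
        # BatchScaling: keep eta / sqrt(b) constant, i.e. scale eta by
        # sqrt(b / B_MAX) relative to ETA_MAX while the batch size ramps up.
        return ETA_MAX * math.sqrt(batch_size(gt) / B_MAX)
    if gt < decay_start_gt:
        return ETA_MAX  # stable phase
    # Exponential decay from ETA_MAX down to ETA_MIN over DECAY_GT gigatokens.
    progress = min((gt - decay_start_gt) / DECAY_GT, 1.0)
    return ETA_MAX * (ETA_MIN / ETA_MAX) ** progress


if __name__ == "__main__":
    # decay_start_gt=5000 is a placeholder; the excerpt does not state when the decay begins.
    for gt in (0, 25, 50, 1000, 5000, 5250, 5500):
        print(gt, batch_size(gt), learning_rate(gt, decay_start_gt=5000))
```

In this sketch the LR warmup of the WSD schedule falls out of BatchScaling itself (the rate starts at \\(\eta_{\mathrm{max}}\sqrt{128/2048}=\eta_{\mathrm{max}}/4\\) and rises with the batch size); whether the actual training used an additional explicit warmup is not stated in the excerpt.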