Ilyas Chahed committed
Commit 34e00ff · verified · Parent(s): 7191604

Update README.md

Files changed (1): README.md +1 -2
README.md CHANGED
@@ -153,8 +153,7 @@ Falcon-Mamba-7B was trained on 256 H100 80GB GPUs for the majority of the training
  | Optimizer | AdamW | |
  | Max learning rate | 6.4e-4 | Following a WSD (warmup-stable-decay) learning rate schedule |
  | Weight decay | 1e-1 | |
- | Z-loss | 1e-4 | |
- | Batch size | 2048-4096 | |
+ | Batch size | 2048 | |
 
 
  The model was trained with the AdamW optimizer, a WSD (warmup-stable-decay) learning rate schedule, and a batch size ramp-up from \\(b_{\mathrm{min}}=128\\) to \\(b_{\mathrm{max}}=2048\\) during the first 50 GT of training. In the stable phase we used a maximal learning rate \\(\eta_{\mathrm{max}}=6.4 \times 10^{-4}\\) and decayed it to the minimal value \\(\eta_{\mathrm{min}}=\frac{\eta_{\mathrm{max}}}{256}\\) with an exponential schedule over 500 GT. We also applied *BatchScaling* during the ramp-up, rescaling the learning rate \\(\eta\\) so that the Adam noise temperature \\(T_{\mathrm{noise}}\equiv\frac{\eta}{\sqrt{b}}\\) is kept constant.
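As a minimal sketch of the schedule described in the updated paragraph, the snippet below puts the numbers together in Python: the 128 → 2048 batch ramp over the first 50 GT, the BatchScaling rule \\(\eta \propto \sqrt{b}\\) that keeps \\(T_{\mathrm{noise}}=\eta/\sqrt{b}\\) constant, and the exponential decay from \\(\eta_{\mathrm{max}}\\) to \\(\eta_{\mathrm{max}}/256\\) over 500 GT. The linear shape of the ramp and the `STABLE_END_GT` phase boundary are illustrative assumptions, not values from the card.

```python
import math

# Values taken from the model card.
B_MIN, B_MAX = 128, 2048      # batch-size ramp-up endpoints
ETA_MAX = 6.4e-4              # stable-phase (maximal) learning rate
ETA_MIN = ETA_MAX / 256       # minimal learning rate after decay
RAMP_GT = 50                  # ramp-up length, in gigatokens (GT)
DECAY_GT = 500                # decay length, in GT

# Assumption for illustration only: where the stable phase ends.
STABLE_END_GT = 5000

def batch_size(tokens_gt: float) -> int:
    """Batch-size ramp from B_MIN to B_MAX; the linear shape is assumed."""
    if tokens_gt >= RAMP_GT:
        return B_MAX
    return round(B_MIN + (B_MAX - B_MIN) * tokens_gt / RAMP_GT)

def learning_rate(tokens_gt: float) -> float:
    """WSD schedule with BatchScaling during the ramp-up."""
    if tokens_gt < RAMP_GT:
        # BatchScaling: keep T_noise = eta / sqrt(b) at its stable-phase
        # value ETA_MAX / sqrt(B_MAX), i.e. eta grows like sqrt(b).
        return ETA_MAX * math.sqrt(batch_size(tokens_gt) / B_MAX)
    if tokens_gt < STABLE_END_GT:
        return ETA_MAX  # stable phase
    # Exponential decay from ETA_MAX down to ETA_MIN over DECAY_GT.
    t = min((tokens_gt - STABLE_END_GT) / DECAY_GT, 1.0)
    return ETA_MAX * (ETA_MIN / ETA_MAX) ** t

# Example: at the start of training, eta = 6.4e-4 * sqrt(128/2048) = 1.6e-4.
print(batch_size(0.0), learning_rate(0.0))
```

Keeping \\(\eta/\sqrt{b}\\) constant means the learning rate grows with the square root of the batch size, so the gradient-noise level seen by Adam stays roughly unchanged while the batch ramps up.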