Commit 5bfedd9 · xrsrke committed · Parent(s): 8e93807

add fp8 loss curve
assets/data/fp8/.DS_Store
ADDED: binary file (6.15 kB)
assets/data/fp8/fp8_training_loss_curves.html
ADDED: diff too large to render
dist/assets/data/fp8/fp8_training_loss_curves.html
ADDED: diff too large to render
dist/index.html
CHANGED

@@ -2275,6 +2275,8 @@
 
 <p>We know that instability increases as learning rates rise for a fixed model size<d-cite bibtex-key="wortsman2023smallscaleproxieslargescaletransformer"></d-cite>, making FP8 pretraining particularly tricky.</p>
 
+<iframe class="l-body-outset" id="plotFP8Loss" src="assets/data/fp8/fp8_training_loss_curves.html" width="90%" scrolling="no" frameborder="0"></iframe>
+
 <p>The first, successful, very large scale training with FP8 mixed precision was publicly reported on DeepSeek-V3. The authors carefully analyzed each operation of the forward pass (Fprop) as well as the activation (Dgrad) and weight (Wgrad) backward pass. Similar to BF16 mixed precision training, some aggregation and master weights are kept in higher precision while the operations themselves are performed in FP8. </p>
 
 <p><img alt="image.png" src="/assets/images/fp8_diagram.png" /></p>
src/index.html
CHANGED

@@ -2275,6 +2275,8 @@
 
 <p>We know that instability increases as learning rates rise for a fixed model size<d-cite bibtex-key="wortsman2023smallscaleproxieslargescaletransformer"></d-cite>, making FP8 pretraining particularly tricky.</p>
 
+<iframe class="l-body-outset" id="plotFP8Loss" src="/assets/data/fp8/fp8_training_loss_curves.html" width="90%" scrolling="no" frameborder="0"></iframe>
+
 <p>The first, successful, very large scale training with FP8 mixed precision was publicly reported on DeepSeek-V3. The authors carefully analyzed each operation of the forward pass (Fprop) as well as the activation (Dgrad) and weight (Wgrad) backward pass. Similar to BF16 mixed precision training, some aggregation and master weights are kept in higher precision while the operations themselves are performed in FP8. </p>
 
 <p><img alt="image.png" src="/assets/images/fp8_diagram.png" /></p>
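The paragraphs added in these hunks describe the FP8 recipe only in prose: the GEMMs (Fprop, Dgrad, Wgrad) run in FP8 while aggregations and master weights stay in higher precision. Below is a minimal PyTorch sketch of that idea, assuming simple per-tensor scaling and emulating the FP8 GEMM by upcasting; the helpers to_fp8 and fp8_matmul are illustrative names, not code from this repo, and DeepSeek-V3's actual recipe reportedly uses finer-grained scaling with dedicated FP8 kernels.

import torch

FP8_E4M3_MAX = 448.0  # largest finite value in the e4m3 format

def to_fp8(x):
    # Per-tensor scaling (an assumption here): stretch the tensor to
    # fill the e4m3 range, cast, and return the FP8 tensor + its scale.
    scale = FP8_E4M3_MAX / x.abs().max().clamp(min=1e-12)
    return (x * scale).to(torch.float8_e4m3fn), scale

def fp8_matmul(a, b):
    # Emulated FP8 GEMM; Fprop/Dgrad/Wgrad would each use one of these.
    a8, sa = to_fp8(a)
    b8, sb = to_fp8(b)
    # PyTorch has no plain float8 matmul, so upcast for the reference
    # multiply; real FP8 kernels accumulate in higher precision anyway.
    out = a8.to(torch.float32) @ b8.to(torch.float32)
    return (out / (sa * sb)).to(torch.bfloat16)  # undo both scales

# Master weights stay in FP32, as in BF16 mixed-precision training;
# only the matmul inputs are quantized to FP8.
master_w = torch.randn(512, 256, dtype=torch.float32)
x = torch.randn(32, 512, dtype=torch.float32)
y = fp8_matmul(x, master_w)
print(y.dtype, y.shape)  # torch.bfloat16 torch.Size([32, 256])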