bourdoiscatie committed
Update dist/index.html
dist/index.html CHANGED +1 -1
@@ -58,7 +58,7 @@
 
 That's why we've decided to focus on the T5 <d-cite bibtex-key="JMLR:v21:20-074"></d-cite>.<br><br>
 
-This article presents the optimisations we have implemented to efficiently pre-train a T5 with 147M
+This article presents the optimisations we have implemented to efficiently pre-train a T5 in French with 147M parameters in a reasonable time (1,461 H for 419B tokens) and with limited resources (1 single A100; i.e. a computing budget of around 2,200 euros).
 To achieve this, we designed CUDA/Triton kernels to make Flash Attention compatible with T5 and provide linear inference, thus extending the context size that can be taken into account by the model.<br><br>
 <strong>The pre-training code is available in our <a class="link" href="https://github.com/catie-aq/flashT5">GitHub repository</a> under Apache-2.0 license and weights on our <a class="link" href="https://hf.co/CATIE-AQ">Hugging Face</a> account.</strong>
 <p class="width_125"><br><br><br></p>