Update README.md
README.md
CHANGED
@@ -311,8 +311,8 @@ This adjustment resulted in a total of 2.68 trillion tokens, distributed as outl

![lang distrib](./images/corpus_languages_1.1.png)

-The pretraining corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 53
-Following this, Starcoder provides 13
+The pretraining corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 53.05% of the total tokens.
+Following this, Starcoder provides 13.67%, and FineWeb-Edu (350BT subset) adds 10.24%. The next largest sources are HPLT at 4.21% and French-PD at 3.59%.
Other notable contributions include MaCoCu, Legal-ES, and EurLex, each contributing around 1.72% to 1.41%.
These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
The remaining 10% comes from smaller sources in various languages.
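
For a rough sense of scale, here is a small back-of-the-envelope sketch (not part of the repository) that converts the shares quoted in the added lines into approximate token counts, assuming the 2.68 trillion-token total mentioned in the hunk context. Source names and percentages are taken verbatim from the updated paragraph; everything else is illustrative.

```python
# Back-of-the-envelope conversion of the README's corpus shares into
# approximate token counts. The 2.68T total and the percentages come from
# the diff above; the rest of this script is purely illustrative.
TOTAL_TOKENS = 2.68e12  # 2.68 trillion tokens

shares_pct = {
    "Colossal OSCAR": 53.05,
    "Starcoder": 13.67,
    "FineWeb-Edu (350BT subset)": 10.24,
    "HPLT": 4.21,
    "French-PD": 3.59,
}

for source, pct in shares_pct.items():
    approx_billions = TOTAL_TOKENS * pct / 100 / 1e9
    print(f"{source}: {pct:.2f}% ~= {approx_billions:,.0f}B tokens")
```

Under these assumptions, Colossal OSCAR alone works out to roughly 1.42 trillion tokens of the corpus.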