Update README.md
Add shortened and more correct version of tokens
README.md
CHANGED
@@ -35,6 +35,14 @@ Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama
 | Math/Code | 5,951,964,497 | 6.6% |
 | **Total** | **89,653,165,085** | **100%** |
 
+| Sub-corpus | # Tokens | Percentage |
+|-----------|------------------|------------|
+| Greek | 56.7 B | 62.3 % |
+| English | 21.0 B | 23.1 % |
+| Parallel | 5.5 B | 6.0 % |
+| Math/Code | 7.8 B | 8.6 % |
+| **Total** | 91 B | **100%** |
+
 Chosen subsets of the 89.65 billion corpus were upsampled resulting in a size of **110 billion tokens**.
 
 
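The percentages in the added table can be re-derived from its rounded token counts. The snippet below is a minimal sketch of that check, not part of the commit: the names `tokens_b` and `shares` are hypothetical, and the inputs are the rounded figures (in billions) copied from the added rows, so it only confirms that the new table is internally consistent.

```python
# Rounded token counts in billions, copied from the rows added in this diff.
# The dictionary and function names are illustrative, not from the repository.
tokens_b = {
    "Greek": 56.7,
    "English": 21.0,
    "Parallel": 5.5,
    "Math/Code": 7.8,
}


def shares(counts):
    """Return each sub-corpus share as a percentage rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total_b = round(sum(tokens_b.values()), 1)  # rounded to avoid float noise
pct = shares(tokens_b)

print(f"Total: {total_b} B")
for name, value in pct.items():
    print(f"{name}: {value} %")
```

Under these rounded inputs the total and the per-row shares agree with the added table. The 91 B total is a separate figure from the 89.65 billion quoted in the unchanged sentence that follows the table, which this check does not reconcile.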