droussis committed on
Commit 1bb95bb · verified · 1 Parent(s): 9f0fe7c

Update README.md

Add shortened and more correct version of tokens

Files changed (1)
  1. README.md +8 -0
README.md CHANGED
@@ -35,6 +35,14 @@ Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama
  | Math/Code | 5,951,964,497 | 6.6% |
  | **Total** | **89,653,165,085** | **100%** |
 
+ | Sub-corpus | # Tokens | Percentage |
+ |-----------|------------------|------------|
+ | Greek | 56.7 B | 62.3 % |
+ | English | 21.0 B | 23.1 % |
+ | Parallel | 5.5 B | 6.0 % |
+ | Math/Code | 7.8 B | 8.6 % |
+ | **Total** | 91 B | **100%** |
+
  Chosen subsets of the 89.65 billion corpus were upsampled resulting in a size of **110 billion tokens**.
 
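
The percentages in the added table can be re-derived from its token counts. The following is a minimal sketch for checking that arithmetic, not part of the commit; the `tokens_b` dictionary and the `shares` helper are hypothetical names introduced here for illustration, and the figures are the rounded values (in billions) copied from the added rows.

```python
# Token counts in billions, copied from the rows added in this commit.
# The dictionary name and the helper below are illustrative, not from the repo.
tokens_b = {
    "Greek": 56.7,
    "English": 21.0,
    "Parallel": 5.5,
    "Math/Code": 7.8,
}


def shares(counts):
    """Return (total, {name: percentage rounded to one decimal place})."""
    total = sum(counts.values())
    return total, {name: round(100 * n / total, 1) for name, n in counts.items()}


total, pct = shares(tokens_b)
print(f"total = {total:.1f} B")
for name, p in pct.items():
    print(f"{name}: {p} %")
```

With the rounded counts above, the per-corpus shares come out to the values listed in the added table, and the counts sum to the stated 91 B total. The unchanged sentence below the table still cites 89.65 billion, the total of the older table.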