Expand data source table (#6), opened by RaymondAISG
README.md CHANGED
@@ -43,27 +43,27 @@ SEA-LION has an average performance on general tasks in English (as measured by

 SEA-LION was trained on 980B tokens of the following data:

-| Data Source | Tokens | Percentage |
-
-| RefinedWeb - English | 571.3B | 58.20% |
-| mC4 - Chinese |
-| mC4 - Indonesian |
-| mC4 - Malay |
-| mC4 - Filipino |
-| mC4 - Burmese |
-| mC4 - Vietnamese |
-| mC4 - Thai |
-| WangChanBERTa - Thai |
-| mC4 - Lao |
-| mC4 - Khmer |
-| mC4 - Tamil |
-| the Stack - Python |
-| the Stack - Javascript |
-| the Stack - Shell |
-| the Stack - SQL | 12.8B | 1.31% |
-| the Stack - Markdown |
-| RedPajama - StackExchange |
-| RedPajama - ArXiv |
+| Data Source               | Unique Tokens | Multiplier | Total Tokens | Percentage |
+|---------------------------|:-------------:|:----------:|:------------:|:----------:|
+| RefinedWeb - English      | 571.3B        | 1          | 571.3B       | 58.20%     |
+| mC4 - Chinese             | 91.2B         | 1          | 91.2B        | 9.29%      |
+| mC4 - Indonesian          | 3.68B         | 4          | 14.7B        | 1.50%      |
+| mC4 - Malay               | 0.72B         | 4          | 2.9B         | 0.29%      |
+| mC4 - Filipino            | 1.32B         | 4          | 5.3B         | 0.54%      |
+| mC4 - Burmese             | 1.2B          | 4          | 4.9B         | 0.49%      |
+| mC4 - Vietnamese          | 63.4B         | 1          | 63.4B        | 6.46%      |
+| mC4 - Thai                | 5.8B          | 2          | 11.6B        | 1.18%      |
+| WangChanBERTa - Thai      | 5B            | 2          | 10B          | 1.02%      |
+| mC4 - Lao                 | 0.27B         | 4          | 1.1B         | 0.12%      |
+| mC4 - Khmer               | 0.97B         | 4          | 3.9B         | 0.40%      |
+| mC4 - Tamil               | 2.55B         | 4          | 10.2B        | 1.04%      |
+| the Stack - Python        | 20.9B         | 2          | 41.8B        | 4.26%      |
+| the Stack - Javascript    | 55.6B         | 1          | 55.6B        | 5.66%      |
+| the Stack - Shell         | 1.25B         | 2          | 2.5B         | 0.26%      |
+| the Stack - SQL           | 6.4B          | 2          | 12.8B        | 1.31%      |
+| the Stack - Markdown      | 26.6B         | 1          | 26.6B        | 2.71%      |
+| RedPajama - StackExchange | 21.2B         | 1          | 21.2B        | 2.16%      |
+| RedPajama - ArXiv         | 30.6B         | 1          | 30.6B        | 3.12%      |

 ### Infrastructure

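
The added columns appear to be related by Total Tokens = Unique Tokens × Multiplier, with Percentage being each source's share of the summed totals. The source does not state either relationship, so the following is only a minimal Python sketch of that reading, useful for sanity-checking the new rows. The names `ROWS`, `total_tokens`, and `summarize` are illustrative and not part of the repository, and the Shell entry is read as 1.25B on the assumption that 1.25 × 2 = 2.5 was intended.

```python
# Rows copied from the "+" side of the diff:
# (data source, unique tokens in billions, multiplier).
# Assumption: the Shell entry is read as 1.25B, since 1.25 x 2 = 2.5.
ROWS = [
    ("RefinedWeb - English", 571.3, 1),
    ("mC4 - Chinese", 91.2, 1),
    ("mC4 - Indonesian", 3.68, 4),
    ("mC4 - Malay", 0.72, 4),
    ("mC4 - Filipino", 1.32, 4),
    ("mC4 - Burmese", 1.2, 4),
    ("mC4 - Vietnamese", 63.4, 1),
    ("mC4 - Thai", 5.8, 2),
    ("WangChanBERTa - Thai", 5.0, 2),
    ("mC4 - Lao", 0.27, 4),
    ("mC4 - Khmer", 0.97, 4),
    ("mC4 - Tamil", 2.55, 4),
    ("the Stack - Python", 20.9, 2),
    ("the Stack - Javascript", 55.6, 1),
    ("the Stack - Shell", 1.25, 2),
    ("the Stack - SQL", 6.4, 2),
    ("the Stack - Markdown", 26.6, 1),
    ("RedPajama - StackExchange", 21.2, 1),
    ("RedPajama - ArXiv", 30.6, 1),
]


def total_tokens(unique_b: float, multiplier: int) -> float:
    """Unique tokens times the repeat multiplier, in billions."""
    return unique_b * multiplier


def summarize(rows):
    """Return ([(source, total, share in percent), ...], grand total)."""
    totals = [(name, total_tokens(unique, mult)) for name, unique, mult in rows]
    grand_total = sum(total for _, total in totals)
    table = [(name, total, 100.0 * total / grand_total) for name, total in totals]
    return table, grand_total


if __name__ == "__main__":
    table, grand_total = summarize(ROWS)
    for name, total, share in table:
        print(f"{name:<27}{total:>8.2f}B{share:>7.2f}%")
    print(f"{'Sum':<27}{grand_total:>8.2f}B")
```

Because the table lists rounded figures, the computed totals and shares agree with the published columns only up to rounding. Under this reading the totals sum to roughly 981B, close to the 980B quoted in the context line.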