  1. README.md +21 -21
README.md CHANGED
@@ -43,27 +43,27 @@ SEA-LION has an average performance on general tasks in English (as measured by
 
  SEA-LION was trained on 980B tokens of the following data:
 
- | Data Source | Tokens | Percentage |
- |---------------------------|-------:|:----------:|
- | RefinedWeb - English | 571.3B | 58.20% |
- | mC4 - Chinese | 91.2B | 9.29% |
- | mC4 - Indonesian | 14.7B | 1.50% |
- | mC4 - Malay | 2.9B | 0.29% |
- | mC4 - Filipino | 5.3B | 0.54% |
- | mC4 - Burmese | 4.9B | 0.49% |
- | mC4 - Vietnamese | 63.4B | 6.46% |
- | mC4 - Thai | 11.6B | 1.18% |
- | WangChanBERTa - Thai | 10B | 1.02% |
- | mC4 - Lao | 1.1B | 0.12% |
- | mC4 - Khmer | 3.9B | 0.40% |
- | mC4 - Tamil | 10.2B | 1.04% |
- | the Stack - Python | 41.8B | 4.26% |
- | the Stack - Javascript | 55.6B | 5.66% |
- | the Stack - Shell | 2.5B | 0.26% |
- | the Stack - SQL | 12.8B | 1.31% |
- | the Stack - Markdown | 26.6B | 2.71% |
- | RedPajama - StackExchange | 21.2B | 2.16% |
- | RedPajama - ArXiv | 30.6B | 3.12% |
+ | Data Source | Unique Tokens | Multiplier | Total Tokens | Percentage |
+ |---------------------------|:-------------:|:----------:|:------------:|:----------:|
+ | RefinedWeb - English | 571.3B | 1 | 571.3B | 58.20% |
+ | mC4 - Chinese | 91.2B | 1 | 91.2B | 9.29% |
+ | mC4 - Indonesian | 3.68B | 4 | 14.7B | 1.50% |
+ | mC4 - Malay | 0.72B | 4 | 2.9B | 0.29% |
+ | mC4 - Filipino | 1.32B | 4 | 5.3B | 0.54% |
+ | mC4 - Burmese | 1.2B | 4 | 4.9B | 0.49% |
+ | mC4 - Vietnamese | 63.4B | 1 | 63.4B | 6.46% |
+ | mC4 - Thai | 5.8B | 2 | 11.6B | 1.18% |
+ | WangChanBERTa - Thai | 5B | 2 | 10B | 1.02% |
+ | mC4 - Lao | 0.27B | 4 | 1.1B | 0.12% |
+ | mC4 - Khmer | 0.97B | 4 | 3.9B | 0.40% |
+ | mC4 - Tamil | 2.55B | 4 | 10.2B | 1.04% |
+ | the Stack - Python | 20.9B | 2 | 41.8B | 4.26% |
+ | the Stack - Javascript | 55.6B | 1 | 55.6B | 5.66% |
+ | the Stack - Shell | 1.25B | 2 | 2.5B | 0.26% |
+ | the Stack - SQL | 6.4B | 2 | 12.8B | 1.31% |
+ | the Stack - Markdown | 26.6B | 1 | 26.6B | 2.71% |
+ | RedPajama - StackExchange | 21.2B | 1 | 21.2B | 2.16% |
+ | RedPajama - ArXiv | 30.6B | 1 | 30.6B | 3.12% |
 
  ### Infrastructure
 
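
The columns added in this change appear to be related by Total Tokens = Unique Tokens × Multiplier, with Percentage being each source's share of the summed totals. The following is a minimal sketch for cross-checking that reading, not part of the README itself. The names `ROWS`, `total_tokens`, and `percentages` are hypothetical, and the Shell row's unique count is taken as 1.25B, inferred from its 2.5B total and multiplier of 2. Because the table's values are rounded, computed totals can differ from the listed ones by up to about 0.1B.

```python
# Rows from the new table: (source, unique tokens [B], multiplier, listed total [B]).
# Values are copied from the diff; the Shell unique count (1.25) is inferred
# from its listed total (2.5B) divided by its multiplier (2).
ROWS = [
    ("RefinedWeb - English", 571.3, 1, 571.3),
    ("mC4 - Chinese", 91.2, 1, 91.2),
    ("mC4 - Indonesian", 3.68, 4, 14.7),
    ("mC4 - Malay", 0.72, 4, 2.9),
    ("mC4 - Filipino", 1.32, 4, 5.3),
    ("mC4 - Burmese", 1.2, 4, 4.9),
    ("mC4 - Vietnamese", 63.4, 1, 63.4),
    ("mC4 - Thai", 5.8, 2, 11.6),
    ("WangChanBERTa - Thai", 5.0, 2, 10.0),
    ("mC4 - Lao", 0.27, 4, 1.1),
    ("mC4 - Khmer", 0.97, 4, 3.9),
    ("mC4 - Tamil", 2.55, 4, 10.2),
    ("the Stack - Python", 20.9, 2, 41.8),
    ("the Stack - Javascript", 55.6, 1, 55.6),
    ("the Stack - Shell", 1.25, 2, 2.5),
    ("the Stack - SQL", 6.4, 2, 12.8),
    ("the Stack - Markdown", 26.6, 1, 26.6),
    ("RedPajama - StackExchange", 21.2, 1, 21.2),
    ("RedPajama - ArXiv", 30.6, 1, 30.6),
]


def total_tokens(unique_b: float, multiplier: int) -> float:
    """Tokens seen in training (billions), assuming total = unique * multiplier."""
    return unique_b * multiplier


def percentages(rows):
    """Each source's share (in percent) of the summed computed totals."""
    totals = {name: total_tokens(u, m) for name, u, m, _listed in rows}
    grand_total = sum(totals.values())
    return {name: 100.0 * t / grand_total for name, t in totals.items()}


if __name__ == "__main__":
    shares = percentages(ROWS)
    for name, unique_b, mult, listed in ROWS:
        computed = total_tokens(unique_b, mult)
        print(f"{name:28s} {computed:7.2f}B (listed {listed}B) {shares[name]:5.2f}%")
```

Running the script prints the computed total next to the listed one for each source, which makes rounding differences or transcription slips in a row easy to spot.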