---
license: llama3
language:
- en
- id
- ta
- th
- vi
---
# SEA-LIONv2

SEA-LION is a collection of Large Language Models (LLMs) which have been pretrained and instruct-tuned for the Southeast Asia (SEA) region.
This model was continued pre-trained from the [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model.
This is the card for the LLaMA3 8B SEA-LIONv2 base model.

SEA-LION stands for <i>Southeast Asian Languages In One Network</i>.

## Model Details

### Model Description

The SEA-LION model is a significant leap forward in the field of Natural Language Processing,
specifically trained to understand the SEA regional context.

For tokenization, the model employs the default tokenizer used in Meta-Llama-3-8B-Instruct.

The continued pre-training data for the LLaMA3 8B SEA-LIONv2 base model encompasses approximately 48B tokens.

- **Developed by:** Products Pillar, AI Singapore
- **Funded by:** Singapore NRF
- **Model type:** Decoder
- **Languages:** English, Indonesian, Thai, Vietnamese, Tamil
- **License:** LLaMA3 Community License

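Since the model uses the standard LLaMA3 architecture and tokenizer, it can be loaded with the Hugging Face `transformers` library. The snippet below is a minimal sketch rather than an official quickstart; the repository ID shown is an assumption and should be replaced with this model card's actual ID.

```python
# Minimal sketch of loading and prompting the base model with Hugging Face transformers.
# The repository ID below is an assumption -- replace it with this model card's actual ID.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/llama3-8b-cpt-sea-lionv2-base"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the training precision listed under Configuration
    device_map="auto",
)

# As a base (non-aligned) model, it is best used with completion-style prompts.
prompt = "Ibu kota Indonesia adalah"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
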
### Performance Benchmarks

SEA-LION achieves the following average performance on general tasks in English (as measured by Hugging Face's LLM Leaderboard):

| Model       |  ARC  |  BBH  | HellaSwag | MMLU  | GSM8k | Average |
|-------------|:-----:|:-----:|:---------:|:-----:|:-----:|:-------:|
| SEA-LION 7B | 58.87 | 47.70 |   81.14   | 63.11 | 50.49 |  60.26  |

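These scores come from the Hugging Face LLM Leaderboard, which is backed by EleutherAI's lm-evaluation-harness. The sketch below shows, under stated assumptions, how a comparable local evaluation could be run; the task names, few-shot settings, and model ID are assumptions and may not match the leaderboard's exact configuration.

```python
# Rough sketch of running a comparable evaluation with lm-evaluation-harness.
# Task names, few-shot settings and the model ID are assumptions; the
# leaderboard's exact harness configuration may differ.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=aisingapore/llama3-8b-cpt-sea-lionv2-base",  # assumed repo ID
    tasks=["arc_challenge", "hellaswag", "mmlu", "gsm8k"],
    batch_size=8,
)
print(results["results"])
```
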
## Training Details

### Data

The LLaMA3 8B SEA-LIONv2 base model was continued pre-trained on 48B tokens of the following data:

| Data Source                | Unique Tokens | Multiplier | Total Tokens | Percentage |
|----------------------------|:-------------:|:----------:|:------------:|:----------:|
| Dolma RefinedWeb - English | 7.650B        | 1          | 7.650B       | 15.90%     |
| Dolma C4 - English         | 1.160B        | 1          | 1B           | 9.21%      |
| Dolma Reddit - English     | 1.339B        | 1          | 14.7B        | 2.42%      |
| Dolma Semantic Scholar     | 0.959B        | 1          | 2.9B         | 2.79%      |
| Dolma arXiv                | 0.469B        | 1          | 5.3B         | 1.99%      |
| Dolma StarCoder            | 4.422B        | 1          | 4.9B         | 0.98%      |
| SEA-LION Pile - Indonesian | 3.4B          | 1          | 6.8B         | 14.17%     |
| Wiki* - Indonesian         | 0.3B          | 4          | 1.2B         | 2.50%      |
| SEA-LION Pile - Tamil      | 5.6B          | 1          | 5.6B         | 11.67%     |
| Wiki* + News - Tamil       | 0.6B          | 4          | 2.4B         | 5.00%      |
| SEA-LION Pile - Thai       | 2.28B         | 1          | 2.28B        | 4.75%      |
| WangChanBERTa - Thai       | 5B            | 1          | 5B           | 10.42%     |
| Wiki* - Thai               | 0.18B         | 4          | 0.72B        | 1.50%      |
| SEA-LION Pile - Vietnamese | 6.76B         | 1          | 6.76B        | 14.08%     |
| Wiki* - Vietnamese         | 0.31B         | 4          | 1.24B        | 2.58%      |

Note:
- All token counts are computed using the LLaMA3 tokenizer.
- Wiki* sources include Wikipedia, Wiki Books, Wiki Source and Wiki Voyage.
- Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/).

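As a point of reference for how the token counts above are measured, the sketch below counts tokens in a piece of text with the same tokenizer. The repository ID is an assumption; any LLaMA3 tokenizer gives identical counts, since this model reuses it unchanged.

```python
# Minimal sketch: count tokens the same way as the table above, using the
# LLaMA3 tokenizer that this model reuses. The repository ID is an assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("aisingapore/llama3-8b-cpt-sea-lionv2-base")

text = "Selamat pagi, apa kabar?"
num_tokens = len(tokenizer(text)["input_ids"])
print(f"{num_tokens} tokens")
```
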
### Infrastructure

SEA-LION was trained using [MosaicML Composer](https://github.com/mosaicml/composer)
on the following hardware:

| Training Details     | LLaMA3 8B SEA-LIONv2 |
|----------------------|:--------------------:|
| AWS EC2 p5d.24xlarge | 8 instances          |
| Nvidia H100 80GB GPU | 64                   |
| Training Duration    | 2 days               |


### Configuration

| HyperParameter    | LLaMA3 8B SEA-LIONv2 |
|-------------------|:--------------------:|
| Precision         | bfloat16             |
| Optimizer         | decoupled_adamw      |
| Scheduler         | weight_stable_decay  |
| Learning Rate     | 1.0e-5               |
| Global Batch Size | 512                  |
| Micro Batch Size  | 2                    |

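Here, `decoupled_adamw` refers to Composer's AdamW variant with decoupled weight decay. The snippet below is only an illustrative mapping of the table onto MosaicML Composer objects under stated assumptions, not the published training code; the scheduler and batch-size handling are omitted.

```python
# Illustrative mapping of the hyperparameter table onto MosaicML Composer,
# not the published training code. Assumes the base weights are loaded with
# Hugging Face transformers; scheduler and batching are not shown.
import torch
from composer.optim import DecoupledAdamW
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # starting point for continued pre-training
    torch_dtype=torch.bfloat16,             # Precision: bfloat16
)

# Optimizer: decoupled_adamw, Learning Rate: 1.0e-5
optimizer = DecoupledAdamW(model.parameters(), lr=1.0e-5)
```
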
## The Team

Brandon Ong<br>
Bryan Siow<br>
Esther Choa<br>
Huang Yuli<br>
Lee Chwan Ren<br>
Leong Wai Yi<br>
Leong Wei Qi<br>
Li Yier<br>
Liu Bing Jie Darius<br>
Lovenia Holy<br>
Montalan Jann Railey<br>
Ng Boon Cheong Raymond<br>
Ngui Jian Gang<br>
Nguyen Thanh Ngan<br>
Nicholas Cheng<br>
Ong Tat-Wee David<br>
Ong Zhi Hao<br>
Rengarajan Hamsawardhini<br>
Susanto Yosephine<br>
Tai Ngee Chia<br>
Tan Choon Meng<br>
Teo Jin Howe<br>
Teo Eng Sipp Leslie<br>
Teo Wei Yi<br>
Tjhi William<br>
Walter Teng<br>
Wayne Lau<br>
Yeo Yeow Tong<br>
Yong Xianbin<br>

## Acknowledgements

AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.

## Contact

For more information, please contact us using this [SEA-LION Inquiry Form](https://forms.gle/sLCUVb95wmGf43hi6).

[Link to SEA-LION's GitHub repository](https://github.com/aisingapore/sealion)

## Disclaimer

This is the repository for the base model.
The model has _not_ been aligned for safety.
Developers and users should perform their own safety fine-tuning and related security measures.
In no event shall the authors be held liable for any claim, damages, or other liability
arising from the use of the released weights and code.

## References

```bibtex
@misc{lowphansirikul2021wangchanberta,
      title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
      author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
      year={2021},
      eprint={2101.09635},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```