---
license: llama3
language:
- en
- id
- ta
- th
- vi
---
# SEA-LIONv2

SEA-LION is a collection of Large Language Models (LLMs) which have been pretrained and instruct-tuned for the Southeast Asia (SEA) region.
This model was continued pre-trained from the [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model.
This is the card for the LLaMA3 8B SEA-LIONv2 base model.

SEA-LION stands for <i>Southeast Asian Languages In One Network</i>.

## Model Details

### Model Description

The SEA-LION model is a significant leap forward in the field of Natural Language Processing,
specifically trained to understand the SEA regional context.

For tokenization, the model employs the default tokenizer used in Meta-Llama-3-8B-Instruct.

The continued pre-training data for the LLaMA3 8B SEA-LIONv2 base model encompasses approximately 48B tokens.

- **Developed by:** Products Pillar, AI Singapore
- **Funded by:** Singapore NRF
- **Model type:** Decoder
- **Languages:** English, Indonesian, Thai, Vietnamese, Tamil
- **License:** LLaMA3 Community License

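For orientation, below is a minimal sketch of loading the base model and its tokenizer with the Hugging Face `transformers` library. The repository ID is a placeholder for this model card's actual ID, and the bfloat16/`device_map="auto"` settings are illustrative choices rather than values taken from this card.

```python
# Minimal illustrative sketch: loading the base model with Hugging Face transformers.
# The repository ID below is a placeholder; substitute this model card's actual ID.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/llama3-8b-sea-lionv2-base"  # placeholder model ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # illustrative; matches the training precision listed below
    device_map="auto",
)

# Simple completion with the base (non-safety-aligned) model.
prompt = "Singapura ialah sebuah negara yang"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
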
### Performance Benchmarks

SEA-LION's average performance on general tasks in English (as measured by Hugging Face's LLM Leaderboard) is shown below:

| Model       |  ARC  |  BBH  | HellaSwag | MMLU  | GSM8k | Average |
|-------------|:-----:|:-----:|:---------:|:-----:|:-----:|:-------:|
| SEA-LION 7B | 58.87 | 47.70 |   81.14   | 63.11 | 50.49 |  60.26  |

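A hypothetical sketch of reproducing one of these scores with the `lm-evaluation-harness` Python API is shown below. The harness version (v0.4.x), the placeholder model ID, and the evaluation settings are assumptions; this card does not specify the exact leaderboard configuration.

```python
# Hypothetical sketch (assumes lm-evaluation-harness v0.4.x): running one of the
# benchmarks reported above. The model ID is a placeholder, and the task/few-shot
# settings may differ from those used by the Hugging Face leaderboard.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=aisingapore/llama3-8b-sea-lionv2-base,dtype=bfloat16",  # placeholder ID
    tasks=["hellaswag"],  # one of the benchmarks in the table above
    num_fewshot=10,
    batch_size=8,
)
print(results["results"]["hellaswag"])
```
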
## Training Details

### Data

The LLaMA3 8B SEA-LIONv2 base model was continued pre-trained on 48B tokens of the following data:

| Data Source                | Unique Tokens | Multiplier | Total Tokens | Percentage |
|----------------------------|:-------------:|:----------:|:------------:|:----------:|
| Dolma RefinedWeb - English | 7.650B        | 1          | 7.650B       | 15.90%     |
| Dolma C4 - English         | 1.160B        | 1          | 1B           | 9.21%      |
| Dolma Reddit - English     | 1.339B        | 1          | 14.7B        | 2.42%      |
| Dolma Semantic Scholar     | 0.959B        | 1          | 2.9B         | 2.79%      |
| Dolma arXiv                | 0.469B        | 1          | 5.3B         | 1.99%      |
| Dolma StarCoder            | 4.422B        | 1          | 4.9B         | 0.98%      |
| SEA-LION Pile - Indonesian | 3.4B          | 1          | 6.8B         | 14.17%     |
| Wiki* - Indonesian         | 0.3B          | 4          | 1.2B         | 2.50%      |
| SEA-LION Pile - Tamil      | 5.6B          | 1          | 5.6B         | 11.67%     |
| Wiki* + News - Tamil       | 0.6B          | 4          | 2.4B         | 5.00%      |
| SEA-LION Pile - Thai       | 2.28B         | 1          | 2.28B        | 4.75%      |
| WangChanBERTa - Thai       | 5B            | 1          | 5B           | 10.42%     |
| Wiki* - Thai               | 0.18B         | 4          | 0.72B        | 1.50%      |
| SEA-LION Pile - Vietnamese | 6.76B         | 1          | 6.76B        | 14.08%     |
| Wiki* - Vietnamese         | 0.31B         | 4          | 1.24B        | 2.58%      |

Note:
- All token counts are counted using the LLaMA3 tokenizer
- Wiki* sources include Wikipedia, Wiki Books, Wiki Source and Wiki Voyage
- Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/)

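The token counts in the note above are reported with the LLaMA3 tokenizer. As a rough illustration (not the actual data pipeline), counting tokens for a text corpus with that tokenizer could look like the sketch below; the corpus path is a placeholder.

```python
# Illustrative sketch only: counting tokens in a text corpus with the LLaMA3
# tokenizer via Hugging Face transformers. The corpus path is a placeholder and
# this is not the pipeline used to produce the table above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

total_tokens = 0
with open("corpus.txt", encoding="utf-8") as f:  # placeholder corpus file
    for line in f:
        # add_special_tokens=False so only the raw text tokens are counted
        total_tokens += len(tokenizer.encode(line, add_special_tokens=False))

print(f"Total tokens: {total_tokens:,}")
```
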
### Infrastructure

SEA-LION was trained using [MosaicML Composer](https://github.com/mosaicml/composer)
on the following hardware:

| Training Details     | LLaMA3 8B SEA-LIONv2 |
|----------------------|:--------------------:|
| AWS EC2 p5d.24xlarge | 8 instances          |
| Nvidia H100 80GB GPU | 64                   |
| Training Duration    | 2 days               |

### Configuration

| Hyperparameter    | LLaMA3 8B SEA-LIONv2 |
|-------------------|:--------------------:|
| Precision         | bfloat16             |
| Optimizer         | decoupled_adamw      |
| Scheduler         | weight_stable_decay  |
| Learning Rate     | 1.0e-5               |
| Global Batch Size | 512                  |
| Micro Batch Size  | 2                    |

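As a worked example of how the batch-size settings above fit together, the sketch below computes the implied gradient-accumulation factor. It assumes standard data-parallel training and is not an excerpt from the training code.

```python
# Illustrative arithmetic (assumes plain data-parallel training, not the actual
# training script): relating the batch-size settings in the table above.
num_gpus = 64          # Nvidia H100 80GB GPUs listed under Infrastructure
micro_batch_size = 2   # per-GPU batch size per forward/backward pass
global_batch_size = 512

# Gradient accumulation steps needed so that
# global_batch_size == num_gpus * micro_batch_size * grad_accum_steps
grad_accum_steps = global_batch_size // (num_gpus * micro_batch_size)
print(grad_accum_steps)  # -> 4
```
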
## The Team

Brandon Ong<br>
Bryan Siow<br>
Esther Choa<br>
Huang Yuli<br>
Lee Chwan Ren<br>
Leong Wai Yi<br>
Leong Wei Qi<br>
Li Yier<br>
Liu Bing Jie Darius<br>
Lovenia Holy<br>
Montalan Jann Railey<br>
Ng Boon Cheong Raymond<br>
Ngui Jian Gang<br>
Nguyen Thanh Ngan<br>
Nicholas Cheng<br>
Ong Tat-Wee David<br>
Ong Zhi Hao<br>
Rengarajan Hamsawardhini<br>
Susanto Yosephine<br>
Tai Ngee Chia<br>
Tan Choon Meng<br>
Teo Jin Howe<br>
Teo Eng Sipp Leslie<br>
Teo Wei Yi<br>
Tjhi William<br>
Walter Teng<br>
Wayne Lau<br>
Yeo Yeow Tong<br>
Yong Xianbin<br>

## Acknowledgements

AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore.

## Contact

For more information, please contact us using this [SEA-LION Inquiry Form](https://forms.gle/sLCUVb95wmGf43hi6).

[Link to SEA-LION's GitHub repository](https://github.com/aisingapore/sealion)

## Disclaimer

This is the repository for the base model.
The model has _not_ been aligned for safety.
Developers and users should perform their own safety fine-tuning and related security measures.
In no event shall the authors be held liable for any claim, damages, or other liability
arising from the use of the released weights and codes.

## References

```bibtex
@misc{lowphansirikul2021wangchanberta,
      title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
      author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
      year={2021},
      eprint={2101.09635},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```