tainc committed
Commit 570bc0e · verified · 1 Parent(s): 1de799c

Update README.md

Files changed (1)
  1. README.md +37 -77
README.md CHANGED
@@ -6,28 +6,28 @@ language:
  - th
  - vi
  license: llama3
  ---
  # Llama3 8B CPT SEA-LIONv2

  SEA-LION is a collection of Large Language Models (LLMs) which has been pretrained and instruct-tuned for the Southeast Asia (SEA) region.
- This is the card for the Llama3 8B CPT SEA-LIONv2 base model which has undergone continued pre-training from the [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model.

- SEA-LION stands for <i>Southeast Asian Languages In One Network</i>.
-
-
- ## Model Details

- ### Model Description
-
- The continued pre-training data for Llama3 8B CPT SEA-LIONv2 base model encompasses approximately 48B tokens.

  - **Developed by:** Products Pillar, AI Singapore
  - **Funded by:** Singapore NRF
  - **Model type:** Decoder
- - **Languages:** English, Indonesian, Thai, Vietnamese, Tamil
  - **License:** [Llama3 Community License](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE)

- For tokenization, the model employs the default tokenizer used in Meta-Llama-3-8B-Instruct.

  ### Benchmark Performance
  We evaluated Llama3 8B CPT SEA-LIONv2 base model on general language capabilities.
@@ -41,9 +41,27 @@ The evaluation was done **five-shot** with native prompts and only a sample of 1
  For more details on Llama3 8B CPT SEA-LIONv2 base benchmark performance, please refer to the SEA HELM leaderboard, https://leaderboard.sea-lion.ai/

  ## Training Details

- ### Data

  Llama3 8B CPT SEA-LIONv2 base model was continued pre-trained on 48B tokens of the following data:

  | Data Source | Unique Tokens (B) | Multiplier | Total Tokens (B) | Percentage (%) |
@@ -69,83 +87,25 @@ Note:
  - wiki* sources includes Wikipedia, Wiki Books, Wiki Source and Wiki Voyage
  - Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/)

- ### Infrastructure
-
- Llama3 8B CPT SEA-LIONv2 was trained using [MosaicML Composer](https://github.com/mosaicml/composer)
- on the following hardware:
-
- | Training Details | Llama3 8B CPT SEA-LIONv2 |
- |----------------------|:--------------------:|
- | AWS EC2 p5d.24xlarge | 8 instances |
- | Nvidia H100 80GB GPU | 64 |
- | Training Duration | 2 days |
-
-
- ### Configuration
-
- | HyperParameter | Llama3 8B CPT SEA-LIONv2 |
- |-------------------|:--------------------:|
- | Precision | bfloat16 |
- | Optimizer | decoupled_adamw |
- | Scheduler | weight_stable_decay |
- | Learning Rate | 1.0e-5 |
- | Global Batch Size | 512 |
- | Micro Batch Size | 2 |
-
 
  ## The Team
-
- Choa Esther<br>
- Cheng Nicholas<br>
- Huang Yuli<br>
- Lau Wayne<br>
- Lee Chwan Ren<br>
- Leong Wai Yi<br>
- Leong Wei Qi<br>
- Li Yier<br>
- Liu Bing Jie Darius<br>
- Lovenia Holy<br>
- Montalan Jann Railey<br>
- Ng Boon Cheong Raymond<br>
- Ngui Jian Gang<br>
- Nguyen Thanh Ngan<br>
- Ong Brandon<br>
- Ong Tat-Wee David<br>
- Ong Zhi Hao<br>
- Rengarajan Hamsawardhini<br>
- Siow Bryan<br>
- Susanto Yosephine<br>
- Tai Ngee Chia<br>
- Tan Choon Meng<br>
- Teo Eng Sipp Leslie<br>
- Teo Wei Yi<br>
- Tjhi William<br>
- Teng Walter<br>
- Yeo Yeow Tong<br>
- Yong Xianbin<br>
-
 
  ## Acknowledgements
-
- AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore.
- Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore.
-

  ## Contact

- For more info, please contact us using this [SEA-LION Inquiry Form](https://forms.gle/sLCUVb95wmGf43hi6)
-
- [Link to SEA-LION's GitHub repository](https://github.com/aisingapore/sealion)
-

  ## Disclaimer
-
- This the repository for the base model.
  The model has _not_ been aligned for safety.
  Developers and users should perform their own safety fine-tuning and related security measures.
- In no event shall the authors be held liable for any claim, damages, or other liability
- arising from the use of the released weights and codes.
-

  ## References
  ### Thai Pre-Training Data Reference
 
  - th
  - vi
  license: llama3
+ base_model: meta-llama/Meta-Llama-3-8B-Instruct
+ new_version: aisingapore/llama3.1-8b-cpt-sea-lionv3-base
  ---
  # Llama3 8B CPT SEA-LIONv2

  SEA-LION is a collection of Large Language Models (LLMs) which has been pretrained and instruct-tuned for the Southeast Asia (SEA) region.

+ Llama3 8B CPT SEA-LIONv2 Base is a multilingual model which has undergone continued pre-training on approximately **48B** tokens across 5 SEA languages: English, Indonesian, Tamil, Thai and Vietnamese.

+ SEA-LION stands for <i>Southeast Asian Languages In One Network</i>.

  - **Developed by:** Products Pillar, AI Singapore
  - **Funded by:** Singapore NRF
  - **Model type:** Decoder
+ - **Languages supported:** English, Indonesian, Thai, Vietnamese, Tamil
  - **License:** [Llama3 Community License](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE)

+ ## Model Details
+ ### Model Description
+ We performed continued pre-training in English and SEA languages on [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), a decoder model using the Llama 3 architecture, to create Llama3 8B CPT SEA-LIONv2 Base.
+
+ For tokenisation, the model employs the default tokenizer used in Llama 3 8B Instruct.
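
As an illustrative loading sketch (not part of the original card), the base model and its Llama 3 tokenizer can be loaded with the Hugging Face `transformers` library; the repository id and example prompt below are assumptions, not taken from the card:

```python
# Minimal loading sketch. The repository id is assumed from this card's naming
# convention and should be checked on the Hugging Face Hub before use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/llama3-8b-cpt-sea-lionv2-base"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)  # default Llama 3 8B Instruct tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the bfloat16 training precision listed below
    device_map="auto",
)

# Base (non-instruct) checkpoint: plain text continuation rather than chat prompting.
inputs = tokenizer("Ibu kota Indonesia adalah", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```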

  ### Benchmark Performance
  We evaluated Llama3 8B CPT SEA-LIONv2 base model on general language capabilities.
 
  For more details on Llama3 8B CPT SEA-LIONv2 base benchmark performance, please refer to the SEA HELM leaderboard, https://leaderboard.sea-lion.ai/

  ## Training Details
+ ### Infrastructure
+ Llama3 8B CPT SEA-LIONv2 was trained using [MosaicML Composer](https://github.com/mosaicml/composer)
+ on the following hardware:
+
+ | Training Details | Llama3 8B CPT SEA-LIONv2 |
+ |----------------------|:--------------------:|
+ | AWS EC2 p5d.24xlarge | 8 instances |
+ | Nvidia H100 80GB GPU | 64 |
+ | Training Duration | 2 days |

+ ### Configuration
+ | HyperParameter | Llama3 8B CPT SEA-LIONv2 |
+ |-------------------|:--------------------:|
+ | Precision | bfloat16 |
+ | Optimizer | decoupled_adamw |
+ | Scheduler | weight_stable_decay |
+ | Learning Rate | 1.0e-5 |
+ | Global Batch Size | 512 |
+ | Micro Batch Size | 2 |
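
Read together with the hardware table, the batch settings above imply a gradient-accumulation factor of 4. Below is a rough sketch of that arithmetic and of the `decoupled_adamw` setting using MosaicML Composer's `DecoupledAdamW`; the model, dataloader, scheduler and sequence length are not specified in the card and are therefore omitted or stubbed out:

```python
# Sketch of the batching arithmetic and optimizer implied by the tables above.
# Only the numbers shown in the card are used; the parameter list is a stand-in.
import torch
from composer.optim import DecoupledAdamW

num_gpus = 64            # 8 instances x 8 H100 GPUs each (from the hardware table)
global_batch_size = 512  # sequences per optimizer step (from the configuration table)
micro_batch_size = 2     # sequences per GPU per forward/backward pass

# Implied gradient accumulation: 512 / (2 * 64) = 4 micro-steps per optimizer step.
grad_accum_steps = global_batch_size // (micro_batch_size * num_gpus)
print(f"gradient accumulation steps: {grad_accum_steps}")

# decoupled_adamw at learning rate 1.0e-5, as listed under "Configuration".
dummy_params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for model.parameters()
optimizer = DecoupledAdamW(dummy_params, lr=1.0e-5)
```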

+ ## Data
  Llama3 8B CPT SEA-LIONv2 base model was continued pre-trained on 48B tokens of the following data:

  | Data Source | Unique Tokens (B) | Multiplier | Total Tokens (B) | Percentage (%) |
 
  - wiki* sources includes Wikipedia, Wiki Books, Wiki Source and Wiki Voyage
  - Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/)
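
The per-source rows are collapsed in this diff view, but the column headers above read as: Total Tokens is Unique Tokens multiplied by the sampling Multiplier, and Percentage is each source's share of the summed mixture. A small sketch with hypothetical values (not the card's actual numbers):

```python
# Hypothetical rows; the real per-source figures are collapsed in this diff view.
# Reading of the headers: total = unique * multiplier; percentage = share of mixture.
rows = {
    # source:          (unique_tokens_B, multiplier)  -- illustrative values only
    "source_a (web)":  (6.0, 2),
    "source_b (news)": (1.5, 4),
}

totals = {name: unique * mult for name, (unique, mult) in rows.items()}
mixture_total = sum(totals.values())

for name, total in totals.items():
    print(f"{name}: {total:.1f}B tokens, {100 * total / mixture_total:.2f}% of mixture")
```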
 
+ ## Call for Contributions
+ We encourage researchers, developers, and language enthusiasts to actively contribute to the enhancement and expansion of SEA-LION. Contributions can involve identifying and reporting bugs, sharing pre-training, instruction, and preference data, improving documentation usability, proposing and implementing new model evaluation tasks and metrics, or training versions of the model in additional Southeast Asian languages. Join us in shaping the future of SEA-LION by sharing your expertise and insights to make these models more accessible, accurate, and versatile. Please check out our GitHub for further information on the call for contributions.
 
  ## The Team
+ Cheng Nicholas, Choa Esther, Huang Yuli, Lau Wayne, Lee Chwan Ren, Leong Wai Yi, Leong Wei Qi, Li Yier, Liu Bing Jie Darius, Lovenia Holy, Montalan Jann Railey, Ng Boon Cheong Raymond, Ngui Jian Gang, Nguyen Thanh Ngan, Ong Brandon, Ong Tat-Wee David, Ong Zhi Hao, Rengarajan Hamsawardhini, Siow Bryan, Susanto Yosephine, Tai Ngee Chia, Tan Choon Meng, Teo Eng Sipp Leslie, Teo Wei Yi, Tjhi William, Teng Walter, Yeo Yeow Tong, Yong Xianbin
 
  ## Acknowledgements
+ [AI Singapore](https://aisingapore.org/) is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation or the National University of Singapore.
 
  ## Contact
+ For more info, please contact us using this [SEA-LION Inquiry Form](https://forms.gle/sLCUVb95wmGf43hi6).

+ [Link to SEA-LION's GitHub repository](https://github.com/aisingapore/sealion).
 
  ## Disclaimer
+ This is the repository for the base model.
 
  The model has _not_ been aligned for safety.
  Developers and users should perform their own safety fine-tuning and related security measures.
+ In no event shall the authors be held liable for any claims, damages, or other liabilities arising from the use of the released weights and codes.

  ## References
  ### Thai Pre-Training Data Reference