---
license: llama3
language:
- en
- id
- ta
- th
- vi
---
# SEA-LIONv2

SEA-LION is a collection of Large Language Models (LLMs) which have been pretrained and instruct-tuned for the Southeast Asia (SEA) region.
This model was continued pre-trained from the [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model.
This is the card for the LLaMA3 8B SEA-LIONv2 base model.

SEA-LION stands for <i>Southeast Asian Languages In One Network</i>.

## Model Details

### Model Description

The SEA-LION model is a significant leap forward in the field of Natural Language Processing,
specifically trained to understand the SEA regional context.

For tokenization, the model employs the default tokenizer used in Meta-Llama-3-8B-Instruct.

The continued pre-training data for the LLaMA3 8B SEA-LIONv2 base model encompasses approximately 48B tokens.

- **Developed by:** Products Pillar, AI Singapore
- **Funded by:** Singapore NRF
- **Model type:** Decoder
- **Languages:** English, Indonesian, Thai, Vietnamese, Tamil
- **License:** LLaMA3 Community License

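For orientation, below is a minimal sketch of loading the base model and its tokenizer with the Hugging Face `transformers` library. The repository ID is a placeholder for this model card's actual ID, and the bfloat16/`device_map="auto"` settings are illustrative choices rather than values taken from this card.

```python
# Minimal illustrative sketch: loading the base model with Hugging Face transformers.
# The repository ID below is a placeholder; substitute this model card's actual ID.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/llama3-8b-sea-lionv2-base"  # placeholder model ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # illustrative; matches the training precision listed below
    device_map="auto",
)

# Simple completion with the base (non-safety-aligned) model.
prompt = "Singapura ialah sebuah negara yang"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
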
### Performance Benchmarks

SEA-LION's average performance on general tasks in English (as measured by Hugging Face's LLM Leaderboard) is shown below:

| Model       |  ARC  |  BBH  | HellaSwag | MMLU  | GSM8k | Average |
|-------------|:-----:|:-----:|:---------:|:-----:|:-----:|:-------:|
| SEA-LION 7B | 58.87 | 47.70 |   81.14   | 63.11 | 50.49 |  60.26  |

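A hypothetical sketch of reproducing one of these scores with the `lm-evaluation-harness` Python API is shown below. The harness version (v0.4.x), the placeholder model ID, and the evaluation settings are assumptions; this card does not specify the exact leaderboard configuration.

```python
# Hypothetical sketch (assumes lm-evaluation-harness v0.4.x): running one of the
# benchmarks reported above. The model ID is a placeholder, and the task/few-shot
# settings may differ from those used by the Hugging Face leaderboard.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=aisingapore/llama3-8b-sea-lionv2-base,dtype=bfloat16",  # placeholder ID
    tasks=["hellaswag"],  # one of the benchmarks in the table above
    num_fewshot=10,
    batch_size=8,
)
print(results["results"]["hellaswag"])
```
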
## Training Details

### Data

The LLaMA3 8B SEA-LIONv2 base model was continued pre-trained on 48B tokens of the following data:

| Data Source                | Unique Tokens | Multiplier | Total Tokens | Percentage |
|----------------------------|:-------------:|:----------:|:------------:|:----------:|
| Dolma RefinedWeb - English | 7.650B        | 1          | 7.650B       | 15.90%     |
| Dolma C4 - English         | 1.160B        | 1          | 1B           | 9.21%      |
| Dolma Reddit - English     | 1.339B        | 1          | 14.7B        | 2.42%      |
| Dolma Semantic Scholar     | 0.959B        | 1          | 2.9B         | 2.79%      |
| Dolma arXiv                | 0.469B        | 1          | 5.3B         | 1.99%      |
| Dolma StarCoder            | 4.422B        | 1          | 4.9B         | 0.98%      |
| SEA-LION Pile - Indonesian | 3.4B          | 1          | 6.8B         | 14.17%     |
| Wiki* - Indonesian         | 0.3B          | 4          | 1.2B         | 2.50%      |
| SEA-LION Pile - Tamil      | 5.6B          | 1          | 5.6B         | 11.67%     |
| Wiki* + News - Tamil       | 0.6B          | 4          | 2.4B         | 5.00%      |
| SEA-LION Pile - Thai       | 2.28B         | 1          | 2.28B        | 4.75%      |
| WangChanBERTa - Thai       | 5B            | 1          | 5B           | 10.42%     |
| Wiki* - Thai               | 0.18B         | 4          | 0.72B        | 1.50%      |
| SEA-LION Pile - Vietnamese | 6.76B         | 1          | 6.76B        | 14.08%     |
| Wiki* - Vietnamese         | 0.31B         | 4          | 1.24B        | 2.58%      |

Note:
- All token counts are counted using the LLaMA3 tokenizer
- Wiki* sources include Wikipedia, Wiki Books, Wiki Source and Wiki Voyage
- Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/)

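The token counts in the note above are reported with the LLaMA3 tokenizer. As a rough illustration (not the actual data pipeline), counting tokens for a text corpus with that tokenizer could look like the sketch below; the corpus path is a placeholder.

```python
# Illustrative sketch only: counting tokens in a text corpus with the LLaMA3
# tokenizer via Hugging Face transformers. The corpus path is a placeholder and
# this is not the pipeline used to produce the table above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

total_tokens = 0
with open("corpus.txt", encoding="utf-8") as f:  # placeholder corpus file
    for line in f:
        # add_special_tokens=False so only the raw text tokens are counted
        total_tokens += len(tokenizer.encode(line, add_special_tokens=False))

print(f"Total tokens: {total_tokens:,}")
```
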
### Infrastructure

SEA-LION was trained using [MosaicML Composer](https://github.com/mosaicml/composer)
on the following hardware:

| Training Details     | LLaMA3 8B SEA-LIONv2 |
|----------------------|:--------------------:|
| AWS EC2 p5d.24xlarge | 8 instances          |
| Nvidia H100 80GB GPU | 64                   |
| Training Duration    | 2 days               |

### Configuration

| Hyperparameter    | LLaMA3 8B SEA-LIONv2 |
|-------------------|:--------------------:|
| Precision         | bfloat16             |
| Optimizer         | decoupled_adamw      |
| Scheduler         | weight_stable_decay  |
| Learning Rate     | 1.0e-5               |
| Global Batch Size | 512                  |
| Micro Batch Size  | 2                    |

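As a worked example of how the batch-size settings above fit together, the sketch below computes the implied gradient-accumulation factor. It assumes standard data-parallel training and is not an excerpt from the training code.

```python
# Illustrative arithmetic (assumes plain data-parallel training, not the actual
# training script): relating the batch-size settings in the table above.
num_gpus = 64          # Nvidia H100 80GB GPUs listed under Infrastructure
micro_batch_size = 2   # per-GPU batch size per forward/backward pass
global_batch_size = 512

# Gradient accumulation steps needed so that
# global_batch_size == num_gpus * micro_batch_size * grad_accum_steps
grad_accum_steps = global_batch_size // (num_gpus * micro_batch_size)
print(grad_accum_steps)  # -> 4
```
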
## The Team

Brandon Ong<br>
Bryan Siow<br>
Esther Choa<br>
Huang Yuli<br>
Lee Chwan Ren<br>
Leong Wai Yi<br>
Leong Wei Qi<br>
Li Yier<br>
Liu Bing Jie Darius<br>
Lovenia Holy<br>
Montalan Jann Railey<br>
Ng Boon Cheong Raymond<br>
Ngui Jian Gang<br>
Nguyen Thanh Ngan<br>
Nicholas Cheng<br>
Ong Tat-Wee David<br>
Ong Zhi Hao<br>
Rengarajan Hamsawardhini<br>
Susanto Yosephine<br>
Tai Ngee Chia<br>
Tan Choon Meng<br>
Teo Jin Howe<br>
Teo Eng Sipp Leslie<br>
Teo Wei Yi<br>
Tjhi William<br>
Walter Teng<br>
Wayne Lau<br>
Yeo Yeow Tong<br>
Yong Xianbin<br>

## Acknowledgements

AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore.

## Contact

For more information, please contact us using this [SEA-LION Inquiry Form](https://forms.gle/sLCUVb95wmGf43hi6).

[Link to SEA-LION's GitHub repository](https://github.com/aisingapore/sealion)

## Disclaimer

This is the repository for the base model.
The model has _not_ been aligned for safety.
Developers and users should perform their own safety fine-tuning and related security measures.
In no event shall the authors be held liable for any claim, damages, or other liability
arising from the use of the released weights and codes.

## References

```bibtex
@misc{lowphansirikul2021wangchanberta,
      title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
      author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
      year={2021},
      eprint={2101.09635},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```