---
language: th
license: cc-by-sa-4.0
tags:
- word segmentation
datasets:
- best2010
- lst20
- tlc
- vistec-tp-th-2021
- wisesight_sentiment
pipeline_tag: token-classification
---
# Multi-criteria BERT base Thai with Lattice for Word Segmentation
This is a variant of the pre-trained BERT base model. The model was pre-trained on Thai-language texts and fine-tuned for word segmentation, based on bert-base-multilingual-cased. This version of the model processes input texts at the character level, with word-level information incorporated through a lattice structure.
The pre-training scripts are available at tchayintr/latte-ptm-ws.
The LATTE scripts are available at tchayintr/latte-ws.
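For illustration, below is a minimal usage sketch with the Hugging Face transformers library. The repository ID is a placeholder, and loading the checkpoint through the standard AutoModelForTokenClassification head is an assumption; full lattice-based decoding is implemented in the tchayintr/latte-ws scripts.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Placeholder repository ID -- replace with this model's actual Hub ID.
model_id = "<this-model-repo-id>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Example Thai input: "I like to read books".
text = "ฉันชอบอ่านหนังสือ"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# One boundary label per token; argmax gives the predicted tag sequence.
predicted_ids = logits.argmax(dim=-1)[0].tolist()
labels = [model.config.id2label[i] for i in predicted_ids]
print(list(zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), labels)))
```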
## Model architecture
The model architecture is described in this paper.
## Training Data
The model was trained on multiple Thai word-segmented corpora: best2010, lst20, tlc (tnhc), vistec-tp-th-2021 (vistec2021), and wisesight_sentiment (ws160).
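As a sketch, one of these corpora, wisesight_sentiment, can be loaded from the Hugging Face Hub with the datasets library; the others may require separate license agreements or manual download, so this example covers only the publicly hosted one.

```python
from datasets import load_dataset

# wisesight_sentiment is hosted on the Hugging Face Hub; the other corpora
# (best2010, lst20, tlc, vistec-tp-th-2021) may need manual download or
# separate license agreements.
ds = load_dataset("wisesight_sentiment", split="train")
print(ds[0]["texts"])  # raw Thai text of the first example
```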
## Licenses
The pre-trained model is distributed under the terms of the Creative Commons Attribution-ShareAlike 4.0 license.
## Acknowledgments
This model was trained on GPU servers provided by the Okumura-Funakoshi NLP Group.