 widget:
   - text: วัน ที่ _ 12 _ มีนาคม นี้ _ ฉัน จะ ไป เที่ยว วัดพระแก้ว _ ที่ กรุงเทพ
 library_name: transformers
+---
+# HoogBERTa
+This repository contains the Thai pretrained language representation (HoogBERTa_base), fine-tuned for the part-of-speech (POS) tagging task.
+# Documentation
+## Prerequisite
+Since HoogBERTa uses subword-nmt BPE encoding, input text must be pre-tokenized according to the [BEST](https://huggingface.co/datasets/best2009) standard before it is passed to the model. The examples in this card use AttaCut for that step:
+```
+pip install attacut
+```
+## Getting Started
+To initialize the model from the Hub, use the following commands:
+```python
+from transformers import RobertaTokenizerFast, RobertaForTokenClassification
+from attacut import tokenize
+import torch
+tokenizer = RobertaTokenizerFast.from_pretrained("new5558/HoogBERTa-POS-lst20")
+model = RobertaForTokenClassification.from_pretrained("new5558/HoogBERTa-POS-lst20")
+```
+To run POS tagging, use the following commands:
+```python
+from transformers import pipeline
+nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")
+sentence = "วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"
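+# Segment each space-separated chunk into words with AttaCut
+all_sent = []
+sentences = sentence.split(" ")
+for sent in sentences:
+    all_sent.append(" ".join(tokenize(sent)).replace("_","[!und:]"))
+# Mark the original spaces with a standalone "_" token
+sentence = " _ ".join(all_sent)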
+print(nlp(sentence))
+```
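The preprocessing in the snippet above can be wrapped in a small helper. The sketch below is not part of the HoogBERTa release: `prepare_input` is a hypothetical name, and the word segmenter is passed in as a parameter (for example `attacut.tokenize`) so the function itself has no hard dependency on a particular tokenizer.

```python
from typing import Callable, List


def prepare_input(text: str, word_tokenize: Callable[[str], List[str]]) -> str:
    """Convert raw text into the space-delimited form used in this card.

    `word_tokenize` is any word segmenter that maps a string to a list of
    words, e.g. `attacut.tokenize` (hypothetical parameter name).
    """
    chunks = []
    # Spaces in the raw text separate chunks; each chunk is segmented on its own.
    for chunk in text.split(" "):
        words = " ".join(word_tokenize(chunk))
        # As in the card's snippets, literal underscores become "[!und:]",
        # presumably so they cannot be confused with the " _ " space marker.
        chunks.append(words.replace("_", "[!und:]"))
    # The original spaces are re-inserted as standalone "_" tokens.
    return " _ ".join(chunks)
```

With the objects defined earlier, `nlp(prepare_input(raw_text, tokenize))` should behave like the manual loop, where `raw_text` is the unsegmented sentence.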
+For batch processing, pass a list of preprocessed sentences:
+```python
+from transformers import pipeline
+nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")
+sentenceL = ["วันที่ 12 มีนาคมนี้","ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]
+inputList = []
+for sentX in sentenceL:
+    sentences = sentX.split(" ")
+    all_sent = []
+    for sent in sentences:
+        all_sent.append(" ".join(tokenize(sent)).replace("_","[!und:]"))
+    sentence = " _ ".join(all_sent)
+    inputList.append(sentence)
+print(nlp(inputList))
+```
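With `aggregation_strategy="none"`, the pipeline returns one prediction per subword token. One way to get a single tag per word is to keep the label of the first subword that overlaps each word. The following is a minimal sketch under that assumption: `tags_per_word` is a hypothetical helper, not part of HoogBERTa, and it relies on the `entity`, `start`, and `end` fields that the token-classification pipeline emits when a fast tokenizer is used.

```python
from typing import Dict, List, Optional, Tuple


def tags_per_word(prepared: str, predictions: List[Dict]) -> List[Tuple[str, Optional[str]]]:
    """Pair each space-delimited word with the label of its first subword.

    `prepared` is the preprocessed sentence that was passed to the pipeline;
    `predictions` is the pipeline output for that sentence.
    """
    results = []
    cursor = 0
    for word in prepared.split():
        # Locate the word's character span in the preprocessed sentence.
        start = prepared.index(word, cursor)
        end = start + len(word)
        cursor = end
        # Take the first prediction whose character span overlaps this word;
        # None means no subword prediction was found for the word.
        label = next(
            (p["entity"] for p in predictions if p["start"] < end and p["end"] > start),
            None,
        )
        results.append((word, label))
    return results
```

For a single preprocessed sentence, `tags_per_word(sentence, nlp(sentence))` yields `(word, tag)` pairs, including the `_` space markers. Using the first subword's label is a common heuristic; majority voting over subwords is an alternative.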
+# Huggingface Models
+1. `HoogBERTaEncoder`
+ - [HoogBERTa](https://huggingface.co/new5558/HoogBERTa): `Feature Extraction` and `Masked Language Modeling`
+2. `HoogBERTaMuliTaskTagger`:
+ - [HoogBERTa-NER-lst20](https://huggingface.co/new5558/HoogBERTa-NER-lst20): `Named-entity recognition (NER)` based on LST20
+ - [HoogBERTa-POS-lst20](https://huggingface.co/new5558/HoogBERTa-POS-lst20): `Part-of-speech tagging (POS)` based on LST20
+ - [HoogBERTa-SENTENCE-lst20](https://huggingface.co/new5558/HoogBERTa-SENTENCE-lst20): `Clause Boundary Classification` based on LST20
+# Citation
+Please cite as:
+``` bibtex
+@inproceedings{porkaew2021hoogberta,
+  title = {HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation},
+  author = {Peerachet Porkaew and Prachya Boonkwan and Thepchai Supnithi},
+  booktitle = {The Joint International Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2021)},
+  year = {2021},
+  address={Online}
+}
+```
+Download the full-text [PDF](https://drive.google.com/file/d/1hwdyIssR5U_knhPE2HJigrc0rlkqWeLF/view?usp=sharing).
+Check out the code on [GitHub](https://github.com/lstnlp/HoogBERTa).