lst-nectec/HoogBERTa-NER-lst20


new5558 committed on Apr 5, 2023

Commit c041e1c • 1 Parent(s): 4f922c5

Update README.md

Files changed (1)

README.md +61 -2


@@ -7,9 +7,9 @@ widget:
   - text: วัน ที่ _ 12 _ มีนาคม นี้ _ ฉัน จะ ไป เที่ยว วัดพระแก้ว _ ที่ กรุงเทพ
 library_name: transformers
 ---
-<!-- # HoogBERTa
-This repository includes the Thai pretrained language representation (HoogBERTa_base) and the fine-tuned model for multitask sequence labeling.   -->
 # Documentation
@@ -21,6 +21,65 @@ Since we use subword-nmt BPE encoding, input needs to be pre-tokenize using [BES
 pip install attacut
 ```
 # Citation
 Please cite as:

   - text: วัน ที่ _ 12 _ มีนาคม นี้ _ ฉัน จะ ไป เที่ยว วัดพระแก้ว _ ที่ กรุงเทพ
 library_name: transformers
 ---
+# HoogBERTa
+This repository contains the Thai pretrained language representation (HoogBERTa_base) fine-tuned for the named-entity recognition (NER) task.
 # Documentation
 pip install attacut
 ```
+## Getting Started
+To initialize the model from the Hugging Face Hub, use the following commands:
+```python
+from transformers import RobertaTokenizerFast, RobertaForTokenClassification
+from attacut import tokenize
+import torch
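+# Load the NER checkpoint from this repository; the base encoder has no trained classification head.
+tokenizer = RobertaTokenizerFast.from_pretrained("lst-nectec/HoogBERTa-NER-lst20")
+model = RobertaForTokenClassification.from_pretrained("lst-nectec/HoogBERTa-NER-lst20")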
+```
+To run NER tagging, use the following commands:
+```python
+from transformers import pipeline
+nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")
+sentence = "วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"
+all_sent = []
+sentences = sentence.split(" ")
+for sent in sentences:
+    all_sent.append(" ".join(tokenize(sent)).replace("_","[!und:]"))
+sentence = " _ ".join(all_sent)
+print(nlp(sentence))
+```
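With `aggregation_strategy="none"`, the pipeline returns one dictionary per token, with keys such as `entity`, `score`, and `word`. The sketch below shows one way to drop untagged tokens from that list. It assumes the non-entity label is `O`, which should be checked against `model.config.id2label`. The `keep_entities` helper, the sample list, and the `B_LOC` label are hypothetical and written by hand for illustration; they are not real model output, and other keys the pipeline returns are omitted.

```python
def keep_entities(predictions, outside_label="O"):
    """Return (word, entity, score) for tokens whose label is not the outside label."""
    return [
        (p["word"], p["entity"], round(float(p["score"]), 3))
        for p in predictions
        if p["entity"] != outside_label
    ]


# Hypothetical predictions in the pipeline's output shape (not real model output).
sample = [
    {"entity": "O", "score": 0.99, "word": "ฉัน"},
    {"entity": "B_LOC", "score": 0.98, "word": "กรุงเทพ"},
]

print(keep_entities(sample))
```

When the pipeline is called on a list of sentences, it returns one such list per sentence, so the helper would be applied to each inner list.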
+For batch processing:
+```python
+from transformers import pipeline
+nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")
+sentenceL = ["วันที่ 12 มีนาคมนี้","ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]
+inputList = []
+for sentX in sentenceL:
+    sentences = sentX.split(" ")
+    all_sent = []
+    for sent in sentences:
+        all_sent.append(" ".join(tokenize(sent)).replace("_","[!und:]"))
+    sentence = " _ ".join(all_sent)
+    inputList.append(sentence)
+print(nlp(inputList))
+```
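Both snippets above repeat the same pre-tokenization steps: split on spaces, word-tokenize each chunk, escape literal underscores as `[!und:]`, and join chunks with ` _ `. A minimal sketch of those steps as a reusable function follows. The `prepare_input` name and its `tokenize_fn` parameter are hypothetical, introduced here so the logic can be run without installing `attacut`; the character-splitting tokenizer in the demo is a stand-in, not a real Thai word segmenter.

```python
from typing import Callable, List


def prepare_input(text: str, tokenize_fn: Callable[[str], List[str]]) -> str:
    """Pre-tokenize text into the space-separated format used in the snippets above."""
    chunks = []
    for chunk in text.split(" "):
        # Word-tokenize each space-delimited chunk and join the words with spaces.
        words = " ".join(tokenize_fn(chunk))
        # Escape literal underscores, since "_" is reserved as the space marker.
        chunks.append(words.replace("_", "[!und:]"))
    # Original spaces between chunks become the " _ " marker.
    return " _ ".join(chunks)


# Stand-in tokenizer for demonstration only: splits a chunk into characters.
print(prepare_input("ab c_d", list))  # prints: a b _ c [!und:] d
```

With `attacut` installed, passing its `tokenize` function as `tokenize_fn` should reproduce the loops above, for example `[prepare_input(s, tokenize) for s in sentenceL]`.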
+# Hugging Face Models
+1. `HoogBERTaEncoder`:
+ - [HoogBERTa](https://huggingface.co/new5558/HoogBERTa): `Feature Extraction` and `Masked Language Modeling`
+2. `HoogBERTaMuliTaskTagger`:
+ - [HoogBERTa-NER-lst20](https://huggingface.co/new5558/HoogBERTa-NER-lst20): `Named-entity recognition (NER)` based on LST20
+ - [HoogBERTa-POS-lst20](https://huggingface.co/new5558/HoogBERTa-POS-lst20): `Part-of-speech tagging (POS)` based on LST20
+ - [HoogBERTa-SENTENCE-lst20](https://huggingface.co/new5558/HoogBERTa-SENTENCE-lst20): `Clause Boundary Classification` based on LST20
 # Citation
 Please cite as: