new5558 committed on
Commit
c041e1c
1 Parent(s): 4f922c5

Update README.md

Files changed (1)
  1. README.md +61 -2
README.md CHANGED
@@ -7,9 +7,9 @@ widget:
7
  - text: วัน ที่ _ 12 _ มีนาคม นี้ _ ฉัน จะ ไป เที่ยว วัดพระแก้ว _ ที่ กรุงเทพ
8
  library_name: transformers
9
  ---
10
- <!-- # HoogBERTa
11
 
12
- This repository includes the Thai pretrained language representation (HoogBERTa_base) and the fine-tuned model for multitask sequence labeling. -->
13
 
14
 
15
  # Documentation
@@ -21,6 +21,65 @@ Since we use subword-nmt BPE encoding, input needs to be pre-tokenize using [BES
21
  pip install attacut
22
  ```
23
 
24
  # Citation
25
 
26
  Please cite as:
 
7
  - text: วัน ที่ _ 12 _ มีนาคม นี้ _ ฉัน จะ ไป เที่ยว วัดพระแก้ว _ ที่ กรุงเทพ
8
  library_name: transformers
9
  ---
10
+ # HoogBERTa
11
 
12
+ This repository contains the Thai pretrained language representation model (HoogBERTa_base), fine-tuned for the named-entity recognition (NER) task.
13
 
14
 
15
  # Documentation
 
21
  pip install attacut
22
  ```
23
 
24
+ ## Getting Started
25
+ To initialize the model from the Hugging Face Hub, use the following commands:
26
+ ```python
27
+ from transformers import RobertaTokenizerFast, RobertaForTokenClassification
28
+ from attacut import tokenize
29
+ import torch
30
+
31
+ tokenizer = RobertaTokenizerFast.from_pretrained("new5558/HoogBERTa")
32
+ model = RobertaForTokenClassification.from_pretrained("new5558/HoogBERTa")
33
+ ```
34
+
35
+ To run NER tagging, use the following commands:
36
+
37
+ ```python
38
+ from transformers import pipeline
39
+
40
+ nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")
41
+
42
+ sentence = "วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"
43
+ all_sent = []
44
+ sentences = sentence.split(" ")
45
+ for sent in sentences:
46
+     all_sent.append(" ".join(tokenize(sent)).replace("_","[!und:]"))
47
+
48
+ sentence = " _ ".join(all_sent)
49
+
50
+ print(nlp(sentence))
51
+ ```
52
+
53
+ For batch processing, pass a list of pre-tokenized sentences to the pipeline:
54
+
55
+ ```python
56
+ from transformers import pipeline
57
+
58
+ nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")
59
+
60
+ sentenceL = ["วันที่ 12 มีนาคมนี้","ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]
61
+ inputList = []
62
+ for sentX in sentenceL:
63
+     sentences = sentX.split(" ")
64
+     all_sent = []
65
+     for sent in sentences:
66
+         all_sent.append(" ".join(tokenize(sent)).replace("_","[!und:]"))
67
+
68
+     sentence = " _ ".join(all_sent)
69
+     inputList.append(sentence)
70
+
71
+ print(nlp(inputList))
72
+ ```
73
+
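The single-sentence and batch snippets above repeat the same pre-tokenization steps: split on spaces, word-tokenize each chunk, escape literal underscores as `[!und:]`, and rejoin the chunks with ` _ `. A minimal sketch of how that could be factored into one helper follows. The names `prepare_input` and `prepare_batch` are hypothetical and not part of the README or of any library, and the word tokenizer is passed in as a parameter so the sketch runs without AttaCut installed.

```python
from typing import Callable, List

# Placeholder the README snippets substitute for a literal "_" in the text.
UNDERSCORE_ESCAPE = "[!und:]"
# Token the README snippets use to mark an original space between chunks.
SPACE_MARKER = "_"


def prepare_input(text: str, word_tokenize: Callable[[str], List[str]]) -> str:
    """Hypothetical helper mirroring the README's pre-tokenization loop."""
    chunks = []
    for chunk in text.split(" "):
        words = word_tokenize(chunk)
        # Join the words with spaces, then escape any literal underscore.
        chunks.append(" ".join(words).replace("_", UNDERSCORE_ESCAPE))
    # Rejoin the chunks, marking each original space with " _ ".
    return f" {SPACE_MARKER} ".join(chunks)


def prepare_batch(texts: List[str], word_tokenize: Callable[[str], List[str]]) -> List[str]:
    """Apply prepare_input to every string in a list."""
    return [prepare_input(t, word_tokenize) for t in texts]


# Stand-in tokenizer for illustration only: one token per character.
# In practice this argument would be attacut's `tokenize`.
demo = prepare_input("ab c_d", list)
print(demo)
```

With AttaCut installed, the same helper would presumably be called as `prepare_input(sentence, tokenize)` or `prepare_batch(sentenceL, tokenize)`, and the result passed to `nlp(...)` as in the snippets above.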
74
+ # Hugging Face Models
75
+ 1. `HoogBERTaEncoder`
76
+ - [HoogBERTa](https://huggingface.co/new5558/HoogBERTa): `Feature Extraction` and `Masked Language Modeling`
77
+ 2. `HoogBERTaMuliTaskTagger`:
78
+ - [HoogBERTa-NER-lst20](https://huggingface.co/new5558/HoogBERTa-NER-lst20): `Named-entity recognition (NER)` based on LST20
79
+ - [HoogBERTa-POS-lst20](https://huggingface.co/new5558/HoogBERTa-POS-lst20): `Part-of-speech tagging (POS)` based on LST20
80
+ - [HoogBERTa-SENTENCE-lst20](https://huggingface.co/new5558/HoogBERTa-SENTENCE-lst20): `Clause Boundary Classification` based on LST20
81
+
82
+
83
  # Citation
84
 
85
  Please cite as: