---
language:
- en
pipeline_tag: token-classification
tags:
- semantics
license: apache-2.0
---

# An English semantic tagging model based on `bert-base-uncased`

This is a `bert-base-uncased` model fine-tuned for **semantic tagging**.

As training data, I use the English portion (both gold and silver data) of the Parallel Meaning Bank's Universal Semantic Tags dataset [1].

## Inference

The model is trained to make predictions from the representation of the first subword of each word. Inference in the same setting as in training can be achieved with the code below ([Hugging Face's standard pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines) does not behave as intended here). Note that the pipeline below assumes that inputs are already split into words by spaces.

```python
import torch
from spacy_alignments.tokenizations import get_alignments
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("hfunakura/en-bertsemtagger-gold")
model = AutoModelForTokenClassification.from_pretrained("hfunakura/en-bertsemtagger-gold")

# Define the tagset (label id -> semantic tag)
id2semtag = {"0": "@@UNK@@", "1": "PRO", "2": "CTC", "3": "INT", "4": "EMP", "5": "DEC", "6": "ITJ", "7": "GRE", "8": "NEC", "9": "PFT", "10": "IMP", "11": "HAP", "12": "ROL", "13": "MOY", "14": "PRG", "15": "HAS", "16": "CLO", "17": "MOR", "18": "DEF", "19": "BUT", "20": "YOC", "21": "PRI", "22": "EQU", "23": "SUB", "24": "APX", "25": "REL", "26": "XCL", "27": "CON", "28": "GPO", "29": "QUE", "30": "DIS", "31": "IST", "32": "COL", "33": "SCO", "34": "GRP", "35": "EXS", "36": "FUT", "37": "ENS", "38": "QUC", "39": "DOM", "40": "SST", "41": "NIL", "42": "COO", "43": "QUV", "44": "PST", "45": "UNK", "46": "EXT", "47": "NTH", "48": "LIT", "49": "ORG", "50": "EXG", "51": "REF", "52": "DOW", "53": "TOP", "54": "EPS", "55": "DXT", "56": "AND", "57": "UOM", "58": "ALT", "59": "POS", "60": "PRX", "61": "GEO", "62": "BOT", "63": "DEG", "64": "ART", "65": "PER", "66": "GPE", "67": "EFS", "68": "DST", "69": "LES", "70": "ORD", "71": "NOT", "72": "NOW", "-100": "@@PAD@@"}


class SemtaggerPipeline:
    def __init__(self, model, tokenizer, id2semtag):
        self.model = model
        self.tokenizer = tokenizer
        self.id2semtag = id2semtag

    def predict(self, text):
        # Align subword tokens with the space-separated words
        encoding_list = self.tokenizer(text, add_special_tokens=False)
        encoded_tokens = self.tokenizer.convert_ids_to_tokens(encoding_list["input_ids"])
        words = text.split(" ")
        alignments = get_alignments(encoded_tokens, words)[1]

        # Mark the first subword of each word; predictions are read off at these positions
        is_first_list = []
        for alignment in alignments:
            is_first_list += [1] + [0] * (len(alignment) - 1)
        is_first = torch.tensor(is_first_list)

        # Run the model and extract one prediction per word
        encoding = self.tokenizer(text, return_tensors="pt", add_special_tokens=False)
        with torch.no_grad():
            logits = self.model(**encoding).logits
        preds = logits.argmax(-1)[0][is_first == 1]
        pred_labels = [self.id2semtag[str(int(i))] for i in preds]

        result = [f"{word}/{label}" for word, label in zip(words, pred_labels)]
        return " ".join(result)


pipeline = SemtaggerPipeline(model, tokenizer, id2semtag)
pipeline.predict("Jim and Mary smiled and left .")
```
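
Because `predict` expects its input to be already split into words by spaces, raw text should be word-tokenized first. The snippet below is a minimal sketch of one way to do this with spaCy's blank English tokenizer; spaCy and the `tag_raw_text` helper are assumptions here, and any word tokenizer that matches the Parallel Meaning Bank's tokenization can be substituted.

```python
import spacy

# Sketch only: whitespace-join the tokens produced by spaCy's blank English tokenizer
# before passing the text to the SemtaggerPipeline defined above.
nlp = spacy.blank("en")

def tag_raw_text(raw_text):
    words = [token.text for token in nlp(raw_text)]
    return pipeline.predict(" ".join(words))

tag_raw_text("Jim and Mary smiled and left.")
```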

## References

[1] Lasha Abzianidze and Johan Bos (2017). Towards Universal Semantic Tagging. In Proceedings of the 12th International Conference on Computational Semantics (IWCS 2017) -- Short Papers, pp. 1–6, Montpellier, France. https://pmb.let.rug.nl/data.php