---
language:
- en
pipeline_tag: token-classification
tags:
- semantics
license: apache-2.0
---

# An English semantic tagging model based on `bert-base-uncased`

This is a `bert-base-uncased` model fine-tuned for **semantic tagging**.

As training data, I use the English portion (both gold and silver data) of the Parallel Meaning Bank's Universal Semantic Tags dataset [1].

## Inference

The model is trained to make predictions from the representation of the first subword of each word. Inference in the same setting as in training can be achieved with the code below ([Hugging Face's standard pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines) does not behave as intended here). Note that the pipeline below assumes that inputs are already split into words by spaces.

```python
import torch
from spacy_alignments.tokenizations import get_alignments
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("hfunakura/en-bertsemtagger-gold")
model = AutoModelForTokenClassification.from_pretrained("hfunakura/en-bertsemtagger-gold")

# Define the tagset (label id -> semantic tag)
id2semtag = {"0": "@@UNK@@", "1": "PRO", "2": "CTC", "3": "INT", "4": "EMP", "5": "DEC", "6": "ITJ", "7": "GRE", "8": "NEC", "9": "PFT", "10": "IMP", "11": "HAP", "12": "ROL", "13": "MOY", "14": "PRG", "15": "HAS", "16": "CLO", "17": "MOR", "18": "DEF", "19": "BUT", "20": "YOC", "21": "PRI", "22": "EQU", "23": "SUB", "24": "APX", "25": "REL", "26": "XCL", "27": "CON", "28": "GPO", "29": "QUE", "30": "DIS", "31": "IST", "32": "COL", "33": "SCO", "34": "GRP", "35": "EXS", "36": "FUT", "37": "ENS", "38": "QUC", "39": "DOM", "40": "SST", "41": "NIL", "42": "COO", "43": "QUV", "44": "PST", "45": "UNK", "46": "EXT", "47": "NTH", "48": "LIT", "49": "ORG", "50": "EXG", "51": "REF", "52": "DOW", "53": "TOP", "54": "EPS", "55": "DXT", "56": "AND", "57": "UOM", "58": "ALT", "59": "POS", "60": "PRX", "61": "GEO", "62": "BOT", "63": "DEG", "64": "ART", "65": "PER", "66": "GPE", "67": "EFS", "68": "DST", "69": "LES", "70": "ORD", "71": "NOT", "72": "NOW", "-100": "@@PAD@@"}


class SemtaggerPipeline:
    def __init__(self, model, tokenizer, id2semtag):
        self.model = model
        self.tokenizer = tokenizer
        self.id2semtag = id2semtag

    def predict(self, text):
        # Align subword tokens with the space-separated words
        encoding_list = self.tokenizer(text, add_special_tokens=False)
        encoded_tokens = self.tokenizer.convert_ids_to_tokens(encoding_list["input_ids"])
        words = text.split(" ")
        alignments = get_alignments(encoded_tokens, words)[1]

        # Mark the first subword of each word; predictions are read off at these positions
        is_first_list = []
        for alignment in alignments:
            is_first_list += [1] + [0] * (len(alignment) - 1)
        is_first = torch.tensor(is_first_list)

        # Run the model and extract one prediction per word
        encoding = self.tokenizer(text, return_tensors="pt", add_special_tokens=False)
        with torch.no_grad():
            logits = self.model(**encoding).logits
        preds = logits.argmax(-1)[0][is_first == 1]
        pred_labels = [self.id2semtag[str(int(i))] for i in preds]

        result = [f"{word}/{label}" for word, label in zip(words, pred_labels)]
        return " ".join(result)


pipeline = SemtaggerPipeline(model, tokenizer, id2semtag)
pipeline.predict("Jim and Mary smiled and left .")
```
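
Because `predict` expects its input to be already split into words by spaces, raw text should be word-tokenized first. The snippet below is a minimal sketch of one way to do this with spaCy's blank English tokenizer; spaCy and the `tag_raw_text` helper are assumptions here, and any word tokenizer that matches the Parallel Meaning Bank's tokenization can be substituted.

```python
import spacy

# Sketch only: whitespace-join the tokens produced by spaCy's blank English tokenizer
# before passing the text to the SemtaggerPipeline defined above.
nlp = spacy.blank("en")

def tag_raw_text(raw_text):
    words = [token.text for token in nlp(raw_text)]
    return pipeline.predict(" ".join(words))

tag_raw_text("Jim and Mary smiled and left.")
```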

## References

[1] Lasha Abzianidze and Johan Bos (2017). Towards Universal Semantic Tagging. In Proceedings of the 12th International Conference on Computational Semantics (IWCS 2017) -- Short Papers, pp. 1–6, Montpellier, France. https://pmb.let.rug.nl/data.php