# Fine-tuned multilingual model for Russian-language NER
This is the model card for a fine-tuned version of Babelscape/wikineural-multilingual-ner, which has multilingual mBERT as its base. I fine-tuned it on the RCC-MSU/collection3 dataset for the token-classification task. The dataset uses the BIO tagging scheme with the following labels:

```python
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
```
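Under the BIO scheme, `B-` marks the first token of an entity span, `I-` marks its continuation, and `O` marks tokens outside any entity. A minimal illustration (the sentence is invented for this card, not drawn from the dataset):

```python
# Hypothetical tagged sentence, for illustration only:
tokens = ["Иван", "Иванов", "работает", "в", "МГУ"]  # "Ivan Ivanov works at MSU"
tags   = ["B-PER", "I-PER", "O",        "O", "B-ORG"]
```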
## Model Details
Fine-tuning ran for 3 epochs and produced the following metrics:
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|-------|---------------|-----------------|-----------|--------|----|----------|
| 1 | 0.041000 | 0.032810 | 0.959569 | 0.974253 | 0.966855 | 0.993325 |
| 2 | 0.020800 | 0.028395 | 0.959569 | 0.974253 | 0.966855 | 0.993325 |
| 3 | 0.010500 | 0.029138 | 0.963239 | 0.973767 | 0.968474 | 0.993247 |
To avoid over-fitting on the relatively small number of training samples, I used a high `weight_decay` of 0.1.
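For context, a sketch of what such a training configuration might look like with `transformers.TrainingArguments`. Only `num_train_epochs=3` and `weight_decay=0.1` come from this card; the learning rate, batch size, and output directory are illustrative assumptions:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="msu-wiki-ner",       # assumed output path
    num_train_epochs=3,              # stated above
    weight_decay=0.1,                # high value to counter over-fitting
    learning_rate=2e-5,              # assumed; a common default for BERT fine-tuning
    per_device_train_batch_size=16,  # assumed
)
```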
## Basic usage
You can use this model through the `pipeline` API for the `token-classification` task:
```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_ckpt = "nesemenpolkov/msu-wiki-ner"

# BIO labels the model was fine-tuned with
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForTokenClassification.from_pretrained(
    model_ckpt,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,
)

pipe = pipeline(
    task="token-classification",
    model=model,
    tokenizer=tokenizer,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    aggregation_strategy="simple",  # merge sub-word tokens into whole entities
)

demo_sample = "Этот Иван Иванов, в паспорте Иванов И.И."  # "This is Ivan Ivanov, in the passport Ivanov I.I."
with torch.no_grad():
    out = pipe(demo_sample)
```
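With `aggregation_strategy="simple"`, the pipeline returns one dict per detected entity, with keys such as `entity_group`, `word`, `score`, `start`, and `end`. A minimal sketch of inspecting the result (exact words and scores will vary with the input):

```python
for ent in out:
    # e.g. PER  'Иван Иванов' (score=0.998)
    print(f'{ent["entity_group"]:<4} {ent["word"]!r} (score={ent["score"]:.3f})')
```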
## Bias, Risks, and Limitations
This model is a fine-tuned version of Babelscape/wikineural-multilingual-ner, trained on the Russian-language NER dataset RCC-MSU/collection3. It may score poorly on texts in other languages.
## Citation

```bibtex
@inproceedings{nesemenpolkov-2024-msu-wiki-ner,
    title = "Fine-tuned multilingual model for Russian language NER",
    author = "nesemenpolkov",
    booktitle = "Detecting names in noisy and dirty data",
    month = oct,
    year = "2024",
    address = "Moscow, Russian Federation",
}
```