Fine-tuned multilingual model for russian language NER

This is the model card for fine-tuned Babelscape/wikineural-multilingual-ner, which has multilingual mBERT as its base. I`ve fine-tuned it using RCC-MSU/collection3 dataset for token-classification task. The dataset has BIO-pattern and following labels:

label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

Model Details

Fine-tuning was proceeded in 3 epochs, and computed next metrics:

Epoch Training Loss Validation Loss Precision Recall F1 Accuracy
1 0.041000 0.032810 0.959569 0.974253 0.966855 0.993325
2 0.020800 0.028395 0.959569 0.974253 0.966855 0.993325
3 0.010500 0.029138 0.963239 0.973767 0.968474 0.993247

To avoid over-fitting due to a small amount of training samples, i used high weight_decay = 0.1.

Basic usage

So, you can easily use this model with pipeline for 'token-classification' task.

import torch

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
from datasets import load_dataset


model_ckpt = "nesemenpolkov/msu-wiki-ner"

label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForTokenClassification.from_pretrained(
    model_ckpt,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True
)

pipe = pipeline(
    task="token-classification",
    model=model,
    tokenizer=tokenizer,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    aggregation_strategy="simple"
)

demo_sample = "Этот Иван Иванов, в паспорте Иванов И.И."

with torch.no_grad():
    out = pipe(demo_sample)

Bias, Risks, and Limitations

This model is finetuned version of Babelscape/wikineural-multilingual-ner, on a russian language NER dataset RCC-MSU/collection3. It can show low scores on another language texts.

Citation [optional]

@inproceedings{tedeschi-etal-2021-wikineural-combined,
    title = "Fine-tuned multilingual model for russian language NER.",
    author = "nesemenpolkov",
    booktitle = "Detecting names in noisy and dirty data.",
    month = oct,
    year = "2024",
    address = "Moscow, Russian Federation",
}
Downloads last month
271
Safetensors
Model size
177M params
Tensor type
F32
·
Inference Providers NEW
This model is not currently available via any of the supported third-party Inference Providers, and the model is not deployed on the HF Inference API.

Model tree for nesemenpolkov/msu-wiki-ner

Finetuned
(4)
this model

Dataset used to train nesemenpolkov/msu-wiki-ner