BERTić-COMtext-SR-legal-MSD-ekavica

BERTić-COMtext-SR-legal-MSD-ekavica is a variant of the BERTić model, fine-tuned to predict morphosyntactic description (MSD) tags in Serbian legal texts written in the Ekavian pronunciation. The model was fine-tuned for 15 epochs on the Ekavian variant of the COMtext.SR.legal dataset.
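
As a rough illustration of how a fine-tuned token-classification checkpoint such as this one might be applied to pre-tokenized text, here is a minimal sketch using the Hugging Face transformers library. The Hub identifier is not stated in this card, so `model_id` has to be supplied by the reader. The function names `first_subword_tags` and `tag_words` are illustrative, and labelling each word with the prediction for its first subword is one common convention, assumed here rather than taken from the authors' code. The sketch also assumes a fast tokenizer, which `word_ids` requires.

```python
def first_subword_tags(word_ids, labels):
    """Keep one label per word: the label predicted for its first subword.

    word_ids: per-subword word index (None for special tokens).
    labels:   per-subword predicted label strings, same length as word_ids.
    """
    tags, previous = [], None
    for word_id, label in zip(word_ids, labels):
        if word_id is not None and word_id != previous:
            tags.append(label)
        previous = word_id
    return tags


def tag_words(words, model_id):
    """Tag a list of tokens with MSD labels; model_id is supplied by the caller."""
    # Imported lazily so the helper above works without these packages installed.
    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(model_id)
    model.eval()

    # Inputs longer than the model's maximum length are truncated,
    # in which case fewer tags than words are returned.
    encoding = tokenizer(
        words, is_split_into_words=True, return_tensors="pt", truncation=True
    )
    with torch.no_grad():
        logits = model(**encoding).logits[0]
    labels = [model.config.id2label[i] for i in logits.argmax(dim=-1).tolist()]
    return first_subword_tags(encoding.word_ids(0), labels)
```

Calling `tag_words(["Sud", "je", "doneo", "odluku", "."], model_id)` with the checkpoint's identifier would return one MSD tag per input token.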

Benchmarking

This model was evaluated on the tasks of MSD prediction and lemmatization of Serbian legal texts. Lemmatization was performed using the predicted MSD tags and the srLex inflectional lexicon.
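
Lexicon-based lemmatization of this kind can be pictured as a lookup keyed by word form and predicted MSD tag. The sketch below is only an illustration: the dictionary layout and the three toy entries are assumptions, not the actual srLex format, and returning the token unchanged when no entry matches is an assumed fallback rather than the documented behavior.

```python
# Toy lexicon: (lowercased word form, MSD tag) -> lemma.
# The entries and the dict layout are illustrative, not srLex's actual format.
TOY_LEXICON = {
    ("zakona", "Ncmsg"): "zakon",
    ("sudovi", "Ncmpn"): "sud",
    ("je", "Var3s"): "biti",
}


def lemmatize(tokens, msd_tags, lexicon):
    """Look up each (form, tag) pair; unknown pairs keep the original token."""
    lemmas = []
    for token, tag in zip(tokens, msd_tags):
        key = (token.lower(), tag)
        # Fallback (an assumption): return the token unchanged if no entry matches.
        lemmas.append(lexicon.get(key, token))
    return lemmas
```

Because the tag is part of the key, a wrong MSD prediction can cause a lookup miss, which is why lemmatization quality depends on tagging quality in a setup like this.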

Accuracy and Word Error Rate were used as evaluation metrics.
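
The two metrics can be sketched as follows. The exact scoring procedure is not described here, so this assumes the usual definitions: accuracy is the share of positions where prediction and reference match, and Word Error Rate is the Levenshtein edit distance (substitutions, insertions and deletions) between the predicted and reference sequences, divided by the reference length. Accuracy needs one prediction per reference token, whereas WER stays defined when the two sequences differ in length, for example after automatic tokenization.

```python
def accuracy(reference, predicted):
    """Fraction of positions where the predicted item equals the reference."""
    if len(reference) != len(predicted):
        raise ValueError("accuracy needs one prediction per reference token")
    correct = sum(r == p for r, p in zip(reference, predicted))
    return correct / len(reference)


def word_error_rate(reference, predicted):
    """Levenshtein distance between the sequences, divided by reference length."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j predicted items.
    previous = list(range(len(predicted) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, pred_item in enumerate(predicted, start=1):
            cost = 0 if ref_item == pred_item else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(reference)
```

Both functions accept any sequences of comparable items, so the same code applies to MSD tags and to lemmas.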

This model was compared to:

  • The CLASSLA library
  • A variant of BERTić fine-tuned for MSD prediction using the SETimes.SR 2.0 corpus of newswire texts
  • SrBERTa, a model specially trained on Serbian legal texts

All transformer-based language models were fine-tuned for 15 epochs. CLASSLA and BERTić-SETimes were tested directly on the entire COMtext.SR.legal.ekavica corpus. BERTić-COMtext-SR-legal-MSD-ekavica and SrBERTa were fine-tuned and evaluated on the same corpus using 10-fold cross-validation.
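
A 10-fold split of the kind mentioned above could be produced as in the following sketch. The unit being split (sentences or whole documents), the shuffling and the seed are all assumptions, since the splitting procedure is not described here; `k_fold_splits` is an illustrative name.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) pairs so that each item is held out exactly once."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    # Fold f takes every k-th shuffled index, starting at position f.
    folds = [indices[f::k] for f in range(k)]
    for held_out in range(k):
        test = [items[i] for i in folds[held_out]]
        train = [
            items[i]
            for f, fold in enumerate(folds)
            if f != held_out
            for i in fold
        ]
        yield train, test
```

In each of the ten rounds a model would be fine-tuned on `train` and scored on `test`, so every item in the corpus is evaluated by a model that never saw it during fine-tuning.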

The code and data needed to run these experiments are available in the COMtext.SR GitHub repository.

Results

Model                                                   | MSD ACC | MSD WER | Lemma ACC | Lemma WER
--------------------------------------------------------|---------|---------|-----------|----------
CLASSLA-SR (gold tokens)                                | 0.9144  | 0.0856  | 0.9432    | 0.0568
CLASSLA-SR (CLASSLA tokenizer)                          | /       | 0.0983  | /         | 0.0739
BERTić-SETimes (gold tokens)                            | 0.9231  | 0.0768  | 0.9649    | 0.0351
BERTić-SETimes (CLASSLA tokenizer)                      | /       | 0.0884  | /         | 0.0542
BERTić-COMtext-SR-legal-MSD-ekavica (gold tokens)       | 0.9674  | 0.0326  | 0.9666    | 0.0334
BERTić-COMtext-SR-legal-MSD-ekavica (CLASSLA tokenizer) | /       | 0.0447  | /         | 0.0526
SrBERTa (gold tokens)                                   | 0.9288  | 0.0712  | 0.9391    | 0.0609
SrBERTa (CLASSLA tokenizer)                             | /       | 0.0851  | /         | 0.0819

Model size: 110M params (Safetensors, tensor type F32)