XLM-RoBERTa-base NER model for Slavic languages
The train / eval / test splits were concatenated from all languages, in the order specified on the command line: sl, hr, sr, bs, mk, sq, cs, bg, pl, ru, sk, uk.
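As a rough illustration of that concatenation, the sketch below merges per-language splits in the listed order. It assumes each language's data is available as a dict mapping split names to lists of examples; `LANGUAGE_ORDER`, `concat_splits`, and this data layout are hypothetical and are not taken from the project's own scripts.

```python
# Language order as listed in the text above.
LANGUAGE_ORDER = ["sl", "hr", "sr", "bs", "mk", "sq", "cs", "bg", "pl", "ru", "sk", "uk"]


def concat_splits(per_language, order=LANGUAGE_ORDER):
    """Concatenate train / eval / test splits language by language.

    per_language: {lang: {"train": [...], "eval": [...], "test": [...]}}
    (an assumed layout; languages or splits that are missing are skipped).
    """
    merged = {"train": [], "eval": [], "test": []}
    for lang in order:
        for split in merged:
            merged[split].extend(per_language.get(lang, {}).get(split, []))
    return merged


# Toy data: the result follows LANGUAGE_ORDER, not the dict's insertion order.
data = {
    "hr": {"train": ["hr-1"]},
    "sl": {"train": ["sl-1"], "test": ["sl-t1"]},
}
merged = concat_splits(data)
```

Because the order is fixed by the list rather than by the input dict, repeated runs produce the same merged splits.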
We used the following hyper-parameters:
- maximum tokenizer sequence length of 256
- PyTorch's AdamW optimizer with a learning rate of 2e-5
- batch size of 20
- 40 epochs (preliminary runs showed the best F1-scores between epochs 15 and 35)
- F1-score for best-model selection and for tracking training progress.
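The settings above can be collected in a small configuration object, together with a helper that picks the best epoch by F1-score. This is a minimal sketch, not the project's training code: `TrainConfig` and `select_best_epoch` are hypothetical names, and it assumes one F1-score is recorded per epoch on the evaluation split.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyper-parameters as listed above (field names are illustrative)."""
    max_length: int = 256          # tokenizer maximum sequence length
    learning_rate: float = 2e-5    # AdamW learning rate
    batch_size: int = 20
    num_epochs: int = 40
    selection_metric: str = "f1"   # metric used to pick the best model


def select_best_epoch(f1_by_epoch):
    """Return (epoch, f1) for the highest F1-score; epochs are 1-indexed.

    Ties go to the earliest epoch, since max() keeps the first maximum.
    """
    best = max(range(len(f1_by_epoch)), key=lambda i: f1_by_epoch[i])
    return best + 1, f1_by_epoch[best]


config = TrainConfig()
# Made-up scores for three epochs, used only to show the selection logic.
best_epoch, best_f1 = select_best_epoch([0.80, 0.91, 0.89])
```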
Based on "Analysis of Transfer Learning for Named Entity Recognition in South-Slavic Languages" (Ivačič et al., BSNLP 2023).
Used NER Corpora
We used the following NER corpora:
@misc{11356/1747,
title = {Training corpus {SUK} 1.0},
author = {Arhar Holdt, {\v S}pela and Krek, Simon and Dobrovoljc, Kaja and Erjavec, Toma{\v z} and Gantar, Polona and {\v C}ibej, Jaka and Pori, Eva and Ter{\v c}on, Luka and Munda, Tina and {\v Z}itnik, Slavko and Robida, Nejc and Blagus, Neli and Mo{\v z}e, Sara and Ledinek, Nina and Holz, Nanika and Zupan, Katja and Kuzman, Taja and Kav{\v c}i{\v c}, Teja and {\v S}krjanec, Iza and Marko, Dafne and Jezer{\v s}ek, Lucija and Zajc, Anja},
url = {http://hdl.handle.net/11356/1747},
note = {Slovenian language resource repository {CLARIN}.{SI}},
copyright = {Creative Commons - Attribution-{NonCommercial}-{ShareAlike} 4.0 International ({CC} {BY}-{NC}-{SA} 4.0)},
issn = {2820-4042},
year = {2022}
}
BSNLP: 3rd Shared Task on SlavNER
We merged the 2017 and 2021 training data with the 2021 test data and created custom train / dev / test splits.
We also mapped the EVT (event) and PRO (product) tags to MISC to align the corpus with the others.
You can change the mappings by running a custom prepare-corpus step (see above).
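The tag mapping can be sketched as follows. This assumes BIO-style labels such as `B-EVT` and `I-PRO`; `TAG_MAP`, `remap_tag`, and `remap_sentence` are hypothetical helpers for illustration and do not reproduce the actual prepare-corpus step.

```python
# Entity types folded into MISC, as described above.
TAG_MAP = {"EVT": "MISC", "PRO": "MISC"}


def remap_tag(tag, mapping=TAG_MAP):
    """Map the entity type of one BIO tag, keeping its B-/I- prefix."""
    prefix, sep, entity_type = tag.partition("-")
    if not sep:
        return tag  # "O" (or any tag without a prefix) is left unchanged
    return f"{prefix}-{mapping.get(entity_type, entity_type)}"


def remap_sentence(tags, mapping=TAG_MAP):
    """Apply remap_tag to every tag in a sentence."""
    return [remap_tag(tag, mapping) for tag in tags]


remapped = remap_sentence(["B-EVT", "I-EVT", "O", "B-PER", "B-PRO"])
```

Passing a different `mapping` dict changes which types are merged, which mirrors the idea of customizing the mapping.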
@misc{11356/1183,
title = {Training corpus hr500k 1.0},
author = {Ljube{\v s}i{\'c}, Nikola and Agi{\'c}, {\v Z}eljko and Klubi{\v c}ka, Filip and Batanovi{\'c}, Vuk and Erjavec, Toma{\v z}},
url = {http://hdl.handle.net/11356/1183},
note = {Slovenian language resource repository {CLARIN}.{SI}},
copyright = {Creative Commons - Attribution-{ShareAlike} 4.0 International ({CC} {BY}-{SA} 4.0)},
issn = {2820-4042},
year = {2018}
}
@misc{11356/1200,
title = {Training corpus {SETimes}.{SR} 1.0},
author = {Batanovi{\'c}, Vuk and Ljube{\v s}i{\'c}, Nikola and Samard{\v z}i{\'c}, Tanja and Erjavec, Toma{\v z}},
url = {http://hdl.handle.net/11356/1200},
note = {Slovenian language resource repository {CLARIN}.{SI}},
copyright = {Creative Commons - Attribution-{ShareAlike} 4.0 International ({CC} {BY}-{SA} 4.0)},
issn = {2820-4042},
year = {2018}
}
- Massively Multilingual Transfer for NER, nicknamed WikiAnn
@inproceedings{rahimi-etal-2019-massively,
title = "Massively Multilingual Transfer for {NER}",
author = "Rahimi, Afshin and
Li, Yuan and
Cohn, Trevor",
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/P19-1015",
pages = "151--164",
}
@Inbook{Strakova2016,
author="Strakov{\'a}, Jana and Straka, Milan and Haji{\v{c}}, Jan",
editor="Sojka, Petr and Hor{\'a}k, Ale{\v{s}} and Kope{\v{c}}ek, Ivan and Pala, Karel",
title="Neural Networks for Featureless Named Entity Recognition in Czech",
bookTitle="Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno , Czech Republic, September 12-16, 2016, Proceedings",
year="2016",
publisher="Springer International Publishing",
address="Cham",
pages="173--181",
isbn="978-3-319-45510-5",
doi="10.1007/978-3-319-45510-5_20",
url="http://dx.doi.org/10.1007/978-3-319-45510-5_20"
}
NER Evaluation
For evaluation, we use seqeval:
@misc{seqeval,
title={{seqeval}: A Python framework for sequence labeling evaluation},
url={https://github.com/chakki-works/seqeval},
note={Software available from https://github.com/chakki-works/seqeval},
author={Hiroki Nakayama},
year={2018},
}
seqeval is based on:
@inproceedings{ramshaw-marcus-1995-text,
title = "Text Chunking using Transformation-Based Learning",
author = "Ramshaw, Lance and
Marcus, Mitch",
booktitle = "Third Workshop on Very Large Corpora",
year = "1995",
url = "https://www.aclweb.org/anthology/W95-0107",
}
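To make entity-level scoring concrete, here is a simplified, dependency-free sketch of chunk-based precision, recall, and F1 over BIO tags. It is not seqeval's implementation and may differ from it on edge cases; `extract_spans` and `entity_prf` are hypothetical names. An entity counts as correct only when its type and both boundaries match.

```python
def extract_spans(tags):
    """Return the set of (type, start, end) entity spans in one BIO sentence."""
    spans = set()
    start, current = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        prefix, _, entity_type = tag.partition("-")
        # A span ends at "O", at any B- tag, or at an I- tag of another type.
        if current is not None and (prefix != "I" or entity_type != current):
            spans.add((current, start, i - 1))
            start, current = None, None
        # A span starts at B-, or at an I- tag that does not continue a span.
        if prefix in ("B", "I") and current is None:
            start, current = i, entity_type
    return spans


def entity_prf(y_true, y_pred):
    """Micro-averaged entity-level precision, recall, and F1 over sentences."""
    tp = n_true = n_pred = 0
    for gold, pred in zip(y_true, y_pred):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        tp += len(gold_spans & pred_spans)
        n_true += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: the prediction finds the PER entity but misses the LOC entity.
y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "O"]]
precision, recall, f1 = entity_prf(y_true, y_pred)
```

In the toy example one of two gold entities is recovered and the single predicted entity is correct, so precision is 1.0 and recall is 0.5.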
Evaluation results (self-reported)
- Accuracy: 98.346
- F1-score: 93.158
- Precision: 92.700
- Recall: 93.622
- LOC Precision: 94.105
- LOC Recall: 95.513
- LOC F1-score: 94.804
- MISC Precision: 85.196
- MISC Recall: 85.545
- MISC F1-score: 85.370