---
license: apache-2.0
language:
- sl
- hr
- sr
- mk
- cs
- bs
- bg
- pl
- ru
- uk
- sk
- sq
pipeline_tag: token-classification
model-index:
- name: xlmr-ner-slavic
  results:
  - task:
      type: token-classification
    metrics:
    - name: Accuracy
      type: Accuracy
      value: 98.346
    - name: F1-score
      type: F1-score
      value: 93.158
    - name: Precision
      type: Precision
      value: 92.700
    - name: Recall
      type: Recall
      value: 93.622
    - name: LOC Precision
      type: LOC Precision
      value: 94.105
    - name: LOC Recall
      type: LOC Recall
      value: 95.513
    - name: LOC F1-score
      type: LOC F1-score
      value: 94.804
    - name: MISC Precision
      type: MISC Precision
      value: 85.196
    - name: MISC Recall
      type: MISC Recall
      value: 85.545
    - name: MISC F1-score
      type: MISC F1-score
      value: 85.370
    - name: ORG Precision
      type: ORG Precision
      value: 91.226
    - name: ORG Recall
      type: ORG Recall
      value: 91.519
    - name: ORG F1-score
      type: ORG F1-score
      value: 91.372
    - name: PER Precision
      type: PER Precision
      value: 94.995
    - name: PER Recall
      type: PER Recall
      value: 96.191
    - name: PER F1-score
      type: PER F1-score
      value: 95.589
---
## XLM-RoBERTa-base NER model for Slavic languages
The train / eval / test splits were built by concatenating the data for all languages, in the order given on the command line:
`sl, hr, sr, bs, mk, sq, cs, bg, pl, ru, sk, uk`
We used the following hyper-parameters:
* maximum tokenizer sequence length of 256
* PyTorch's AdamW optimizer with a learning rate of 2e-5
* batch size of 20
* 40 epochs (preliminary runs showed the best F1-scores between epochs 15 and 35)
* F1-score as the criterion for best-model selection and for tracking training progress
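The settings above can be collected in a small configuration object, and F1-based model selection can be sketched as picking the epoch with the highest evaluation F1. This is a minimal illustration, not the training script used for this model: `TrainConfig`, `select_best_epoch`, and the F1 values in the example are hypothetical names and numbers introduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TrainConfig:
    """Hyper-parameters listed in this model card (field names are assumptions)."""
    max_length: int = 256         # maximum tokenizer sequence length
    learning_rate: float = 2e-5   # AdamW learning rate
    batch_size: int = 20
    epochs: int = 40


def select_best_epoch(f1_by_epoch: List[float]) -> Tuple[int, float]:
    """Return (1-based epoch, F1) for the highest F1; ties go to the earliest epoch."""
    if not f1_by_epoch:
        raise ValueError("no evaluation scores given")
    best_index = max(range(len(f1_by_epoch)), key=lambda i: f1_by_epoch[i])
    return best_index + 1, f1_by_epoch[best_index]


if __name__ == "__main__":
    config = TrainConfig()
    # Made-up evaluation F1 values for the first three epochs, for illustration only.
    example_f1 = [0.90, 0.93, 0.92]
    print(config, select_best_epoch(example_f1))
```

Keeping the checkpoint from the best-scoring epoch, rather than the last one, matters here because the best F1-scores were observed well before the final epoch.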
<!---
```
{
"xlmrb-sl_hr_sr_bs_mk_sq_cs_bg_pl_ru_sk_uk": {
"LOC": {
"precision": 0.9410536270144608,
"recall": 0.955128974205159,
"f1": 0.9480390600190536,
"number": 25005
},
"MISC": {
"precision": 0.8519650655021834,
"recall": 0.8554516223326513,
"f1": 0.8537047841306884,
"number": 6842
},
"ORG": {
"precision": 0.9122568093385214,
"recall": 0.915194691129111,
"f1": 0.9137233887075559,
"number": 20494
},
"PER": {
"precision": 0.9499552728357022,
"recall": 0.9619061996779388,
"f1": 0.955893384007601,
"number": 19872
},
"overall_precision": 0.9269994926711549,
"overall_recall": 0.9362164707185687,
"overall_f1": 0.931585184368627,
"overall_accuracy": 0.9834613206674987
}
}
```
-->
Based on
[Analysis of Transfer Learning for Named Entity Recognition in South-Slavic Languages](https://aclanthology.org/2023.bsnlp-1.13) (Ivačič et al., BSNLP 2023)
## Used NER Corpora
We used the following NER corpora:
- [Training corpus SUK 1.0](https://www.clarin.si/repository/xmlui/handle/11356/1747)
```
@misc{11356/1747,
title = {Training corpus {SUK} 1.0},
author = {Arhar Holdt, {\v S}pela and Krek, Simon and Dobrovoljc, Kaja and Erjavec, Toma{\v z} and Gantar, Polona and {\v C}ibej, Jaka and Pori, Eva and Ter{\v c}on, Luka and Munda, Tina and {\v Z}itnik, Slavko and Robida, Nejc and Blagus, Neli and Mo{\v z}e, Sara and Ledinek, Nina and Holz, Nanika and Zupan, Katja and Kuzman, Taja and Kav{\v c}i{\v c}, Teja and {\v S}krjanec, Iza and Marko, Dafne and Jezer{\v s}ek, Lucija and Zajc, Anja},
url = {http://hdl.handle.net/11356/1747},
note = {Slovenian language resource repository {CLARIN}.{SI}},
copyright = {Creative Commons - Attribution-{NonCommercial}-{ShareAlike} 4.0 International ({CC} {BY}-{NC}-{SA} 4.0)},
issn = {2820-4042},
year = {2022}
}
```
- [BSNLP: 3rd Shared Task on SlavNER](http://bsnlp.cs.helsinki.fi/shared-task.html)
We merged the 2017 and 2021 training data with the 2021 test data and created custom train / dev / test splits.
We also mapped the EVT (event) and PRO (product) tags to MISC to align this corpus with the others.
You can change these mappings by running a custom corpus-preparation step.
- [Training corpus hr500k 1.0](https://www.clarin.si/repository/xmlui/handle/11356/1183)
```
@misc{11356/1183,
title = {Training corpus hr500k 1.0},
author = {Ljube{\v s}i{\'c}, Nikola and Agi{\'c}, {\v Z}eljko and Klubi{\v c}ka, Filip and Batanovi{\'c}, Vuk and Erjavec, Toma{\v z}},
url = {http://hdl.handle.net/11356/1183},
note = {Slovenian language resource repository {CLARIN}.{SI}},
copyright = {Creative Commons - Attribution-{ShareAlike} 4.0 International ({CC} {BY}-{SA} 4.0)},
issn = {2820-4042},
year = {2018}
}
```
- [Training corpus SETimes.SR 1.0](https://www.clarin.si/repository/xmlui/handle/11356/1200)
```
@misc{11356/1200,
title = {Training corpus {SETimes}.{SR} 1.0},
author = {Batanovi{\'c}, Vuk and Ljube{\v s}i{\'c}, Nikola and Samard{\v z}i{\'c}, Tanja and Erjavec, Toma{\v z}},
url = {http://hdl.handle.net/11356/1200},
note = {Slovenian language resource repository {CLARIN}.{SI}},
copyright = {Creative Commons - Attribution-{ShareAlike} 4.0 International ({CC} {BY}-{SA} 4.0)},
issn = {2820-4042},
year = {2018}
}
```
- [Massively Multilingual Transfer for NER](https://github.com/afshinrahimi/mmner), nicknamed WikiAnn
```
@inproceedings{rahimi-etal-2019-massively,
title = "Massively Multilingual Transfer for {NER}",
author = "Rahimi, Afshin and
Li, Yuan and
Cohn, Trevor",
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/P19-1015",
pages = "151--164",
}
```
- [Neural Networks for Featureless Named Entity Recognition in Czech.](https://github.com/strakova/ner_tsd2016)
```
@Inbook{Strakova2016,
author="Strakov{\'a}, Jana and Straka, Milan and Haji{\v{c}}, Jan",
editor="Sojka, Petr and Hor{\'a}k, Ale{\v{s}} and Kope{\v{c}}ek, Ivan and Pala, Karel",
title="Neural Networks for Featureless Named Entity Recognition in Czech",
bookTitle="Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno , Czech Republic, September 12-16, 2016, Proceedings",
year="2016",
publisher="Springer International Publishing",
address="Cham",
pages="173--181",
isbn="978-3-319-45510-5",
doi="10.1007/978-3-319-45510-5_20",
url="http://dx.doi.org/10.1007/978-3-319-45510-5_20"
}
```
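The EVT and PRO to MISC mapping described for the BSNLP corpus can be sketched as a small tag-rewriting function. This is an illustrative sketch only: it assumes BIO-style tags with `B-` / `I-` prefixes, which the corpus description above does not state, and `TAG_MAP`, `map_tag`, and `map_tags` are hypothetical names rather than part of the actual corpus-preparation code.

```python
from typing import Dict, List

# Mapping stated in this model card: events and products become MISC.
TAG_MAP: Dict[str, str] = {"EVT": "MISC", "PRO": "MISC"}


def map_tag(tag: str, mapping: Dict[str, str] = TAG_MAP) -> str:
    """Rewrite the entity type of one tag, keeping any B-/I- prefix (assumed BIO scheme)."""
    if "-" not in tag:
        # Bare tags such as "O" pass through unless they are themselves mapped.
        return mapping.get(tag, tag)
    prefix, entity = tag.split("-", 1)
    return f"{prefix}-{mapping.get(entity, entity)}"


def map_tags(tags: List[str], mapping: Dict[str, str] = TAG_MAP) -> List[str]:
    """Apply map_tag to every tag of a sentence."""
    return [map_tag(tag, mapping) for tag in tags]


if __name__ == "__main__":
    print(map_tags(["B-EVT", "I-EVT", "O", "B-PER", "B-PRO"]))
```

Passing a different `mapping` dictionary is one way to experiment with other alignments, for example keeping EVT as its own class.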
### NER Evaluation
For evaluation, we use [seqeval](https://huggingface.co/spaces/evaluate-metric/seqeval):
```
@misc{seqeval,
title={{seqeval}: A Python framework for sequence labeling evaluation},
url={https://github.com/chakki-works/seqeval},
note={Software available from https://github.com/chakki-works/seqeval},
author={Hiroki Nakayama},
year={2018},
}
```
seqeval is based on the following work:
```
@inproceedings{ramshaw-marcus-1995-text,
title = "Text Chunking using Transformation-Based Learning",
author = "Ramshaw, Lance and
Marcus, Mitch",
booktitle = "Third Workshop on Very Large Corpora",
year = "1995",
url = "https://www.aclweb.org/anthology/W95-0107",
}
```
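To make the entity-level scores concrete, the sketch below computes precision, recall, and F1 over whole entity spans rather than individual tokens. It is a simplified illustration written for this card, not seqeval's implementation: it assumes BIO tags, treats an `I-` tag without a preceding `B-` as the start of a span, and the names `extract_entities` and `entity_prf` are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def extract_entities(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence."""
    entities: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" acts as a sentinel that closes any span still open.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, label = tag.partition("-")
        # An open span ends at "O", at a new "B-", or at an "I-" of another type.
        if start is not None and (prefix != "I" or label != etype):
            entities.add((etype, start, i))
            start, etype = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            start, etype = i, label
    return entities


def entity_prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1; a span counts only on an exact match."""
    tp = n_pred = n_gold = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = extract_entities(gold_tags), extract_entities(pred_tags)
        tp += len(g & p)
        n_pred += len(p)
        n_gold += len(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = [["B-PER", "I-PER", "O", "B-LOC"]]
    pred = [["B-PER", "I-PER", "O", "B-ORG"]]
    print(entity_prf(gold, pred))  # one of two spans matches: (0.5, 0.5, 0.5)
```

Because a span must match in both boundaries and type to count, entity-level F1 is stricter than token-level accuracy, which is why the two numbers reported for this model differ.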