license: eupl-1.1
datasets:
- ehri-ner/ehri-ner-all
language:
- cs
- de
- en
- fr
- hu
- nl
- pl
- sk
- yi
metrics:
- name: f1
  type: f1
  value: 81.5
pipeline_tag: token-classification
tags:
- Holocaust
- EHRI
base_model: FacebookAI/xlm-roberta-large
Model Card for ehri-ner/xlm-roberta-large-ehri-ner-all
The European Holocaust Research Infrastructure (EHRI) aims to support Holocaust research by making information about dispersed Holocaust material accessible and interconnected through its services. A tool that detects named entities in texts such as Holocaust testimonies or archival descriptions would make it easier to link this material to relevant identifiers in domain-specific controlled vocabularies, which semantically enriches it and makes it more discoverable. The xlm-roberta-large-ehri-ner-all model fine-tunes XLM-RoBERTa (XLM-R) for Holocaust-related Named Entity Recognition (NER) on the EHRI-NER dataset, a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for NER in Holocaust-related texts. The EHRI-NER dataset was built by aggregating all the annotated documents in the EHRI Online Editions and converting them to a format suitable for training NER models. Our experiments show that, despite the relatively small dataset, XLM-R fine-tuned on the multilingual annotations achieves an overall F1 score of 81.5% in a multilingual setup.
Model Description
- Developed by: Dermentzi, M. & Scheithauer, H.
- Funded by: European Commission call H2020-INFRAIA-2018–2020. Grant agreement ID 871111. DOI 10.3030/871111.
- Language(s) (NLP): The model was fine-tuned on cs, de, en, fr, hu, nl, pl, sk, yi data but it may work for more languages due to the use of a multilingual base model (XLM-R) with cross-lingual transfer capabilities.
- License: EUPL-1.2
- Finetuned from model: FacebookAI/xlm-roberta-large
Uses
This model was developed for research purposes in the context of the EHRI-3 project. Specifically, the aim was to determine whether a single model can be trained to recognize entities across different document types and languages in Holocaust-related texts. In a multilingual experiment setup, and despite our relatively small dataset, XLM-R fine-tuned on multilingual Holocaust-related annotations achieves an overall F1 score of 81.5%. We argue that this score is high enough to move to the next step towards deployment, namely gathering more feedback from the EHRI community.
Once we have a stable model that EHRI stakeholders are satisfied with, this model and its potential successors are intended to be used as part of an EHRI editorial pipeline. When text is entered into a tool that supports our model, potential named entities in the text will be automatically pre-annotated so that our intended users (researchers and professional archivists) can detect them faster and link them to the corresponding entries in the custom EHRI controlled vocabularies and authority sets.
This has the potential to facilitate metadata enrichment of descriptions in the EHRI Portal and enhance their discoverability. It would also make it easier for EHRI to develop new Online Editions and give archivists and researchers within the EHRI network new ways to organize, analyze, and present their materials and research data that would otherwise require a lot of manual work.
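
The pre-annotation step described above can be sketched roughly as follows. This is a minimal illustration, not the EHRI pipeline itself: it assumes the model is loaded through the Hugging Face `transformers` token-classification pipeline, and the helper names `load_tagger` and `mark_entities`, the bracket notation, the sample sentence, and the hand-written predictions are all hypothetical.

```python
from typing import Dict, List


def load_tagger(model_id: str = "ehri-ner/xlm-roberta-large-ehri-ner-all"):
    """Build a token-classification pipeline (downloads the weights on first use)."""
    # Imported lazily so mark_entities() works without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges sub-word pieces into whole entity spans.
    return pipeline("token-classification", model=model_id, aggregation_strategy="simple")


def mark_entities(text: str, entities: List[Dict]) -> str:
    """Wrap each predicted span as [surface form](LABEL) for an editor to review."""
    out = text
    # Work from the end of the string so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        start, end = ent["start"], ent["end"]
        out = f"{out[:start]}[{out[start:end]}]({ent['entity_group']}){out[end:]}"
    return out


# Hand-written predictions in the pipeline's output shape (NOT real model output).
sample = "She was deported from Prague to Theresienstadt."
fake_predictions = [
    {"entity_group": "LOC", "start": sample.index("Prague"), "end": sample.index("Prague") + 6},
    {"entity_group": "GHETTO", "start": sample.index("Theresienstadt"), "end": sample.index("Theresienstadt") + 14},
]
print(mark_entities(sample, fake_predictions))
```

With the model available, `mark_entities(sample, load_tagger()(sample))` would produce the same kind of marked-up text from real predictions, which an archivist could then accept, correct, and link to vocabulary entries.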
Limitations
The dataset used to fine-tune this model stems from a series of manually annotated digital scholarly editions, the EHRI Online Editions. These editions were not originally created to provide a dataset for training NER models, although we argue that they nevertheless constitute a high-quality resource that is suitable for this use. Users should still be mindful that our dataset repurposes a resource that was not built for this purpose.
The fine-tuned model occasionally misclassifies entity tokens as non-entity tokens, with I-GHETTO being the most frequently confused label. It also occasionally struggles to extract multi-token entities: continuation labels such as I-CAMP, I-LOC, and I-ORG are sometimes confused with the beginning of an entity. Moreover, it tends to misclassify B-GHETTO and B-CAMP as B-LOC, which is not surprising given that these categories are semantically close.
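
To make the multi-token issue concrete, the sketch below decodes B-/I- tag sequences into entity spans and shows how one continuation tag predicted as a beginning tag splits a single entity in two. The decoder `bio_to_spans` and the tag sequences are illustrative assumptions, not part of the model or its evaluation code.

```python
from typing import List, Optional, Tuple

Span = Tuple[str, int, int]  # (label, first token index, last token index + 1)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Group a sequence of B-/I-/O tags into labelled token spans."""
    spans: List[Span] = []
    label: Optional[str] = None
    start = 0
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel closes an open entity
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if label is not None and not continues:
            spans.append((label, start, i))
            label = None
        if prefix == "B" or (prefix == "I" and label is None):
            # A stray I- tag with no open entity is treated as a new entity here.
            label, start = name, i
    return spans


gold = ["O", "B-CAMP", "I-CAMP", "O"]       # one two-token CAMP entity
predicted = ["O", "B-CAMP", "B-CAMP", "O"]  # I-CAMP confused with B-CAMP

print(bio_to_spans(gold))       # [('CAMP', 1, 3)]
print(bio_to_spans(predicted))  # [('CAMP', 1, 2), ('CAMP', 2, 3)]
```

Under span-level scoring, the second prediction matches neither boundary of the gold entity exactly, so a single token-level slip can cost a whole entity.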
This model was envisioned to work as part of EHRI-related editorial and publishing pipelines and may not be suitable for the purposes of other users/organizations.