---
license: gpl-3.0
language:
- nl
pipeline_tag: token-classification
tags:
- medical
---
# MedRoBERTa.nl finetuned for experiencer
## Description
This model is a RoBERTa-based model pre-trained from scratch on Dutch hospital notes sourced from Electronic Health Records, and finetuned for experiencer detection. All code used for the creation of MedRoBERTa.nl can be found at https://github.com/cltl-students/verkijk_stella_rma_thesis_dutch_medical_language_model. The publication associated with the negation detection task can be found at https://arxiv.org/abs/2209.00470. The code for finetuning the model can be found at https://github.com/umcu/negation-detection.
## Minimal example
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("UMCU/MedRoBERTa.nl_Experiencer")
model = AutoModelForTokenClassification.from_pretrained("UMCU/MedRoBERTa.nl_Experiencer")

# "The patient was unresponsive and he looked ashen.
#  However, he withstood the exercise test well.
#  The brother recently underwent surgery."
some_text = "De patient was niet aanspreekbaar en hij zag er grauw uit. \
Hij heeft de inspanningstest echter goed doorstaan. \
De broer heeft onlangs een operatie ondergaan."

inputs = tokenizer(some_text, return_tensors='pt')
with torch.no_grad():
    output = model(**inputs)
probas = torch.nn.functional.softmax(output.logits[0], dim=-1).numpy()

# associate the probabilities and predicted labels with the input tokens
input_tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
target_map = {0: 'B-Patient', 1: 'B-Other', 2: 'I-Patient', 3: 'I-Other'}
results = [{'token': input_tokens[idx],
            'label': target_map[proba_arr.argmax()],
            'proba_patient': proba_arr[0] + proba_arr[2],
            'proba_other': proba_arr[1] + proba_arr[3]}
           for idx, proba_arr in enumerate(probas)]
```
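To turn these scores into a readable overview, the per-token predictions can be printed directly (a minimal sketch building on the variables defined above):

```python
# print each subword token with its predicted label and aggregated probabilities
for res in results:
    print(f"{res['token']:<15} {res['label']:<10} "
          f"patient={res['proba_patient']:.2f} other={res['proba_other']:.2f}")
```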
The medical entity classifiers are (being) integrated into the open-source library clinlp; feel free to contact us for access, either through Hugging Face or through GitHub.
Note that the labels follow the Inside-Outside-Beginning (IOB) format.
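As a hypothetical illustration (the actual tags depend on the tokenizer's subword splits), the entity spans in the example sentence above would be tagged along these lines:

```
De patient → B-Patient I-Patient
hij        → B-Patient
De broer   → B-Other   I-Other
```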
## Intended use
The model is finetuned for experiencer detection on Dutch clinical text. Since it is a domain-specific model trained on medical data, it is meant to be used on medical NLP tasks for Dutch. This particular model was trained on windows of at most 64 tokens surrounding the concept to be labeled.
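Because of this, longer documents are best split into overlapping windows before inference. Below is a minimal sketch using the tokenizer's built-in overflow handling; the `stride` value and the `long_text` variable are illustrative assumptions, not settings from the original training pipeline:

```python
# split a long clinical note into overlapping 64-token windows
enc = tokenizer(long_text,
                max_length=64,
                truncation=True,
                stride=16,                       # token overlap between windows
                return_overflowing_tokens=True,
                padding=True,
                return_tensors='pt')

with torch.no_grad():
    # logits: (n_windows, window_length, n_labels)
    logits = model(input_ids=enc['input_ids'],
                   attention_mask=enc['attention_mask']).logits
```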
## Data
The base model was pre-trained on nearly 10 million hospital notes from the Amsterdam University Medical Centres. The training data was anonymized before the pre-training procedure started.
The finetuning was performed on the Erasmus Dutch Clinical Corpus (EDCC), which was synthetically upsampled for the minority classes. The EDCC can be obtained through Jan Kors ([email protected]), and is described here: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-014-0373-3
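The exact upsampling procedure is not detailed here; as a generic stand-in (not the procedure used for the EDCC), simple random oversampling of minority-class examples could look like this:

```python
import random
from collections import Counter

def random_oversample(examples, labels, seed=42):
    """Duplicate minority-class examples until every class matches the majority."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    original = list(zip(examples, labels))
    balanced = list(original)
    for cls, n in counts.items():
        pool = [pair for pair in original if pair[1] == cls]
        balanced += [rng.choice(pool) for _ in range(target - n)]
    return balanced
```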
## Authors
MedRoBERTa.nl: Stella Verkijk, Piek Vossen. Finetuning: Bram van Es.
## Contact
If you are having problems with this model, please open an issue on our GitHub repository: https://github.com/umcu/negation-detection/issues
## Usage
If you use the model in your work, please cite the following publication: https://doi.org/10.1186/s12859-022-05130-x
## References
Paper: Verkijk, S., & Vossen, P. (2022). MedRoBERTa.nl: A Language Model for Dutch Electronic Health Records. Computational Linguistics in the Netherlands Journal, 11.

Paper: van Es, B., Reteig, L.C., Tan, S.C., Schraagen, M., Hemker, M.M., Arends, S.R.S., Rios, M.A.R., & Haitjema, S. (2022). Negation detection in Dutch clinical texts: an evaluation of rule-based and machine learning methods. arXiv.