---
license: gpl-3.0
language:
- nl
pipeline_tag: token-classification
tags:
- medical
---

# MedRoBERTa.nl finetuned for experiencer

## Description
This model is a finetuned RoBERTa-based model pre-trained from scratch
on Dutch hospital notes sourced from Electronic Health Records.
All code used for the creation of MedRoBERTa.nl
can be found at https://github.com/cltl-students/verkijk_stella_rma_thesis_dutch_medical_language_model.
The publication associated with the negation detection task can be found at https://arxiv.org/abs/2209.00470.
The code for finetuning the model can be found at https://github.com/umcu/negation-detection.
20 |
+
|
21 |
+
|
22 |
+
## Minimal example
|
23 |
+
|
24 |
+
```python
|
25 |
+
tokenizer = AutoTokenizer\
|
26 |
+
.from_pretrained("UMCU/MedRoBERTa.nl_Experiencer")
|
27 |
+
model = AutoModelForTokenClassification\
|
28 |
+
.from_pretrained("UMCU/MedRoBERTa.nl_Experiencer")
|
29 |
+
|
30 |
+
some_text = "De patient was niet aanspreekbaar en hij zag er grauw uit. \
|
31 |
+
Hij heeft de inspanningstest echter goed doorstaan. \
|
32 |
+
De broer heeft onlangs een operatie ondergaan."
|
33 |
+
|
34 |
+
inputs = tokenizer(some_text, return_tensors='pt')
|
35 |
+
output = model.forward(inputs)
|
36 |
+
probas = torch.nn.functional.softmax(output.logits[0]).detach().numpy()
|
37 |
+
|
38 |
+
# associate with tokens
|
39 |
+
input_tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
|
40 |
+
target_map = {0: 'B-Patient', 1:'B-Other',2:'I-Patient',3:'I-Other'}
|
41 |
+
results = [{'token': input_tokens[idx],
|
42 |
+
'proba_patient': proba_arr[0]+proba_arr[2],
|
43 |
+
'proba_other': proba_arr[1]+proba_arr[3]
|
44 |
+
}
|
45 |
+
for idx,proba_arr in enumerate(probas)]
|
46 |
+
|
47 |
+
```
|
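To turn per-token probabilities into discrete labels, one can take the argmax over each probability row and look the index up in `target_map`. A minimal stdlib-only sketch, using dummy probability rows in place of the model output:

```python
# Sketch: map per-token class probabilities to IOB labels via argmax.
# The probability rows below are dummies standing in for the model's `probas`.
target_map = {0: 'B-Patient', 1: 'B-Other', 2: 'I-Patient', 3: 'I-Other'}
probas = [[0.7, 0.1, 0.1, 0.1],   # most mass on class 0 -> 'B-Patient'
          [0.1, 0.1, 0.7, 0.1]]   # most mass on class 2 -> 'I-Patient'
labels = [target_map[max(range(len(row)), key=row.__getitem__)] for row in probas]
print(labels)  # ['B-Patient', 'I-Patient']
```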

The medical entity classifiers are (being) integrated into the open-source library [clinlp](https://github.com/umcu/clinlp); feel free to contact
us for access, either through Huggingface or through git.

Note that we assume the [Inside-Outside-Beginning](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) (IOB) tagging format.
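In the IOB scheme, a `B-` tag opens an entity span and subsequent `I-` tags of the same type extend it. A minimal illustration of how predicted tags group tokens into spans (`iob_to_spans` and the tag sequences are illustrative helpers, not part of this repository):

```python
def iob_to_spans(tokens, tags):
    """Group (token, IOB-tag) pairs into (entity_type, text) spans."""
    spans = []
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            spans.append((tag[2:], [token]))          # start a new span
        elif tag.startswith('I-') and spans and spans[-1][0] == tag[2:]:
            spans[-1][1].append(token)                # extend the open span
        # 'O' tokens fall outside any entity and are skipped
    return [(etype, ' '.join(toks)) for etype, toks in spans]

tokens = ['De', 'broer', 'heeft', 'een', 'operatie', 'ondergaan']
tags = ['B-Other', 'I-Other', 'O', 'O', 'O', 'O']
print(iob_to_spans(tokens, tags))  # [('Other', 'De broer')]
```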
53 |
+
|
54 |
+
## Intended use
|
55 |
+
The model is finetuned for experiencer detection on Dutch clinical text.
|
56 |
+
Since it is a domain-specific model trained on medical data,
|
57 |
+
it is meant to be used on medical NLP tasks for Dutch.
|
58 |
+
This particular model is trained on a 64-max token windows surrounding the concept-to-be negated.
|
59 |
+

## Data
The pre-trained model was trained on nearly 10 million hospital notes from the Amsterdam University Medical Centres.
The training data was anonymized before starting the pre-training procedure.

The finetuning was performed on the Erasmus Dutch Clinical Corpus (EDCC), which was synthetically upsampled for the minority classes.
The EDCC can be obtained through Jan Kors ([email protected]).
The EDCC is described here: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-014-0373-3
67 |
+
|
68 |
+
## Authors
|
69 |
+
|
70 |
+
MedRoBERTa.nl: Stella Verkijk, Piek Vossen,
|
71 |
+
Finetuning: Bram van Es
|
72 |
+
|
73 |
+
## Contact
|
74 |
+
|
75 |
+
If you are having problems with this model please add an issue on our git: https://github.com/umcu/negation-detection/issues
|

## Usage

If you use the model in your work, please cite the following paper: https://doi.org/10.1186/s12859-022-05130-x

## References
Paper: Verkijk, S. & Vossen, P. (2022). MedRoBERTa.nl: A Language Model for Dutch Electronic Health Records. Computational Linguistics in the Netherlands Journal, 11.

Paper: van Es, B., Reteig, L.C., Tan, S.C., Schraagen, M., Hemker, M.M., Arends, S.R.S., Rios, M.A.R., & Haitjema, S. (2022). Negation detection in Dutch clinical texts: an evaluation of rule-based and machine learning methods. arXiv.