Update README.md

tags:
- clinical
thumbnail: https://core.app.datexis.com/static/paper.png
pipeline_tag: text-classification
widget:
- text: "Patient with hypertension presents to ICU."
---

# CORe Model - Clinical Diagnosis Prediction

## Model description

The CORe (_Clinical Outcome Representations_) model is introduced in the paper [Clinical Outcome Predictions from Admission Notes using Self-Supervised Knowledge Integration](https://www.aclweb.org/anthology/2021.eacl-main.75.pdf).
It is based on BioBERT and further pre-trained on clinical notes, disease descriptions and medical articles with a specialised _Clinical Outcome Pre-Training_ objective.

This model checkpoint is **fine-tuned on the task of diagnosis prediction**.
The model expects patient admission notes as input and outputs multi-label ICD9-code predictions.

#### Model Predictions

The model makes predictions on a total of 9237 labels. These comprise 3- and 4-digit ICD9 codes as well as textual descriptions of these codes. The 4-digit codes and textual descriptions help to incorporate further topical and hierarchical information into the model during training (see Section 4.2, _ICD+: Incorporation of ICD Hierarchy_, in our paper). We recommend using only the **3-digit code predictions** at inference time, because only those have been evaluated in our work.
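
As an illustration of this recommendation, the following minimal sketch filters a list of predicted label strings down to 3-digit codes. It assumes that 3-digit codes appear as bare code strings such as "401" (with a leading V or E for supplementary and external-cause codes); this label format is an assumption made for illustration, not a documented property of the label set.

```
import re

# Sketch: keep only labels that look like bare 3-digit ICD9 codes
# (e.g. "401" or "V10"); 4-digit codes and textual descriptions are
# dropped. The assumed label format is illustrative, not documented.
def is_three_digit_code(label):
    return re.fullmatch(r"\d{3}|V\d{2}|E\d{3}", label) is not None

print([l for l in ["401", "4011", "hypertension"] if is_three_digit_code(l)])
# -> ['401']
```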

#### How to use CORe Diagnosis Prediction

You can load the model via the transformers library:
```
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("bvanaken/CORe-clinical-diagnosis-prediction")
model = AutoModelForSequenceClassification.from_pretrained("bvanaken/CORe-clinical-diagnosis-prediction")
```

The following code shows an inference example:

```
import torch

input = "CHIEF COMPLAINT: Headaches\n\nPRESENT ILLNESS: 58yo man w/ hx of hypertension, AFib on coumadin presented to ED with the worst headache of his life."

tokenized_input = tokenizer(input, return_tensors="pt")
output = model(**tokenized_input)

# multi-label task: apply a sigmoid to each logit independently
predictions = torch.sigmoid(output.logits)
# collect all labels whose score exceeds the threshold (here 0.3)
predicted_labels = [model.config.id2label[_id] for _id in (predictions > 0.3).nonzero()[:, 1].tolist()]
```
Note: For the best performance, we recommend determining the thresholds (0.3 in this example) individually per label.
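
As a minimal sketch of what per-label thresholds could look like, the following replaces the global 0.3 cutoff with a per-label lookup. The threshold values shown are hypothetical placeholders, to be tuned per label on a validation set rather than taken from our work.

```
# Hypothetical per-label thresholds; tune these on a validation set.
default_threshold = 0.5
label_thresholds = {"401": 0.3, "428": 0.25}  # placeholder values

scores = predictions[0].tolist()
predicted_labels = [
    model.config.id2label[i]
    for i, score in enumerate(scores)
    if score > label_thresholds.get(model.config.id2label[i], default_threshold)
]
```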

### More Information