xyla committed
Commit 71b3e44 · 1 Parent(s): e7a2ded

Update README.md

Files changed (1): README.md (+21 -8)
README.md CHANGED

# Clinical-T5 Models
We train four different T5 variants on the union of MIMIC-III and MIMIC-IV: (1) initialized from T5-Base, (2) initialized from SciFive-Base, (3) T5-Base trained from scratch, and (4) T5-Large trained from scratch.

This particular model card describes the T5-Large model trained from scratch on MIMIC notes.

# Model Pretraining
In this section, we describe the pretraining procedure.

### Pretraining Data
We train on the union of MIMIC-III and MIMIC-IV. MIMIC-III contains a wide variety of note types, whereas MIMIC-IV contains only radiology reports and discharge summaries. After removing duplicate notes, this results in ~1.2B words.
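For concreteness, here is a rough sketch of assembling that union, assuming local copies of MIMIC-III's `NOTEEVENTS.csv` and the MIMIC-IV note module's `discharge.csv.gz` and `radiology.csv.gz`; the paths and column names follow the public schemas but are illustrative, not the authors' actual pipeline:
```
import pandas as pd

# MIMIC-III: one table spanning many note categories.
mimic3 = pd.read_csv("NOTEEVENTS.csv", usecols=["TEXT"])["TEXT"]
# MIMIC-IV note module: discharge summaries and radiology reports only.
mimic4_discharge = pd.read_csv("discharge.csv.gz", usecols=["text"])["text"]
mimic4_radiology = pd.read_csv("radiology.csv.gz", usecols=["text"])["text"]

notes = pd.concat([mimic3, mimic4_discharge, mimic4_radiology], ignore_index=True)
notes = notes.drop_duplicates()  # drop exact duplicate notes

# The authors report ~1.2B words after deduplication.
print(f"{notes.str.split().str.len().sum():,} words")
```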
 
### Note Preprocessing
We perform two important preprocessing steps (a minimal sketch of both follows the list):
* We replace all DEID tags with special tokens. For example, `"The patient, [**First Name 123**], has a history of high blood pressure"` becomes `"The patient, [NAME], has a history of high blood pressure"`.
* We remove duplicate notes based on edit times. Roughly ~300M of the ~800M words from MIMIC-III are repeats of the same note with only a few words changed. This happens because a nurse might save a note and then edit it 10 minutes later; both versions would appear in the raw data.
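Here is a minimal sketch of both steps. The tag-to-token mapping and the `note_id`/`chart_time` field names are hypothetical; the authors' actual pipeline may differ:
```
import re

# Illustrative mapping from DEID tag prefixes to special tokens
# (assumed for this sketch, not the authors' exact table).
DEID_MAP = {"First Name": "[NAME]", "Last Name": "[NAME]", "Hospital": "[HOSPITAL]"}

def replace_deid_tags(text):
    """Replace MIMIC-style [** ... **] DEID tags with special tokens."""
    def sub(match):
        inner = match.group(1).strip()
        for prefix, token in DEID_MAP.items():
            if inner.startswith(prefix):
                return token
        return "[OTHER]"  # fallback for tags not in the mapping
    return re.sub(r"\[\*\*(.*?)\*\*\]", sub, text)

def dedupe_by_edit_time(notes):
    """Keep only the most recently edited version of each note.

    `notes` is an iterable of dicts with hypothetical keys
    `note_id` and `chart_time`.
    """
    latest = {}
    for note in notes:
        prev = latest.get(note["note_id"])
        if prev is None or note["chart_time"] > prev["chart_time"]:
            latest[note["note_id"]] = note
    return list(latest.values())

print(replace_deid_tags("The patient, [**First Name 123**], has a history of high blood pressure"))
# -> The patient, [NAME], has a history of high blood pressure
```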

### Pretraining Procedures
We train the Clinical-T5-Large model from scratch using a cased vocabulary of 32,000 tokens. We train for 780,000 steps, using a batch size of 12 sequences per TPU pod (8 pods total) and a sequence length of 512. This results in a batch size of 49,152 tokens per step; accounting for the number of steps, the model sees roughly 38B tokens over the course of pretraining.
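The arithmetic behind those figures, for clarity:
```
sequences_per_step = 12 * 8                  # batch of 12 per pod, 8 TPU pods
tokens_per_step = sequences_per_step * 512   # sequence length of 512
total_tokens = tokens_per_step * 780_000     # number of pretraining steps
print(tokens_per_step)  # 49152
print(total_tokens)     # 38338560000, i.e. ~38B
```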

# How to use the Model
You will first need credentialed PhysioNet access to use the model. Why? There is reasonable evidence that these models leak the notes they were trained on, especially the larger ones. Releasing a model that leaks these notes would be a data-use agreement violation. To get PhysioNet access, you must pass the CITI training.
Once you have PhysioNet access, download the model files by running:
```
wget -r -N -c -np --user "INSERT_USER" --ask-password https://physionet.org/files/clinical-t5/1.0.0/
```
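Here, `-r` downloads recursively, `-N` skips files that are already up to date, `-c` resumes partial downloads, and `-np` keeps wget from ascending into the parent directory.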

Then, you can load the model and tokenizer:
```
from transformers import AutoTokenizer, AutoModel

path_to_model_folder = "INSERT_PATH_TO_MODEL_FOLDER"  # local folder downloaded above
tokenizer = AutoTokenizer.from_pretrained(path_to_model_folder)
model = AutoModel.from_pretrained(path_to_model_folder)
```
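As a quick sanity check, you can run a span-infilling generation. This is a sketch that assumes the from-scratch vocabulary keeps T5's usual `<extra_id_0>` sentinel tokens:
```
from transformers import AutoTokenizer, T5ForConditionalGeneration

path_to_model_folder = "INSERT_PATH_TO_MODEL_FOLDER"
tokenizer = AutoTokenizer.from_pretrained(path_to_model_folder)
# T5ForConditionalGeneration adds the LM head needed for generation.
model = T5ForConditionalGeneration.from_pretrained(path_to_model_folder)

# T5 is pretrained with span corruption, so it can fill in a masked span.
inputs = tokenizer("The patient has a history of <extra_id_0> blood pressure.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```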

# Questions?
If you have any questions about using the models, please email [email protected].