xyla committed
Commit 71b3e44 · 1 Parent(s): e7a2ded

Update README.md

Files changed (1): README.md (+21 -8)
README.md CHANGED

# Clinical-T5 Models
We train four different T5 variants on the union of MIMIC-III and MIMIC-IV: (1) initialized from T5-Base, (2) initialized from SciFive-Base, (3) T5-Base trained from scratch, and (4) T5-Large trained from scratch.

This particular model card describes the T5-Large model trained from scratch on MIMIC notes.

# Model Pretraining
In this section, we describe the pretraining procedure.

### Pretraining Data
We train on the union of MIMIC-III and MIMIC-IV. MIMIC-III contains a wide variety of note types, whereas MIMIC-IV contains only radiology reports and discharge summaries. After removing duplicate notes, this results in ~1.2B words.
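For concreteness, here is a rough sketch of assembling that union, assuming local copies of MIMIC-III's `NOTEEVENTS.csv` and the MIMIC-IV note module's `discharge.csv.gz` and `radiology.csv.gz`; the paths and column names follow the public schemas but are illustrative, not the authors' actual pipeline:
```
import pandas as pd

# MIMIC-III: one table spanning many note categories.
mimic3 = pd.read_csv("NOTEEVENTS.csv", usecols=["TEXT"])["TEXT"]
# MIMIC-IV note module: discharge summaries and radiology reports only.
mimic4_discharge = pd.read_csv("discharge.csv.gz", usecols=["text"])["text"]
mimic4_radiology = pd.read_csv("radiology.csv.gz", usecols=["text"])["text"]

notes = pd.concat([mimic3, mimic4_discharge, mimic4_radiology], ignore_index=True)
notes = notes.drop_duplicates()  # drop exact duplicate notes

# The authors report ~1.2B words after deduplication.
print(f"{notes.str.split().str.len().sum():,} words")
```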
 
### Note Preprocessing
We perform two important preprocessing steps (a minimal sketch of both follows the list):
* We replace all DEID tags with special tokens. For example, `"The patient, [**First Name 123**], has a history of high blood pressure"` becomes `"The patient, [NAME], has a history of high blood pressure"`.
* We remove duplicate notes based on edit times. Roughly ~300M of the ~800M words from MIMIC-III are repeats of the same note with only a few words changed. This happens because a nurse might save a note and then edit it 10 minutes later; both versions would appear in the raw data.
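Here is a minimal sketch of both steps. The tag-to-token mapping and the `note_id`/`chart_time` field names are hypothetical; the authors' actual pipeline may differ:
```
import re

# Illustrative mapping from DEID tag prefixes to special tokens
# (assumed for this sketch, not the authors' exact table).
DEID_MAP = {"First Name": "[NAME]", "Last Name": "[NAME]", "Hospital": "[HOSPITAL]"}

def replace_deid_tags(text):
    """Replace MIMIC-style [** ... **] DEID tags with special tokens."""
    def sub(match):
        inner = match.group(1).strip()
        for prefix, token in DEID_MAP.items():
            if inner.startswith(prefix):
                return token
        return "[OTHER]"  # fallback for tags not in the mapping
    return re.sub(r"\[\*\*(.*?)\*\*\]", sub, text)

def dedupe_by_edit_time(notes):
    """Keep only the most recently edited version of each note.

    `notes` is an iterable of dicts with hypothetical keys
    `note_id` and `chart_time`.
    """
    latest = {}
    for note in notes:
        prev = latest.get(note["note_id"])
        if prev is None or note["chart_time"] > prev["chart_time"]:
            latest[note["note_id"]] = note
    return list(latest.values())

print(replace_deid_tags("The patient, [**First Name 123**], has a history of high blood pressure"))
# -> The patient, [NAME], has a history of high blood pressure
```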

### Pretraining Procedures
We train the Clinical-T5-Large model from scratch using a cased vocabulary of 32,000 tokens. We train for 780,000 steps, using a batch size of 12 sequences per TPU pod (8 pods total) and a sequence length of 512. This results in a batch size of 49,152 tokens per step; accounting for the number of steps, the model sees roughly 38B tokens over the course of pretraining.
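The arithmetic behind those figures, for clarity:
```
sequences_per_step = 12 * 8                  # batch of 12 per pod, 8 TPU pods
tokens_per_step = sequences_per_step * 512   # sequence length of 512
total_tokens = tokens_per_step * 780_000     # number of pretraining steps
print(tokens_per_step)  # 49152
print(total_tokens)     # 38338560000, i.e. ~38B
```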

# How to use the Model
You will first need credentialed PhysioNet access to use the model. Why? There is reasonable evidence that these models leak the notes they were trained on, especially the larger ones. Releasing a model that leaks these notes would be a data-use agreement violation. To get PhysioNet access, you must pass the CITI training.
Once you have PhysioNet access, download the model files by running:
```
wget -r -N -c -np --user "INSERT_USER" --ask-password https://physionet.org/files/clinical-t5/1.0.0/
```
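Here, `-r` downloads recursively, `-N` skips files that are already up to date, `-c` resumes partial downloads, and `-np` keeps wget from ascending into the parent directory.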

Then, you can load the model and tokenizer:
```
from transformers import AutoTokenizer, AutoModel

path_to_model_folder = "INSERT_PATH_TO_MODEL_FOLDER"  # local folder downloaded above
tokenizer = AutoTokenizer.from_pretrained(path_to_model_folder)
model = AutoModel.from_pretrained(path_to_model_folder)
```
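As a quick sanity check, you can run a span-infilling generation. This is a sketch that assumes the from-scratch vocabulary keeps T5's usual `<extra_id_0>` sentinel tokens:
```
from transformers import AutoTokenizer, T5ForConditionalGeneration

path_to_model_folder = "INSERT_PATH_TO_MODEL_FOLDER"
tokenizer = AutoTokenizer.from_pretrained(path_to_model_folder)
# T5ForConditionalGeneration adds the LM head needed for generation.
model = T5ForConditionalGeneration.from_pretrained(path_to_model_folder)

# T5 is pretrained with span corruption, so it can fill in a masked span.
inputs = tokenizer("The patient has a history of <extra_id_0> blood pressure.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```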

# Questions?
If you have any questions about using the models, please email [email protected].