# Clinical-T5 Models
We train four different T5 variants on the union of MIMIC-III and MIMIC-IV: (1) Initialized from T5-Base,
(2) Initialized from SciFive-Base, (3) T5-Base initialized from scratch, and (4) T5-Large initialized from scratch.

This particular model card describes the T5-Large model trained from scratch on MIMIC notes.

# Model Pretraining
In this section, we will describe the pretraining procedure.

### Pretraining Data
We train on the union of MIMIC-III and MIMIC-IV. MIMIC-III contains a wide variety of note types, whereas MIMIC-IV contains only radiology reports and discharge summaries. We remove duplicate notes. This results in ~1.2B words.

### Note Preprocessing
We apply two important preprocessing steps:
* We replace all DEID tags with special tokens (see the sketch after this list). For example, `"The patient, [**First Name 123**], has a history of high blood pressure"` is replaced with `"The patient, [NAME], has a history of high blood pressure"`.
* We remove any duplicate notes based on edit times. Roughly 300M of the ~800M words from MIMIC-III are repeats of the same note with only a few words changed! This happens because a nurse might save a note and then edit it 10 minutes later; both versions would appear.
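
A minimal sketch of the DEID substitution from the first bullet is shown below. Only the `[NAME]` mapping comes from the example above; the other special tokens and the regular expressions are illustrative assumptions, not the exact rules used for the released models.

```
import re

# Illustrative sketch of the DEID-tag substitution described above.
# Only the [NAME] mapping is taken from the example in this card; the other
# special tokens and patterns are assumptions for illustration.
DEID_PATTERNS = [
    (re.compile(r"\[\*\*[^\]]*Name[^\]]*\*\*\]"), "[NAME]"),
    (re.compile(r"\[\*\*[^\]]*Hospital[^\]]*\*\*\]"), "[HOSPITAL]"),
    (re.compile(r"\[\*\*[^\]]*\d{4}-\d{1,2}-\d{1,2}[^\]]*\*\*\]"), "[DATE]"),
    (re.compile(r"\[\*\*[^\]]*\*\*\]"), "[DEID]"),  # fallback for any other tag
]

def replace_deid_tags(text):
    for pattern, token in DEID_PATTERNS:
        text = pattern.sub(token, text)
    return text

print(replace_deid_tags("The patient, [**First Name 123**], has a history of high blood pressure"))
# -> The patient, [NAME], has a history of high blood pressure
```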

### Pretraining Procedures
We train the Clinical-T5-Large model from scratch using a cased vocabulary of 32,000. We train it for 780,000 steps, using a batch size of 12 per TPU pod (8 pods total) and a sequence length of 512.
This results in a batch size of 49,152 tokens per step. Accounting for the number of steps, the model sees roughly 38B tokens over the course of pretraining.
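
For reference, here is a minimal sketch of how those figures combine; the 49,152 and ~38B numbers follow directly from the settings quoted above:

```
# Back-of-the-envelope arithmetic for the pretraining settings above.
sequences_per_step = 12 * 8                   # batch size 12 per TPU pod, 8 pods
tokens_per_step = sequences_per_step * 512    # sequence length of 512 -> 49,152 tokens
total_tokens = tokens_per_step * 780_000      # 780,000 steps -> ~38.3B tokens
print(sequences_per_step, tokens_per_step, total_tokens)  # 96 49152 38338560000
```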

# How to use the Model
You will first need credentialed PhysioNet access to use the model. Why? There is reasonable evidence that these models leak their training data, especially the larger ones. Releasing a model that leaks these notes would be a data-use agreement violation. To get PhysioNet access, you must pass the CITI training.

Once you have PhysioNet access, you can download the model as follows:
```
wget -r -N -c -np --user "INSERT_USER" --ask-password https://physionet.org/files/clinical-t5/1.0.0/
```

Then, you can load the model + tokenizer:
```
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("PATH_TO_MODEL_FOLDER")
model = AutoModel.from_pretrained("PATH_TO_MODEL_FOLDER")
```
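
As a quick sanity check, here is a minimal sketch of running the encoder on a clinical sentence. It assumes `AutoModel` resolves to a `T5Model` for this checkpoint, so only the encoder stack is called:

```
import torch

# Encode one sentence and inspect the encoder's hidden states.
inputs = tokenizer("The patient has a history of hypertension.", return_tensors="pt")
with torch.no_grad():
    encoder_outputs = model.encoder(**inputs)
print(encoder_outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```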

# Questions?
If you have any questions about using the models, please email [email protected].