fl399 committed · Commit dc1cc61 · 1 Parent(s): 54d1db8

Update README.md

Files changed (1): README.md (+35, -0)
README.md CHANGED
@@ -13,6 +13,41 @@ datasets:
  ### SapBERT-PubMedBERT
  SapBERT by [Liu et al. (2020)](https://arxiv.org/pdf/2010.11784.pdf). Trained with [UMLS](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html) 2020AA (English only), using [microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext) as the base model. Please use the mean-pooling of the output as the representation.

+
+ #### Extracting embeddings from SapBERT
+
+ The following script converts a list of strings (entity names) into embeddings.
+ ```python
+ import numpy as np
+ import torch
+ from tqdm.auto import tqdm
+ from transformers import AutoTokenizer, AutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")
+ model = AutoModel.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext").cuda()
+
+ # replace with your own list of entity names
+ all_names = ["covid-19", "Coronavirus infection", "high fever", "Tumor of posterior wall of oropharynx"]
+
+ bs = 128  # batch size during inference
+ all_embs = []
+ for i in tqdm(np.arange(0, len(all_names), bs)):
+     toks = tokenizer.batch_encode_plus(all_names[i:i+bs],
+                                        padding="max_length",
+                                        max_length=25,
+                                        truncation=True,
+                                        return_tensors="pt")
+     toks_cuda = {}
+     for k, v in toks.items():
+         toks_cuda[k] = v.cuda()
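+     # the model's first output is the last hidden state, shaped (batch, seq_len, hidden);
+     # averaging over the sequence dimension (padding positions included) gives one vector per name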
+     cls_rep = model(**toks_cuda)[0].mean(1)  # use the mean-pooled representation as the embedding
+     all_embs.append(cls_rep.cpu().detach().numpy())
+
+ all_embs = np.concatenate(all_embs, axis=0)
+ ```
+
+ For more details about training and evaluation, see the SapBERT [GitHub repo](https://github.com/cambridgeltl/sapbert).
+
  ### Citation
  ```bibtex
  @inproceedings{liu-etal-2021-self,
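
Once `all_embs` is built, a typical use is linking a new surface form to its nearest entity name. Below is a minimal sketch of such a lookup, assuming the script in the diff above has already been run (`tokenizer`, `model`, `all_names`, and `all_embs` in scope); the query string and the cosine-similarity scoring are illustrative choices, not prescribed by the README.

```python
import numpy as np
import torch

query = "coronavirus disease 2019"  # illustrative query mention

toks = tokenizer.batch_encode_plus([query], padding="max_length", max_length=25,
                                   truncation=True, return_tensors="pt")
with torch.no_grad():
    # same mean-pooled representation as in the script above
    query_emb = model(**{k: v.cuda() for k, v in toks.items()})[0].mean(1).cpu().numpy()

# cosine similarity = dot product of L2-normalised vectors
q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
n = all_embs / np.linalg.norm(all_embs, axis=1, keepdims=True)
scores = (q @ n.T).ravel()
best = int(scores.argmax())
print(all_names[best], scores[best])  # nearest entity name, likely "covid-19"
```

At realistic ontology scale (millions of names), the dense matrix product would be replaced by an approximate nearest-neighbour index; for a short list like this, plain NumPy is enough.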