# DeCLUTR-sci-base

## Model description

This is the [allenai/scibert_scivocab_uncased](https://huggingface.co/allenai/scibert_scivocab_uncased) model, with extended pretraining on 2.5 million scientific papers from [S2ORC](https://github.com/allenai/s2orc/) using the self-supervised training strategy presented in [DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations](https://arxiv.org/abs/2006.03659).

## Intended uses & limitations

The model is intended to be used as a sentence encoder, similar to [Google's Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder/4) or [Sentence Transformers](https://github.com/UKPLab/sentence-transformers). It is particularly suitable for scientific text.

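If you prefer the Sentence Transformers library, the checkpoint can likely be loaded with it directly. This is a minimal sketch, assuming a recent `sentence-transformers` release (the library wraps plain Transformers checkpoints with a mean-pooling layer by default, which matches the pooling used in the example below):

```python
from sentence_transformers import SentenceTransformer

# Plain Transformers checkpoint: sentence-transformers adds a
# mean-pooling layer on top by default (it will log a warning).
model = SentenceTransformer("johngiorgi/declutr-sci-base")
embeddings = model.encode(["Oncogenic KRAS mutations are common in lung cancer."])
```
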
#### How to use

Please see [our repo](https://github.com/JohnGiorgi/DeCLUTR) for full details. A simple example is shown below.

```python
import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Load the model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-sci-base")
model = AutoModel.from_pretrained("johngiorgi/declutr-sci-base")

# Prepare some text to embed
text = [
    "A smiling costumed woman is holding an umbrella.",
    "A happy woman in a fairy costume holds an umbrella.",
]
inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")

# Embed the text
with torch.no_grad():
    sequence_output = model(**inputs)[0]  # (batch_size, seq_len, hidden_size)

# Mean pool the token-level embeddings to get sentence-level embeddings,
# using the attention mask to ignore padding tokens
embeddings = torch.sum(
    sequence_output * inputs["attention_mask"].unsqueeze(-1), dim=1
) / torch.clamp(torch.sum(inputs["attention_mask"], dim=1, keepdim=True), min=1e-9)

# Compute the semantic similarity as 1 - cosine distance
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])
```

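For more than two sentences, the pooled embeddings can be compared all at once. This is a minimal sketch building on the `embeddings` tensor from the snippet above:

```python
import torch.nn.functional as F

# L2-normalize the sentence embeddings; a single matrix product then
# yields all pairwise cosine similarities (shape: [num_texts, num_texts]).
normalized = F.normalize(embeddings, p=2, dim=1)
similarity_matrix = normalized @ normalized.T
```
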
### BibTeX entry and citation info

```bibtex
@article{Giorgi2020DeCLUTRDC,
  title={DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations},
  author={John M Giorgi and Osvald Nitski and Gary D. Bader and Bo Wang},
  journal={ArXiv},
  year={2020},
  volume={abs/2006.03659}
}
```