YuxinJiang committed
Commit 78131e0
1 Parent(s): 2181da0

Update README.md

Files changed (1)
  1. README.md +81 -93
README.md CHANGED
@@ -1,13 +1,37 @@
- ---
- license: mit
- ---
  # PromCSE: Improved Universal Sentence Embeddings with Prompt-based Contrastive Learning and Energy-based Learning
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1lanXViJzbmGM1bwm8AflNUKmrvDidg_3?usp=sharing)

- arXiv link: https://arxiv.org/abs/2203.06875v2
- Published in [**EMNLP 2022**](https://2022.emnlp.org/)

- Our code is modified from [SimCSE](https://github.com/princeton-nlp/SimCSE) and [P-tuning v2](https://github.com/THUDM/P-tuning-v2/), and we sincerely thank them for their excellent work. Our model achieves results comparable to [PromptBERT](https://github.com/kongds/Prompt-BERT) **without manually designed discrete prompts**.

  We have released our supervised and unsupervised models on huggingface, which achieve **Top 1** results on 1 domain-shifted STS task and 4 standard STS tasks:

@@ -27,31 +51,73 @@ We have released our supervised and unsupervised models on huggingface, which ac

  <!-- <img src="https://github.com/YJiangcm/DCPCSE/blob/master/figure/leaderboard.png" width="700" height="380"> -->

  | Model | STS12 | STS13 | STS14 | STS15 | STS16 | STS-B | SICK-R | Avg. |
  |:-----------------------:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
- | unsup-PromCSE-BERT-base ([huggingface](https://huggingface.co/YuxinJiang/unsup-promcse-bert-base-uncased)) | 73.03 | 85.18 | 76.70 | 84.19 | 79.69 | 80.62 | 70.00 | 78.49 |
- | sup-PromCSE-RoBERTa-base ([huggingface](https://huggingface.co/YuxinJiang/sup-promcse-roberta-base)) | 76.75 | 85.86 | 80.98 | 86.51 | 83.51 | 86.58 | 80.41 | 82.94 |
- | sup-PromCSE-RoBERTa-large ([huggingface](https://huggingface.co/YuxinJiang/sup-promcse-roberta-large)) | 79.14 | 88.64 | 83.73 | 87.33 | 84.57 | 87.84 | 82.07 | 84.76 |

- If you have any questions, feel free to raise an issue.

- ## Setups

- [![Python](https://img.shields.io/badge/python-3.8.2-blue?logo=python&logoColor=FED643)](https://www.python.org/downloads/release/python-382/)
- [![Pytorch](https://img.shields.io/badge/pytorch-1.7.1-red?logo=pytorch)](https://pytorch.org/get-started/previous-versions/)

- Run the following script to install the remaining dependencies:

  ```bash
- pip install -r requirements.txt
  ```

  ## Train PromCSE

  In the following section, we describe how to train a PromCSE model using our code.

  ### Evaluation
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1lanXViJzbmGM1bwm8AflNUKmrvDidg_3?usp=sharing)
@@ -180,84 +246,6 @@ All our experiments are conducted on Nvidia 3090 GPUs.
  | Valid steps | 125 | 125 | 125 | 125 |


- ## Usage
- We provide [tool.py](https://github.com/YJiangcm/PromCSE/blob/master/tool.py), which contains the following functions:
-
- **(1) encode sentences into embedding vectors;
- (2) compute cosine similarities between sentences;
- (3) given queries, retrieve the top-k semantically similar sentences for each query.**
-
- You can try it out by running
- ```bash
- python tool.py \
-     --model_name_or_path YuxinJiang/unsup-promcse-bert-base-uncased \
-     --pooler_type cls_before_pooler \
-     --pre_seq_len 16
- ```
-
- which is expected to output the following results.
- ```
- =========Calculate cosine similarities between queries and sentences============
-
- 100%|██████████| 1/1 [00:00<00:00, 1.18it/s]
- 100%|██████████| 1/1 [00:00<00:00, 42.26it/s]
- [[0.5904227  0.70516586 0.65185255 0.82756    0.6969594  0.85966974
-   0.58715546 0.8467339  0.6583321  0.6792214 ]
-  [0.6125869  0.73508096 0.61479807 0.6182762  0.6161849  0.59476817
-   0.595963   0.61386335 0.694822   0.938746  ]]
-
- =========Naive brute force search============
-
- 2022-10-09 11:59:06,004 : Encoding embeddings for sentences...
- 100%|██████████| 1/1 [00:00<00:00, 46.03it/s]
- 2022-10-09 11:59:06,029 : Building index...
- 2022-10-09 11:59:06,029 : Finished
- 100%|██████████| 1/1 [00:00<00:00, 95.40it/s]
- 100%|██████████| 1/1 [00:00<00:00, 115.25it/s]
- Retrieval results for query: A man is playing music.
- A man plays the piano. (cosine similarity: 0.8597)
- A man plays a guitar. (cosine similarity: 0.8467)
- A man plays the violin. (cosine similarity: 0.8276)
- A woman is reading. (cosine similarity: 0.7051)
- A man is eating food. (cosine similarity: 0.6969)
- A woman is taking a picture. (cosine similarity: 0.6792)
- A woman is slicing a meat. (cosine similarity: 0.6583)
- A man is lifting weights in a garage. (cosine similarity: 0.6518)
-
- Retrieval results for query: A woman is making a photo.
- A woman is taking a picture. (cosine similarity: 0.9387)
- A woman is reading. (cosine similarity: 0.7351)
- A woman is slicing a meat. (cosine similarity: 0.6948)
- A man plays the violin. (cosine similarity: 0.6183)
- A man is eating food. (cosine similarity: 0.6162)
- A man is lifting weights in a garage. (cosine similarity: 0.6148)
- A man plays a guitar. (cosine similarity: 0.6139)
- An animal is biting a persons finger. (cosine similarity: 0.6126)
-
-
- =========Search with Faiss backend============
-
- 2022-10-09 11:59:06,055 : Loading faiss with AVX2 support.
- 2022-10-09 11:59:06,092 : Successfully loaded faiss with AVX2 support.
- 2022-10-09 11:59:06,093 : Encoding embeddings for sentences...
- 100%|██████████| 1/1 [00:00<00:00, 4.17it/s]
- 2022-10-09 11:59:06,335 : Building index...
- 2022-10-09 11:59:06,335 : Use GPU-version faiss
- 2022-10-09 11:59:06,447 : Finished
- 100%|██████████| 1/1 [00:00<00:00, 101.44it/s]
- Retrieval results for query: A man is playing music.
- A man plays the piano. (cosine similarity: 0.8597)
- A man plays a guitar. (cosine similarity: 0.8467)
- A man plays the violin. (cosine similarity: 0.8276)
- A woman is reading. (cosine similarity: 0.7052)
- A man is eating food. (cosine similarity: 0.6970)
- A woman is taking a picture. (cosine similarity: 0.6792)
- A woman is slicing a meat. (cosine similarity: 0.6583)
- A man is lifting weights in a garage. (cosine similarity: 0.6519)
-
- Retrieval results for query: A woman is making a photo.
- A woman is taking a picture. (cosine similarity: 0.9387)
- A woman is reading. (cosine similarity: 0.7351)
- A woman is slicing a meat. (cosine similarity: 0.6948)
- A man plays the violin. (cosine similarity: 0.6183)
- A man is eating food. (cosine similarity: 0.6162)
- A man is lifting weights in a garage. (cosine similarity: 0.6148)
- A man plays a guitar. (cosine similarity: 0.6139)
- An animal is biting a persons finger. (cosine similarity: 0.6126)
- ```


  ## Citation


  # PromCSE: Improved Universal Sentence Embeddings with Prompt-based Contrastive Learning and Energy-based Learning
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1lanXViJzbmGM1bwm8AflNUKmrvDidg_3?usp=sharing)

+ Our code is modified from [SimCSE](https://github.com/princeton-nlp/SimCSE) and [P-tuning v2](https://github.com/THUDM/P-tuning-v2/), and we sincerely thank them for their excellent work.

+ **************************** **Updates** ****************************
+ * 2023/4/5: We released our sentence embedding [python package](#usage).
+ * 2022/3/3: We released a simple [colab notebook](https://colab.research.google.com/drive/1lanXViJzbmGM1bwm8AflNUKmrvDidg_3?usp=sharing) for a quick start!
+ * 2022/1/8: We released our model checkpoints on [huggingface](https://huggingface.co/YuxinJiang).
+ * 2022/10/9: We released the second version of [our paper](https://arxiv.org/pdf/2203.06875v2.pdf). Check it out!
+ * 2022/10/6: Our paper has been accepted to [**EMNLP 2022**](https://2022.emnlp.org/).
+ * 2022/3/14: We released the first version of [our paper](https://arxiv.org/pdf/2203.06875v1.pdf). Check it out!
+
+ ## Quick Links
+ - [Overview](#overview)
+ - [Model List](#model-list)
+ - [Usage](#usage)
+ - [Train PromCSE](#train-promcse)
+ - [Setups](#setups)
+ - [Evaluation](#evaluation)
+ - [Training](#training)
+ - [Citation](#citation)
+
+ ## Overview
+ <img src="https://github.com/YJiangcm/PromCSE/blob/master/figure/overview.jpg" width="700" height="320">
+
+ ## Model List

  We have released our supervised and unsupervised models on huggingface, which achieve **Top 1** results on 1 domain-shifted STS task and 4 standard STS tasks:

  <!-- <img src="https://github.com/YJiangcm/DCPCSE/blob/master/figure/leaderboard.png" width="700" height="380"> -->

+
  | Model | STS12 | STS13 | STS14 | STS15 | STS16 | STS-B | SICK-R | Avg. |
  |:-----------------------:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
+ | [YuxinJiang/unsup-promcse-bert-base-uncased](https://huggingface.co/YuxinJiang/unsup-promcse-bert-base-uncased) | 73.03 | 85.18 | 76.70 | 84.19 | 79.69 | 80.62 | 70.00 | 78.49 |
+ | [YuxinJiang/sup-promcse-roberta-base](https://huggingface.co/YuxinJiang/sup-promcse-roberta-base) | 76.75 | 85.86 | 80.98 | 86.51 | 83.51 | 86.58 | 80.41 | 82.94 |
+ | [YuxinJiang/sup-promcse-roberta-large](https://huggingface.co/YuxinJiang/sup-promcse-roberta-large) | 79.14 | 88.64 | 83.73 | 87.33 | 84.57 | 87.84 | 82.07 | 84.76 |

+ **Naming rules**: `unsup` and `sup` denote "unsupervised" (trained on the Wikipedia corpus) and "supervised" (trained on NLI datasets), respectively.


+ ## Usage
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1lanXViJzbmGM1bwm8AflNUKmrvDidg_3?usp=sharing)

+ We provide an easy-to-use Python package `promcse` which supports the following functions:

+ **(1) encode sentences into embedding vectors;
+ (2) compute cosine similarities between sentences;
+ (3) given queries, retrieve the top-k semantically similar sentences for each query.**

+ To use the tool, first install the `promcse` package from [PyPI](https://pypi.org/project/promcse/):
  ```bash
+ pip install promcse
+ ```
+ After installing the package, you can load our model with two lines of code:
+ ```python
+ from promcse import PromCSE
+ model = PromCSE("YuxinJiang/unsup-promcse-bert-base-uncased", "cls_before_pooler", 16)  # model name, pooler type, pre_seq_len
+ # model = PromCSE("YuxinJiang/sup-promcse-roberta-base")
+ # model = PromCSE("YuxinJiang/sup-promcse-roberta-large")
+ ```
+
+ Then you can use our model to **encode sentences into embeddings**:
+ ```python
+ embeddings = model.encode("A woman is reading.")
+ ```
+
+ **Compute the cosine similarities** between two groups of sentences:
+ ```python
+ sentences_a = ['A woman is reading.', 'A man is playing a guitar.']
+ sentences_b = ['He plays guitar.', 'A woman is making a photo.']
+ similarities = model.similarity(sentences_a, sentences_b)
  ```
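For reference, the `tool.py` output shown earlier prints these scores as a queries × sentences matrix of cosine similarities. Assuming `model.similarity` follows the same convention (a sketch, not documented package behaviour), the call above would return a 2×2 matrix that can be inspected like this, continuing from the snippet above:

```python
# Sketch under the assumption that `similarities` is a 2-D array-like of
# cosine scores, with rows indexed by sentences_a and columns by sentences_b.
for i, sent_a in enumerate(sentences_a):
    for j, sent_b in enumerate(sentences_b):
        print(f"{sent_a} <-> {sent_b}: {similarities[i][j]:.4f}")
```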

+ Or build an index for a group of sentences and **search** among them:
+ ```python
+ sentences = ['A woman is reading.', 'A man is playing a guitar.']
+ model.build_index(sentences)
+ results = model.search("He plays guitar.")
+ ```
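In the `tool.py` example earlier, each retrieved hit is printed together with its cosine similarity, so `results` presumably holds (sentence, score) pairs. A hedged sketch of printing them under that assumption, continuing from the snippet above:

```python
# Sketch assuming each entry of `results` is a (sentence, cosine_similarity) pair.
for sentence, score in results:
    print(f"{sentence} (cosine similarity: {score:.4f})")
```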
+

  ## Train PromCSE

  In the following section, we describe how to train a PromCSE model using our code.

+ ### Setups
+
+ [![Python](https://img.shields.io/badge/python-3.8.2-blue?logo=python&logoColor=FED643)](https://www.python.org/downloads/release/python-382/)
+ [![Pytorch](https://img.shields.io/badge/pytorch-1.7.1-red?logo=pytorch)](https://pytorch.org/get-started/previous-versions/)
+
+ Run the following script to install the remaining dependencies:
+
+ ```bash
+ pip install -r requirements.txt
+ ```

  ### Evaluation
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1lanXViJzbmGM1bwm8AflNUKmrvDidg_3?usp=sharing)

  | Valid steps | 125 | 125 | 125 | 125 |


  ## Citation