---
license: apache-2.0
---

This model is a merge of [bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5), [GIST-Embedding-v0](https://huggingface.co/avsolatorio/GIST-Embedding-v0), and [gte-base](https://huggingface.co/thenlper/gte-base).

It targets retrieval tasks while also performing well on other tasks (see the evaluation results below).

## Usage

### For retrieval tasks

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Hugging Face access token; set this to your own token
token = ""

# Sentences we want sentence embeddings for
sentences = ["this is a test sentence", "this is another test sentence"]

# Instruction prefix for retrieval tasks
instruction = "Represent this sentence for searching relevant passages: "

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('Marqo/marqo-merged-bge-gist-gte-base', token=token)
model = AutoModel.from_pretrained('Marqo/marqo-merged-bge-gist-gte-base', token=token)
model.eval()

# Tokenize sentences, once as-is and once with the instruction prefix
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
encoded_input_with_prefixing = tokenizer([instruction + q for q in sentences], padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings and take the first token ([CLS]) as the sentence embedding
with torch.no_grad():
    model_output = model(**encoded_input)
    model_output_with_prefixing = model(**encoded_input_with_prefixing)
    sentence_embeddings = model_output[0][:, 0]
    sentence_embeddings_with_prefixing = model_output_with_prefixing[0][:, 0]

# Average the plain and prefixed embeddings
sentence_embeddings_avg = (sentence_embeddings + sentence_embeddings_with_prefixing) / 2

# Normalize embeddings
sentence_embeddings_avg = torch.nn.functional.normalize(sentence_embeddings_avg, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings_avg)
```
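
Once embeddings are averaged and normalized as above, retrieval comes down to ranking passages by cosine similarity. The following is a minimal sketch of that step, not part of the original card: `average_prefix_embeddings` and `rank_passages` are hypothetical helper names, and the 3-dimensional toy tensors stand in for real model outputs so the snippet runs without downloading the model. This card does not specify how passages should be embedded, so the passage vectors here are arbitrary.

```python
import torch
import torch.nn.functional as F


def average_prefix_embeddings(plain: torch.Tensor, prefixed: torch.Tensor) -> torch.Tensor:
    """Average the plain and prefixed embeddings, then L2-normalize each row."""
    return F.normalize((plain + prefixed) / 2, p=2, dim=1)


def rank_passages(query_emb: torch.Tensor, passage_embs: torch.Tensor) -> list[int]:
    """Return passage indices sorted by descending cosine similarity to the query."""
    # All vectors are unit-length, so the dot product equals cosine similarity.
    scores = passage_embs @ query_emb
    return torch.argsort(scores, descending=True).tolist()


# Toy stand-ins for model outputs (assumption: illustration only).
query = average_prefix_embeddings(
    torch.tensor([[1.0, 0.0, 0.0]]),  # embedding without the instruction prefix
    torch.tensor([[0.8, 0.2, 0.0]]),  # embedding with the instruction prefix
)[0]
passages = F.normalize(
    torch.tensor([[0.0, 1.0, 0.0], [1.0, 0.1, 0.0], [0.0, 0.0, 1.0]]), p=2, dim=1
)

ranking = rank_passages(query, passages)
print("Passage indices, best match first:", ranking)
```

With real embeddings, `query` would be one row of `sentence_embeddings_avg` and `passages` a matrix of normalized passage embeddings.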

## Evaluation

<img src="slerp.png" alt="109M models retrieval benchmarks" width="650" style="margin-left:auto; margin-right:auto; display:block"/>

| Model | Average | ArguAna | ClimateFEVER | CQADupstackRetrieval | DBPedia | FEVER | FiQA2018 | HotpotQA | MSMARCO | NFCorpus | NQ | QuoraRetrieval | SCIDOCS | SciFact | Touche2020 | TRECCOVID |
| ----- | ------- | ------- | ------------ | -------------------- | ------- | ----- | -------- | -------- | ------- | -------- | -- | -------------- | ------- | ------- | ---------- | --------- |
| Slerp (average prefixing) | 53.37 | 63.45 | 31.66 | 42.51 | 42.15 | 86.53 | 43.28 | 73.72 | 39.16 | 37.66 | 54.39 | 88.96 | 22.7 | 76.66 | 21.46 | 76.23 |
| Slerp (prefixing) | 53.18 | 63.58 | 30.67 | 43.23 | 41.52 | 86.54 | 41.28 | 71.43 | 41.16 | 38.01 | 55.28 | 88.72 | 22.75 | 75.05 | 22.15 | 76.27 |
| BGE | 53.25 | 63.61 | 31.17 | 42.35 | 40.77 | 86.29 | 40.65 | 72.6 | 41.35 | 37.39 | 54.15 | 88.9 | 21.73 | 74.04 | 25.7 | 78.07 |
| GIST | 52.31 | 62.62 | 31.49 | 43.2 | 41.7 | 86.65 | 40.64 | 68.92 | 40.64 | 37.64 | 53.43 | 88.81 | 23.47 | 75.29 | 20.58 | 69.6 |
| GTE | 51.14 | 57.12 | 28.1 | 42.91 | 41.19 | 81.52 | 40.76 | 65.75 | 40.21 | 37.9 | 52.84 | 88.15 | 23.13 | 76.18 | 22.55 | 68.78 |
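
The "Slerp" rows above presumably refer to spherical linear interpolation, a common way to blend model weights. This card does not describe the actual merge procedure, so the following is only a generic sketch of slerp between two plain vectors, not the recipe used to build this model. The function name `slerp` and the fallback threshold are assumptions made for illustration.

```python
import math


def slerp(v0: list[float], v1: list[float], t: float) -> list[float]:
    """Spherical linear interpolation between two vectors at fraction t in [0, 1]."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    # Angle between the two vectors, from their normalized dot product.
    dot = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    omega = math.acos(dot)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-6:
        # Nearly parallel or antiparallel vectors: fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * omega) / sin_omega
    s1 = math.sin(t * omega) / sin_omega
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]


# Halfway between two orthogonal unit vectors stays on the unit circle.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
print(midpoint)
```

Unlike a plain average, the interpolated point keeps unit length when both inputs are unit vectors, which is the usual motivation for choosing slerp over linear averaging.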