---
base_model: BAAI/bge-reranker-v2-m3
language:
- en
- ru
license: mit
pipeline_tag: text-classification
tags:
- transformers
- sentence-transformers
- text-embeddings-inference
---


# Model for English and Russian

This is a truncated version of [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3).

Only English and Russian tokens are kept in the vocabulary, making the model 1.5× smaller than the original while producing identical scores for English and Russian input.

The model has been truncated in [this notebook](https://colab.research.google.com/drive/19IFjWpJpxQie1gtHSvDeoKk7CQtpy6bT?usp=sharing). 
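
The truncation works because a reranker's output depends only on the embedding rows of tokens that actually appear in its input, so rows for tokens of other languages can be dropped and the tokenizer remapped to the smaller vocabulary. Below is a minimal sketch of the idea, not the notebook's exact code; the `kept_ids` tensor is a placeholder for the English/Russian token ids, and the full procedure (including tokenizer remapping) is in the notebook.

```python
# Minimal sketch of vocabulary truncation; not the notebook's exact code.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-v2-m3')

# Placeholder: ids of the tokens to keep (English + Russian), collected elsewhere.
kept_ids = torch.arange(1000)

old_embeddings = model.get_input_embeddings()                  # ~250k rows originally
new_embeddings = torch.nn.Embedding(len(kept_ids), old_embeddings.embedding_dim)
new_embeddings.weight.data = old_embeddings.weight.data[kept_ids].clone()
model.set_input_embeddings(new_embeddings)
model.config.vocab_size = len(kept_ids)
```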

## Usage

### Generate scores for text pairs

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('qilowoq/bge-reranker-v2-m3-en-ru')
model = AutoModelForSequenceClassification.from_pretrained('qilowoq/bge-reranker-v2-m3-en-ru')
model.eval()

pairs = [('How many people live in Berlin?', 'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.'),
         # Russian pair, in English: ('What is the area of Berlin?', 'The area of Berlin is 891.8 square kilometers.')
         ('Какая площадь Берлина?', 'Площадь Берлина составляет 891,8 квадратных километров.')]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt')
    # One relevance logit per (query, passage) pair; higher means more relevant.
    scores = model(**inputs, return_dict=True).logits.view(-1).float()
    print(scores)
```
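
The scores are raw logits. As with the original bge-reranker models, they can be mapped into the 0-1 range with a sigmoid if you need normalized relevance scores:

```python
# Optional: squash raw logits into [0, 1] relevance scores.
probs = torch.sigmoid(scores)
print(probs)
```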


## Citation

If you find this repository useful, please consider giving it a star and a citation:

```bibtex
@misc{li2023making,
      title={Making Large Language Models A Better Foundation For Dense Retrieval}, 
      author={Chaofan Li and Zheng Liu and Shitao Xiao and Yingxia Shao},
      year={2023},
      eprint={2312.15503},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@misc{chen2024bge,
      title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation}, 
      author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
      year={2024},
      eprint={2402.03216},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```