---
language:
- vi
license: apache-2.0
library_name: transformers
tags:
- transformers
- cross-encoder
- rerank
datasets:
- unicamp-dl/mmarco
pipeline_tag: text-classification
widget:
- text: tỉnh nào có diện tích lớn nhất việt nam
  output:
  - label: nghệ an có diện tích lớn nhất việt nam
    score: 0.9999
  - label: bắc ninh có diện tích nhỏ nhất việt nam
    score: 0.1705
---

# Reranker

* [Usage](#usage)
    * [Using FlagEmbedding](#using-flagembedding)
    * [Using Huggingface transformers](#using-huggingface-transformers)
* [Fine-tune](#fine-tune)
    * [Data format](#data-format)
* [Performance](#performance)
* [Citation](#citation)

Unlike an embedding model, a reranker takes a query and a passage as input and directly outputs a relevance score
instead of an embedding.
The raw score is an unbounded number; higher means more relevant.
It can be mapped to a float value between 0 and 1 with the sigmoid function.
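
As a minimal sketch of that mapping, the snippet below applies the logistic sigmoid to the two raw scores reported in the usage examples of this card. The helper name `sigmoid` is ours for illustration and is not part of any library API.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw relevance score (logit) into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Raw scores taken from the usage examples in this card.
raw_scores = [11.140625, -1.58203125]
normalized = [sigmoid(s) for s in raw_scores]
print(normalized)  # approximately [0.99999, 0.17051]
```

Because the sigmoid is monotonic, normalizing changes the scale of the scores but not the order of the ranked passages.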

## Usage

### Using FlagEmbedding

```
pip install -U FlagEmbedding
```

Get relevance scores (higher scores indicate more relevance):

```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker('namdp-ptit/ViRanker',
                        use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation

score = reranker.compute_score(['tỉnh nào có diện tích lớn nhất việt nam', 'nghệ an có diện tích lớn nhất việt nam'])
print(score)  # 11.140625

# Set normalize=True to map the scores into the 0-1 range by applying a sigmoid function
score = reranker.compute_score(['tỉnh nào có diện tích lớn nhất việt nam', 'nghệ an có diện tích lớn nhất việt nam'],
                               normalize=True)
print(score)  # 0.9999854895214452

scores = reranker.compute_score(
    [
        ['tỉnh nào có diện tích lớn nhất việt nam', 'nghệ an có diện tích lớn nhất việt nam'],
        ['tỉnh nào có diện tích lớn nhất việt nam', 'bắc ninh có diện tích nhỏ nhất việt nam']
    ]
)
print(scores)  # [11.140625, -1.58203125]

# Set normalize=True to map the scores into the 0-1 range by applying a sigmoid function
scores = reranker.compute_score(
    [
        ['tỉnh nào có diện tích lớn nhất việt nam', 'nghệ an có diện tích lớn nhất việt nam'],
        ['tỉnh nào có diện tích lớn nhất việt nam', 'bắc ninh có diện tích nhỏ nhất việt nam']
    ],
    normalize=True
)
print(scores)  # [0.99998548952144523, 0.17050799982688053]
```

### Using Huggingface transformers

```
pip install -U transformers
```

Get relevance scores (higher scores indicate more relevance):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('namdp-ptit/ViRanker')
model = AutoModelForSequenceClassification.from_pretrained('namdp-ptit/ViRanker')
model.eval()

pairs = [
    ['tỉnh nào có diện tích lớn nhất việt nam', 'nghệ an có diện tích lớn nhất việt nam'],
    ['tỉnh nào có diện tích lớn nhất việt nam', 'bắc ninh có diện tích nhỏ nhất việt nam']
]  # no trailing comma here: it would wrap the list in a one-element tuple
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)
```
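
Once scores are available, reranking amounts to sorting the candidate passages by score. The sketch below shows this step in plain Python; `rank_passages` is a hypothetical helper name, and the scores are the raw values reported earlier in this card (with a `torch` tensor, `scores.tolist()` would give an equivalent list).

```python
from typing import List, Tuple


def rank_passages(passages: List[str], scores: List[float]) -> List[Tuple[str, float]]:
    """Return (passage, score) pairs sorted from most to least relevant."""
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    return sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)


passages = [
    'bắc ninh có diện tích nhỏ nhất việt nam',
    'nghệ an có diện tích lớn nhất việt nam',
]
scores = [-1.58203125, 11.140625]  # raw scores reported in this card

ranked = rank_passages(passages, scores)
for passage, score in ranked:
    print(f"{score:.4f}\t{passage}")
```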

## Fine-tune

### Data Format

Training data should be a JSON Lines file, where each line is a JSON object like this:

```
{"query": str, "pos": List[str], "neg": List[str]}
```

`query` is the query, `pos` is a list of positive texts, and `neg` is a list of negative texts. If you have no negative
texts for a query, you can randomly sample some from the entire corpus to use as negatives.
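
The random-negative fallback can be sketched as follows. This is an illustration under our own assumptions, not the training pipeline used for this model: `build_example` is a hypothetical helper, the corpus is a toy list, and the number of negatives is arbitrary.

```python
import json
import random
from typing import Dict, List


def build_example(query: str, positives: List[str], corpus: List[str],
                  num_negatives: int, rng: random.Random) -> Dict[str, object]:
    """Build one training record, drawing negatives at random from the corpus."""
    positive_set = set(positives)
    # Exclude known positives so they are never sampled as negatives.
    candidates = [text for text in corpus if text not in positive_set]
    negatives = rng.sample(candidates, min(num_negatives, len(candidates)))
    return {"query": query, "pos": positives, "neg": negatives}


# Toy corpus for illustration only.
corpus = [
    'nghệ an có diện tích lớn nhất việt nam',
    'bắc ninh có diện tích nhỏ nhất việt nam',
    'hà nội là thủ đô của việt nam',
    'sông mê kông chảy qua nhiều quốc gia',
]
rng = random.Random(0)  # fixed seed so the sampling is reproducible
example = build_example(
    query='tỉnh nào có diện tích lớn nhất việt nam',
    positives=['nghệ an có diện tích lớn nhất việt nam'],
    corpus=corpus,
    num_negatives=2,
    rng=rng,
)

# One JSON object per line; ensure_ascii=False keeps Vietnamese text readable.
line = json.dumps(example, ensure_ascii=False)
print(line)
```

Randomly sampled negatives are usually easy for the model to reject, so they are best treated as a starting point when no mined negatives are available.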

## Performance

The table below compares our results with those of other pre-trained cross-encoders on
the [MS MMarco Passage Reranking - Vi - Dev](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.

| Model-Name                                                                                                                              | NDCG@3     | MRR@3      | NDCG@5     | MRR@5      | NDCG@10    | MRR@10     | Docs / Sec |
|-----------------------------------------------------------------------------------------------------------------------------------------|:-----------|:-----------|:-----------|:-----------|:-----------|:-----------|:-----------|
| [namdp-ptit/ViRanker](https://huggingface.co/namdp-ptit/ViRanker)                                                                       | **0.6685** | **0.6564** | 0.6842     | **0.6811** | 0.7278     | **0.6985** | 2.02       |
| [itdainb/PhoRanker](https://huggingface.co/itdainb/PhoRanker)                                                                           | 0.6625     | 0.6458     | **0.7147** | 0.6731     | **0.7422** | 0.6830     | **15**     |
| [kien-vu-uet/finetuned-phobert-passage-rerank-best-eval](https://huggingface.co/kien-vu-uet/finetuned-phobert-passage-rerank-best-eval) | 0.0963     | 0.0883     | 0.1396     | 0.1131     | 0.1681     | 0.1246     | **15**     |
| [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)                                                               | 0.6087     | 0.5841     | 0.6513     | 0.6062     | 0.6872     | 0.62091    | 3.51       |
| [BAAI/bge-reranker-v2-gemma](https://huggingface.co/BAAI/bge-reranker-v2-gemma)                                                         | 0.6088     | 0.5908     | 0.6446     | 0.6108     | 0.6785     | 0.6249     | 1.29       |
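
For readers unfamiliar with the metrics, the sketch below shows one common way to compute MRR@k and NDCG@k for a single query with binary relevance labels. The evaluation script behind the table is not given in this card, so the function names, the binary-label assumption, and the toy ranking are ours; reported figures would be averages of such per-query values.

```python
import math
from typing import List


def mrr_at_k(relevance: List[int], k: int) -> float:
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def dcg_at_k(relevance: List[int], k: int) -> float:
    """Discounted cumulative gain over the top k positions."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevance[:k], start=1))


def ndcg_at_k(relevance: List[int], k: int) -> float:
    """DCG@k divided by the DCG@k of the ideal (best possible) ordering."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0


# Toy example: relevance labels of passages in the order the reranker returned them.
ranked_relevance = [0, 1, 0, 0, 1]
mrr3 = mrr_at_k(ranked_relevance, 3)
ndcg3 = ndcg_at_k(ranked_relevance, 3)
print(mrr3)   # 0.5, because the first relevant passage is at rank 2
print(ndcg3)
```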

## Citation

Please cite as

```bibtex
@misc{ViRanker,
  title={ViRanker: A Cross-encoder Model for Vietnamese Text Ranking},
  author={Nam Dang Phuong},
  year={2024},
  publisher={Huggingface},
}
```