metadata

license: apache-2.0
language:
  - vi
library_name: transformers
pipeline_tag: text-classification
tags:
  - transformers
  - cross-encoder
  - rerank
datasets:
  - unicamp-dl/mmarco
widget:
  - text: tỉnh nào có diện tích lớn nhất việt nam
    output:
      - label: nghệ an có diện tích lớn nhất việt nam
        score: 0.9999
      - label: bắc ninh có diện tích nhỏ nhất việt nam
        score: 0.1705

Reranker

Usage
- Using FlagEmbedding
- Using Huggingface transformers
Fine tune
- Data format
Performance
Citation

Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. You can get a relevance score by inputting query and passage to the reranker. And the score can be mapped to a float value in [0,1] by sigmoid function.

Usage

Using FlagEmbedding

pip install -U FlagEmbedding

Get relevance scores (higher scores indicate more relevance):

from FlagEmbedding import FlagReranker

reranker = FlagReranker('namdp-ptit/ViRanker',
                        use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation

score = reranker.compute_score(['tỉnh nào có diện tích lớn nhất việt nam', 'nghệ an có diện tích lớn nhất việt nam'])
print(score)  # 11.140625

# You can map the scores into 0-1 by set "normalize=True", which will apply sigmoid function to the score
score = reranker.compute_score(['tỉnh nào có diện tích lớn nhất việt nam', 'nghệ an có diện tích lớn nhất việt nam'],
                               normalize=True)
print(score)  # 0.9999854895214452

scores = reranker.compute_score(
    [
        ['tỉnh nào có diện tích lớn nhất việt nam', 'nghệ an có diện tích lớn nhất việt nam'],
        ['tỉnh nào có diện tích lớn nhất việt nam', 'bắc ninh có diện tích nhỏ nhất việt nam']
    ]
)
print(scores)  # [11.140625, -1.58203125]

# You can map the scores into 0-1 by set "normalize=True", which will apply sigmoid function to the score
scores = reranker.compute_score(
    [
        ['tỉnh nào có diện tích lớn nhất việt nam', 'nghệ an có diện tích lớn nhất việt nam'],
        ['tỉnh nào có diện tích lớn nhất việt nam', 'bắc ninh có diện tích nhỏ nhất việt nam']
    ],
    normalize=True
)
print(scores)  # [0.99998548952144523, 0.17050799982688053]

Using Huggingface transformers

pip install -U transformers

Get relevance scores (higher scores indicate more relevance):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('namdp-ptit/ViRanker')
model = AutoModelForSequenceClassification.from_pretrained('namdp-ptit/ViRanker')
model.eval()

pairs = [
    ['tỉnh nào có diện tích lớn nhất việt nam', 'nghệ an có diện tích lớn nhất việt nam'],
    ['tỉnh nào có diện tích lớn nhất việt nam', 'bắc ninh có diện tích nhỏ nhất việt nam']
],
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)

Fine-tune

Data Format

Train data should be a json file, where each line is a dict like this:

{"query": str, "pos": List[str], "neg": List[str]}

query is the query, and pos is a list of positive texts, neg is a list of negative texts. If you have no negative texts for a query, you can random sample some from the entire corpus as the negatives.

Performance

Below is a comparision table of the results we achieved compared to some other pre-trained Cross-Encoders on the MS MMarco Passage Reranking - Vi - Dev dataset.

Model-Name	NDCG@3	MRR@3	NDCG@5	MRR@5	NDCG@10	MRR@10	Docs / Sec
namdp-ptit/ViRanker	0.6685	0.6564	0.6842	0.6811	0.7278	0.6985	2.02
itdainb/PhoRanker	0.6625	0.6458	0.7147	0.6731	0.7422	0.6830	15
kien-vu-uet/finetuned-phobert-passage-rerank-best-eval	0.0963	0.0883	0.1396	0.1131	0.1681	0.1246	15
BAAI/bge-reranker-v2-m3	0.6087	0.5841	0.6513	0.6062	0.6872	0.62091	3.51
BAAI/bge-reranker-v2-gemma	0.6088	0.5908	0.6446	0.6108	0.6785	0.6249	1.29

Citation

Please cite as

@misc{ViRanker,
  title={ViRanker: A Cross-encoder Model for Vietnamese Text Ranking},
  author={Nam Dang Phuong},
  year={2024},
  publisher={Huggingface},
}