ViRanker / README.md
Dang Phuong Nam
metadata
license: apache-2.0
language:
  - vi
library_name: transformers
pipeline_tag: text-classification
tags:
  - transformers
  - cross-encoder
  - rerank
datasets:
  - unicamp-dl/mmarco
widget:
  - text: tỉnh nào  diện tích lớn nhất việt nam.
    output:
      - label: nghệ an  diện tích lớn nhất việt nam
        score: 0.999989
      - label: bắc ninh  diện tích nhỏ nhất việt nam
        score: 0.372391

Reranker

Unlike an embedding model, a reranker takes a query and a document as input and directly outputs a similarity score instead of an embedding. You get a relevance score by passing a query and a passage to the reranker. The raw score is an unbounded logit, and it can be mapped to a float value in [0, 1] with the sigmoid function.
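As a minimal sketch of that mapping, the snippet below applies a sigmoid to raw scores in plain Python. The helper name `sigmoid` is illustrative and not part of any library; the input values are the raw scores printed in the usage examples in this README.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw reranker logit to a value in (0, 1)."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Raw scores as printed in the usage examples of this README.
raw_scores = [-8.1875, 5.26171875]
normalized = [sigmoid(s) for s in raw_scores]
print(normalized)
```

A strongly negative logit maps close to 0 and a strongly positive one close to 1, so the ordering of passages is the same before and after normalization.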

Usage

Using FlagEmbedding

pip install -U FlagEmbedding

Get relevance scores (higher scores indicate more relevance):

from FlagEmbedding import FlagReranker

reranker = FlagReranker('namdp/bge-reranker-vietnamese',
                        use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation

score = reranker.compute_score(['query', 'passage'])
print(score)  # -5.65234375

# You can map the scores into 0-1 by setting "normalize=True", which applies the sigmoid function to the score
score = reranker.compute_score(['query', 'passage'], normalize=True)
print(score)  # 0.003497010252573502

scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?',
                                                            'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
print(scores)  # [-8.1875, 5.26171875]

# You can map the scores into 0-1 by setting "normalize=True", which applies the sigmoid function to the score
scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?',
                                                            'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']],
                                normalize=True)
print(scores)  # [0.00027803096387751553, 0.9948403768236574]

Using Hugging Face transformers

pip install -U transformers

Get relevance scores (higher scores indicate more relevance):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('namdp/bge-reranker-vietnamese')
model = AutoModelForSequenceClassification.from_pretrained('namdp/bge-reranker-vietnamese')
model.eval()

pairs = [['what is panda?', 'hi'], ['what is panda?',
                                    'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)
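Once scores are available, reranking amounts to sorting the candidate passages by score in descending order. The sketch below assumes the scores have already been converted to a plain Python list (for example with `scores.tolist()`); the helper `rank_passages` is a hypothetical name, and the score values are copied from the FlagEmbedding example above purely for illustration.

```python
from typing import List, Tuple


def rank_passages(passages: List[str], scores: List[float]) -> List[Tuple[str, float]]:
    """Return (passage, score) pairs sorted from most to least relevant."""
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    return sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)


passages = [
    'hi',
    'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear '
    'or simply panda, is a bear species endemic to China.',
]
# Illustrative values taken from the FlagEmbedding output shown earlier.
example_scores = [-8.1875, 5.26171875]

ranked = rank_passages(passages, example_scores)
for passage, score in ranked:
    print(f"{score:.4f}  {passage[:60]}")
```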

Fine-tune

Data Format

Training data should be a JSON Lines file, where each line is a JSON object like this:

{"query": str, "pos": List[str], "neg": List[str]}

query is the query text, pos is a list of positive (relevant) texts, and neg is a list of negative (irrelevant) texts. If you have no negative texts for a query, you can randomly sample some from the entire corpus to use as negatives.
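The random-sampling fallback could look roughly like the sketch below. It assumes the corpus fits in memory as a list of strings; the helper `add_random_negatives`, the corpus contents, and the `num_neg` and `seed` parameters are all hypothetical and not part of any training script.

```python
import json
import random
from typing import Dict, List


def add_random_negatives(example: Dict, corpus: List[str],
                         num_neg: int = 2, seed: int = 0) -> Dict:
    """Return a copy of `example` whose "neg" list is sampled from `corpus`."""
    rng = random.Random(seed)
    # Never sample the query itself or one of its positives as a negative.
    excluded = set(example["pos"]) | {example["query"]}
    candidates = [text for text in corpus if text not in excluded]
    negatives = rng.sample(candidates, min(num_neg, len(candidates)))
    return {"query": example["query"], "pos": list(example["pos"]), "neg": negatives}


# Hypothetical toy corpus, for illustration only.
corpus = [
    "The giant panda is a bear species endemic to China.",
    "Hanoi is the capital of Vietnam.",
    "Python is a programming language.",
    "The Mekong is a river in Southeast Asia.",
]
example = {
    "query": "what is panda?",
    "pos": ["The giant panda is a bear species endemic to China."],
    "neg": [],
}

filled = add_random_negatives(example, corpus)
# One JSON object per line, matching the data format described above.
line = json.dumps(filled, ensure_ascii=False)
print(line)
```

Randomly sampled negatives can occasionally be relevant to the query by chance, so a quick manual check of a few lines is worthwhile before training.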