---
license: apache-2.0
language:
- vi
library_name: transformers
pipeline_tag: text-classification
tags:
- transformers
- cross-encoder
- rerank
datasets:
- unicamp-dl/mmarco
widget:
- text: tỉnh nào có diện tích lớn nhất việt nam
  output:
  - label: >-
      nghệ an có diện tích lớn nhất việt nam
    score: 0.9999
  - label: >-
      bắc ninh có diện tích nhỏ nhất việt nam
    score: 0.1705
---

# Reranker

* [Usage](#usage)
    * [Using FlagEmbedding](#using-flagembedding)
    * [Using Huggingface transformers](#using-huggingface-transformers)
* [Fine-tune](#fine-tune)
    * [Data format](#data-format)

Unlike an embedding model, a reranker takes a query and a document together as input and directly outputs a relevance score instead of an embedding.
Pass a query and a passage to the reranker to get a relevance score.
Applying the sigmoid function maps that score to a float value in [0, 1].
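
To make the sigmoid mapping concrete, here is a minimal sketch in plain Python. The `sigmoid` helper is written for illustration only and is not part of any library; the input values are the raw example scores quoted in the Usage section.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw relevance score (logit) to a value between 0 and 1."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


# Raw example scores quoted in the Usage section of this README.
for logit in (-8.1875, -5.65234375, 5.26171875):
    print(f"{logit:>12} -> {sigmoid(logit):.6f}")
```

Because the sigmoid is monotonic, normalizing scores changes their scale but not the ranking of the passages.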

## Usage

### Using FlagEmbedding

```
pip install -U FlagEmbedding
```

Get relevance scores (higher scores indicate more relevance):

```python
from FlagEmbedding import FlagReranker

# Setting use_fp16 to True speeds up computation with a slight performance degradation
reranker = FlagReranker('namdp/bge-reranker-vietnamese', use_fp16=True)

# Score a single [query, passage] pair
score = reranker.compute_score(['query', 'passage'])
print(score)  # -5.65234375

# Set normalize=True to map the score into [0, 1] by applying the sigmoid function
score = reranker.compute_score(['query', 'passage'], normalize=True)
print(score)  # 0.003497010252573502

# Score several [query, passage] pairs in one call
scores = reranker.compute_score([['what is panda?', 'hi'],
                                 ['what is panda?',
                                  'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
print(scores)  # [-8.1875, 5.26171875]

# Set normalize=True to map the scores into [0, 1] by applying the sigmoid function
scores = reranker.compute_score([['what is panda?', 'hi'],
                                 ['what is panda?',
                                  'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']],
                                normalize=True)
print(scores)  # [0.00027803096387751553, 0.9948403768236574]
```
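
In practice the scores are used to reorder a list of candidate passages for one query. The following is a minimal sketch of that step, not part of FlagEmbedding: `rerank`, `score_fn`, and `top_k` are hypothetical names, and `overlap_score` is a toy stand-in scorer so the example runs without downloading the model.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    passages: Sequence[str],
    score_fn: Callable[[List[List[str]]], List[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every [query, passage] pair and return the top_k passages, best first."""
    pairs = [[query, passage] for passage in passages]
    scores = score_fn(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def overlap_score(pairs: List[List[str]]) -> List[float]:
    """Toy stand-in scorer (an assumption, not the model): counts shared words."""
    return [
        float(len(set(q.lower().split()) & set(p.lower().split())))
        for q, p in pairs
    ]


results = rerank(
    "what is panda?",
    ["hi", "The giant panda is a bear species endemic to China."],
    overlap_score,
)
for passage, score in results:
    print(f"{score:.1f}  {passage}")
```

With the real model, `reranker.compute_score` could be passed as `score_fn`, assuming it returns one score per pair when given a list of pairs, as in the example above.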

### Using Huggingface transformers

```
pip install -U transformers
```

Get relevance scores (higher scores indicate more relevance):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('namdp/bge-reranker-vietnamese')
model = AutoModelForSequenceClassification.from_pretrained('namdp/bge-reranker-vietnamese')
model.eval()  # inference mode: disables dropout

pairs = [['what is panda?', 'hi'],
         ['what is panda?',
          'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
with torch.no_grad():  # gradients are not needed for scoring
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    # One logit per pair; flatten to a 1-D tensor of raw relevance scores
    scores = model(**inputs, return_dict=True).logits.view(-1).float()
    print(scores)
```
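
When there are many candidate pairs, tokenizing and scoring them all in one call may use more memory than is available. A minimal sketch of splitting the pairs into fixed-size batches follows; `iter_batches` and `batch_size` are hypothetical names introduced here, and a suitable batch size depends on your hardware.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int = 32) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


pairs = [["what is panda?", f"candidate passage {i}"] for i in range(5)]
for batch in iter_batches(pairs, batch_size=2):
    # Each batch could be passed to the tokenizer and model as in the example above.
    print(len(batch), batch[0][1])
```

The per-batch scores can then be concatenated in order to recover one score per original pair.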

## Fine-tune

### Data Format

Training data should be a JSON Lines file, where each line is a JSON object like this:

```
{"query": str, "pos": List[str], "neg": List[str]}
```


`query` is the query string, `pos` is a list of positive texts, and `neg` is a list of negative texts. If you have no negative texts for a query, you can randomly sample some from the entire corpus as negatives.
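
The following is a minimal sketch of building one training line with randomly sampled negatives. `build_example`, `num_negatives`, and the toy corpus are assumptions made for illustration; they are not part of FlagEmbedding or of this model's training pipeline.

```python
import json
import random
from typing import Dict, List, Optional, Sequence


def build_example(
    query: str,
    positives: Sequence[str],
    corpus: Sequence[str],
    num_negatives: int = 2,
    rng: Optional[random.Random] = None,
) -> Dict[str, object]:
    """Build a {"query", "pos", "neg"} record, sampling negatives from the corpus."""
    rng = rng or random.Random()
    positive_set = set(positives)
    # Never sample a known positive as a negative.
    candidates = [text for text in corpus if text not in positive_set]
    negatives = rng.sample(candidates, min(num_negatives, len(candidates)))
    return {"query": query, "pos": list(positives), "neg": negatives}


# Toy corpus for illustration only.
corpus = [
    "nghệ an có diện tích lớn nhất việt nam",
    "bắc ninh có diện tích nhỏ nhất việt nam",
    "hà nội là thủ đô của việt nam",
]
example = build_example(
    "tỉnh nào có diện tích lớn nhất việt nam",
    ["nghệ an có diện tích lớn nhất việt nam"],
    corpus,
    rng=random.Random(0),
)
# One JSON object per line; ensure_ascii=False keeps Vietnamese text readable.
print(json.dumps(example, ensure_ascii=False))
```

Randomly sampled negatives can occasionally include an unlabeled relevant passage, so spot-checking a few samples is worthwhile.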