---
license: apache-2.0
language:
- vi
library_name: transformers
pipeline_tag: text-classification
tags:
- transformers
- cross-encoder
- rerank
datasets:
- unicamp-dl/mmarco
widget:
- text: tỉnh nào có diện tích lớn nhất việt nam.
  output:
  - label: >-
      nghệ an có diện tích lớn nhất việt nam
    score: 0.999989
  - label: >-
      bắc ninh có diện tích nhỏ nhất việt nam
    score: 0.372391
---

# Reranker

* [Usage](#usage)
  * [Using FlagEmbedding](#using-flagembedding)
  * [Using Huggingface transformers](#using-huggingface-transformers)
* [Fine-tune](#fine-tune)
  * [Data Format](#data-format)

Unlike an embedding model, a reranker takes a question and a document as input and directly outputs a similarity score instead of an embedding. You can get a relevance score by feeding a query and a passage to the reranker, and the score can be mapped to a float value in [0, 1] with a sigmoid function.

## Usage

### Using FlagEmbedding

```
pip install -U FlagEmbedding
```

Get relevance scores (higher scores indicate more relevance):

```python
from FlagEmbedding import FlagReranker

# Setting use_fp16 to True speeds up computation with a slight performance degradation
reranker = FlagReranker('namdp/bge-reranker-vietnamese', use_fp16=True)

score = reranker.compute_score(['query', 'passage'])
print(score)  # -5.65234375

# You can map the score into 0-1 by setting "normalize=True", which applies a sigmoid function to the score
score = reranker.compute_score(['query', 'passage'], normalize=True)
print(score)  # 0.003497010252573502

scores = reranker.compute_score([
    ['what is panda?', 'hi'],
    ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']
])
print(scores)  # [-8.1875, 5.26171875]

# You can map the scores into 0-1 by setting "normalize=True", which applies a sigmoid function to the scores
scores = reranker.compute_score([
    ['what is panda?', 'hi'],
    ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']
], normalize=True)
print(scores)  # [0.00027803096387751553, 0.9948403768236574]
```

### Using Huggingface transformers

```
pip install -U transformers
```

Get relevance scores (higher scores indicate more relevance):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('namdp/bge-reranker-vietnamese')
model = AutoModelForSequenceClassification.from_pretrained('namdp/bge-reranker-vietnamese')
model.eval()

pairs = [
    ['what is panda?', 'hi'],
    ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']
]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1).float()
    print(scores)
```

## Fine-tune

### Data Format

Training data should be a JSON Lines file, where each line is a dict like this:

```
{"query": str, "pos": List[str], "neg": List[str]}
```

`query` is the query, `pos` is a list of positive texts, and `neg` is a list of negative texts. If you have no negative texts for a query, you can randomly sample some from the entire corpus as negatives.
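For illustration, a single training line could look like the following. This is a hypothetical example reusing the widget texts from the top of this card; the second negative is invented for demonstration:

```
{"query": "tỉnh nào có diện tích lớn nhất việt nam", "pos": ["nghệ an có diện tích lớn nhất việt nam"], "neg": ["bắc ninh có diện tích nhỏ nhất việt nam", "hà nội là thủ đô của việt nam"]}
```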
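If a query has no negatives, below is a minimal sketch of the random sampling step described above. The names `corpus`, `build_training_line`, and `train.jsonl` are illustrative assumptions, not part of any library API:

```python
import json
import random

def build_training_line(query, positives, corpus, num_negs=8):
    # Randomly sample negatives from the corpus, skipping the known positives
    candidates = [p for p in corpus if p not in positives]
    negatives = random.sample(candidates, k=min(num_negs, len(candidates)))
    return {"query": query, "pos": positives, "neg": negatives}

# Illustrative usage: append one line of training data in the format above
corpus = ["passage 1", "passage 2", "passage 3", "passage 4"]
line = build_training_line("query", ["passage 1"], corpus)
with open("train.jsonl", "a", encoding="utf-8") as f:
    # ensure_ascii=False keeps Vietnamese characters readable in the file
    f.write(json.dumps(line, ensure_ascii=False) + "\n")
```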