namdp-ptit
/

ViRanker


Dang Phuong Nam


	---
	license: apache-2.0
	language:
	- vi
	library_name: transformers
	pipeline_tag: text-classification
	tags:
	- transformers
	- cross-encoder
	- rerank
	datasets:
	- unicamp-dl/mmarco
	widget:
	- text: tỉnh nào có diện tích lớn nhất việt nam.
	  output:
	  - label: >-
	      nghệ an có diện tích lớn nhất việt nam
	    score: 0.9999
	  - label: >-
	      bắc ninh có diện tích nhỏ nhất việt nam
	    score: 0.3723
	---

	# Reranker

	* [Usage](#usage)
	    * [Using FlagEmbedding](#using-flagembedding)
	    * [Using Huggingface transformers](#using-huggingface-transformers)
	* [Fine-tune](#fine-tune)
	    * [Data Format](#data-format)

	Unlike an embedding model, a reranker takes a query and a document as input and directly outputs a relevance score
	instead of an embedding.
	You get the score by passing a query and a passage to the reranker together.
	The raw score is unbounded, and it can be mapped to a float value in [0, 1] with the sigmoid function.
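The sigmoid mapping can be sketched in plain Python. This is only an illustration of the formula, not the library's own code; `sigmoid` is a hypothetical helper, and the raw scores are the example values that appear in the usage section of this card.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw relevance score (logit) into the interval (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative inputs, use the equivalent form that avoids overflow in exp
    z = math.exp(x)
    return z / (1.0 + z)


# Raw scores taken from the usage examples in this card
raw_scores = [-8.1875, 5.26171875]
normalized = [sigmoid(s) for s in raw_scores]
print(normalized)
```

A strongly negative score lands near 0 and a strongly positive one near 1, so the ordering of passages is unchanged by the mapping.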

	## Usage

	### Using FlagEmbedding

	```
	pip install -U FlagEmbedding
	```

	Get relevance scores (higher scores indicate more relevance):

	```python
	from FlagEmbedding import FlagReranker

	# Setting use_fp16=True speeds up computation with a slight performance degradation
	reranker = FlagReranker('namdp/bge-reranker-vietnamese', use_fp16=True)

	# Score a single [query, passage] pair
	score = reranker.compute_score(['query', 'passage'])
	print(score)  # -5.65234375

	# Set normalize=True to map the score into [0, 1] by applying the sigmoid function
	score = reranker.compute_score(['query', 'passage'], normalize=True)
	print(score)  # 0.003497010252573502

	# Score several [query, passage] pairs in one call
	pairs = [
	    ['what is panda?', 'hi'],
	    ['what is panda?',
	     'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.'],
	]
	scores = reranker.compute_score(pairs)
	print(scores)  # [-8.1875, 5.26171875]

	# The same pairs with sigmoid normalization
	scores = reranker.compute_score(pairs, normalize=True)
	print(scores)  # [0.00027803096387751553, 0.9948403768236574]
	```

	### Using Huggingface transformers

	```
	pip install -U transformers
	```

	Get relevance scores (higher scores indicate more relevance):

	```python
	import torch
	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained('namdp/bge-reranker-vietnamese')
	model = AutoModelForSequenceClassification.from_pretrained('namdp/bge-reranker-vietnamese')
	model.eval()

	pairs = [
	    ['what is panda?', 'hi'],
	    ['what is panda?',
	     'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.'],
	]
	with torch.no_grad():
	    # Tokenize each [query, passage] pair as a single cross-encoder input
	    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
	    # The model returns one logit per pair; flatten to a 1-D tensor of scores
	    scores = model(**inputs, return_dict=True).logits.view(-1).float()
	    print(scores)
	```
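
Once scores are available, reranking amounts to sorting the passages by score. The following is a minimal sketch in plain Python, assuming the scores have already been converted to a list of floats (for a tensor, `scores.tolist()` would do this). `rank_passages` is a hypothetical helper, not part of transformers or FlagEmbedding, and the scores are the example values from the FlagEmbedding section used as stand-ins.

```python
def rank_passages(passages, scores):
    """Return (passage, score) tuples sorted from most to least relevant."""
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    return sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)


passages = [
    'hi',
    'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear '
    'or simply panda, is a bear species endemic to China.',
]
# Stand-in scores; in practice these come from the model
scores = [-8.1875, 5.26171875]

ranked = rank_passages(passages, scores)
for passage, score in ranked:
    # Print the score followed by the first 60 characters of the passage
    print(f"{score:8.4f}  {passage[:60]}")
```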

	## Fine-tune

	### Data Format

	Training data should be a JSON Lines file, where each line is a JSON object like this:

	```
	{"query": str, "pos": List[str], "neg": List[str]}
	```

	`query` is the query, `pos` is a list of positive texts, and `neg` is a list of negative texts. `prompt`, which does
	not appear in the schema above, indicates the relationship between the query and the texts. If you have no negative
	texts for a query, you can randomly sample some from the entire corpus as negatives.
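
As a rough illustration of this format, the sketch below builds one training record and samples its negatives at random from a corpus. `build_example`, the toy corpus, and the `num_negatives` and `seed` parameters are all assumptions made for the example; they are not part of any fine-tuning script.

```python
import json
import random


def build_example(query, positives, corpus, num_negatives=2, seed=0):
    """Build one training record, sampling negatives at random from the corpus."""
    rng = random.Random(seed)
    # Candidate negatives are corpus texts that are not positives for this query
    candidates = [text for text in corpus if text not in positives]
    negatives = rng.sample(candidates, min(num_negatives, len(candidates)))
    return {"query": query, "pos": list(positives), "neg": negatives}


# Toy corpus for illustration only
corpus = [
    'The giant panda is a bear species endemic to China.',
    'hi',
    'Hanoi is the capital of Vietnam.',
    'A reranker scores query-passage pairs.',
]

record = build_example(
    query='what is panda?',
    positives=['The giant panda is a bear species endemic to China.'],
    corpus=corpus,
)

# One JSON object per line; ensure_ascii=False keeps Vietnamese text readable
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Random negatives are easy to produce but are often trivially irrelevant; the record layout stays the same if harder negatives are substituted later.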