
	---
	language:
	- ko
	- en
	- zh
	license: mit
	pipeline_tag: feature-extraction
	tags:
	- transformers
	- sentence-transformers
	- text-embeddings-inference
	---



	# upskyy/ko-reranker

	ko-reranker is a model obtained by fine-tuning [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) on [Korean data](https://huggingface.co/datasets/upskyy/ko-wiki-reranking).

	## Usage
	## Using FlagEmbedding
	```
	pip install -U FlagEmbedding
	```

	Get relevance scores (higher scores indicate more relevance):

	```python
	from FlagEmbedding import FlagReranker


	reranker = FlagReranker('upskyy/ko-reranker', use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation

	score = reranker.compute_score(['query', 'passage'])
	print(score) # -1.861328125

	# Set normalize=True to map the scores into the 0-1 range by applying a sigmoid function
	score = reranker.compute_score(['query', 'passage'], normalize=True)
	print(score) # 0.13454832326359276

	scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
	print(scores) # [-7.37109375, 8.5390625]

	# Set normalize=True to map the scores into the 0-1 range by applying a sigmoid function
	scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']], normalize=True)
	print(scores) # [0.0006287840192903181, 0.9998043646624727]
	```
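
The `normalize=True` option above is described as applying a sigmoid function to the raw score. The following is a minimal sketch of that mapping in plain Python, assuming the standard logistic sigmoid. The helper name `sigmoid` is illustrative and is not part of FlagEmbedding.

```python
import math


def sigmoid(score: float) -> float:
    """Map a raw relevance score (logit) into the (0, 1) range."""
    # Use the numerically stable form so exp() never overflows
    # for large-magnitude negative scores.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    exp_s = math.exp(score)
    return exp_s / (1.0 + exp_s)


# Raw scores copied from the example output above.
raw_scores = [-7.37109375, 8.5390625]
normalized = [sigmoid(s) for s in raw_scores]
print(normalized)
```

The printed values should agree closely with the normalized scores shown in the example above. The ordering of passages is the same before and after normalization, because the sigmoid is monotonic.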

	## Using Sentence-Transformers

	```
	pip install -U sentence-transformers
	```

	Compute pairwise similarities between sentence embeddings (higher values indicate greater similarity):

	```python
	from sentence_transformers import SentenceTransformer


	sentences_1 = ["경제 전문가가 금리 인하에 대한 예측을 하고 있다.", "주식 시장에서 한 투자자가 주식을 매수한다."]
	sentences_2 = ["한 투자자가 비트코인을 매수한다.", "금융 거래소에서 새로운 디지털 자산이 상장된다."]

	model = SentenceTransformer('upskyy/ko-reranker')

	embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
	embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
	similarity = embeddings_1 @ embeddings_2.T

	print(similarity)
	```
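
Because `normalize_embeddings=True` scales every embedding to unit length, the matrix product `embeddings_1 @ embeddings_2.T` in the example above equals the cosine similarity between each pair of sentences. The sketch below illustrates this with plain Python lists. The toy 3-dimensional vectors and the helper names `l2_normalize` and `similarity_matrix` are assumptions made for illustration, not model output or library functions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(rows_a, rows_b):
    """Dot product of every row in rows_a with every row in rows_b."""
    return [[sum(x * y for x, y in zip(a, b)) for b in rows_b] for a in rows_a]


# Toy vectors standing in for real sentence embeddings (illustrative values).
emb_1 = [l2_normalize(v) for v in [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]]
emb_2 = [l2_normalize(v) for v in [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]]

# For unit-length vectors, the dot product is the cosine similarity.
similarity = similarity_matrix(emb_1, emb_2)
print(similarity)
```

Each entry `similarity[i][j]` compares the i-th vector of the first set with the j-th vector of the second set, matching the layout of the matrix printed in the example above.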

	## Using Huggingface transformers

	Get relevance scores (higher scores indicate more relevance):


	```python
	import torch
	from transformers import AutoModelForSequenceClassification, AutoTokenizer


	tokenizer = AutoTokenizer.from_pretrained('upskyy/ko-reranker')
	model = AutoModelForSequenceClassification.from_pretrained('upskyy/ko-reranker')
	model.eval()

	pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]

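	# Disable gradient tracking, since this is inference only
	with torch.no_grad():
	    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
	    # One logit per (query, passage) pair, flattened to a 1-D tensor
	    scores = model(**inputs, return_dict=True).logits.view(-1).float()
	    print(scores)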
	```
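
Since higher scores indicate more relevance, a typical next step is to sort the candidate passages by score. The following is a minimal sketch of that step. The helper `rerank` and its `top_k` parameter are hypothetical names introduced here, and the scores are illustrative values taken from the FlagEmbedding example output earlier in this README, not output of the code above.

```python
def rerank(passages, scores, top_k=None):
    """Return (passage, score) pairs sorted from most to least relevant."""
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


passages = ["hi", "The giant panda is a bear species endemic to China."]
# Illustrative scores; with the transformers example, scores.tolist() could be used.
scores = [-7.37109375, 8.5390625]

for passage, score in rerank(passages, scores):
    print(f"{score:.4f}\t{passage}")
```

Passing `top_k=1` keeps only the best-scoring passage, which is a common way to trim a candidate list returned by a first-stage retriever.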



	## Citation

	```bibtex
	@misc{bge_embedding,
	title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
	author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
	year={2023},
	eprint={2309.07597},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}
	```

	## License

	FlagEmbedding is licensed under the MIT License. The released models can be used for commercial purposes free of charge.