---
language:
- ko
- en
- zh
license: mit
pipeline_tag: feature-extraction
tags:
- transformers
- sentence-transformers
- text-embeddings-inference
---
# upskyy/ko-reranker
**ko-reranker** is a model obtained by fine-tuning [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) on [Korean data](https://huggingface.co/datasets/upskyy/ko-wiki-reranking).
## Usage
## Using FlagEmbedding
```
pip install -U FlagEmbedding
```
Get relevance scores (higher scores indicate more relevance):
```python
from FlagEmbedding import FlagReranker
reranker = FlagReranker('upskyy/ko-reranker', use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
score = reranker.compute_score(['query', 'passage'])
print(score) # -1.861328125
# You can map the score into the 0-1 range by setting "normalize=True", which applies a sigmoid function to the score
score = reranker.compute_score(['query', 'passage'], normalize=True)
print(score) # 0.13454832326359276
scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
print(scores) # [-7.37109375, 8.5390625]
# As above, "normalize=True" applies a sigmoid function to each score in the batch
scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']], normalize=True)
print(scores) # [0.0006287840192903181, 0.9998043646624727]
```
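The `normalize=True` mapping mentioned in the comments above is the logistic sigmoid applied to each raw score. As a minimal sketch using only the standard library (the `sigmoid` helper is a hypothetical name for illustration, not part of FlagEmbedding), the raw scores from the example can be mapped by hand:

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw relevance score (logit) into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Raw scores copied from the example output above.
raw_scores = [-7.37109375, 8.5390625]
normalized = [sigmoid(s) for s in raw_scores]
print(normalized)
```

The printed values should closely match the normalized scores shown in the example. The ordering of the passages is unchanged by the mapping, since the sigmoid is monotonically increasing.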
## Using Sentence-Transformers
```
pip install -U sentence-transformers
```
Compute pairwise similarity between two sets of sentences (higher values indicate more similar sentences):
```python
from sentence_transformers import SentenceTransformer
# "An economic expert is forecasting an interest rate cut.", "An investor buys stocks on the stock market."
sentences_1 = ["경제 전문가가 금리 인하에 대한 예측을 하고 있다.", "주식 시장에서 한 투자자가 주식을 매수한다."]
# "An investor buys Bitcoin.", "A new digital asset is listed on a financial exchange."
sentences_2 = ["한 투자자가 비트코인을 매수한다.", "금융 거래소에서 새로운 디지털 자산이 상장된다."]
model = SentenceTransformer('upskyy/ko-reranker')
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```
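Because `normalize_embeddings=True` returns unit-length vectors, the product `embeddings_1 @ embeddings_2.T` is a matrix of cosine similarities. A minimal sketch with made-up 2-dimensional vectors (toy values, not real model output; `l2_normalize`, `dot`, and `cosine` are hypothetical helpers) illustrates the equivalence:

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for two sentence embeddings.
a = [3.0, 4.0]
b = [4.0, 3.0]

via_normalized = dot(l2_normalize(a), l2_normalize(b))
via_cosine = cosine(a, b)
print(via_normalized, via_cosine)
```

Both values agree up to floating-point rounding, which is why no explicit division by the vector norms is needed after normalization.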
## Using Hugging Face Transformers
Get relevance scores (higher scores indicate more relevance):
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('upskyy/ko-reranker')
model = AutoModelForSequenceClassification.from_pretrained('upskyy/ko-reranker')
model.eval()
pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)
```
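Since higher scores indicate more relevance, reranking amounts to sorting the candidate passages by score in descending order. A minimal sketch, assuming the scores are already available as a plain list of floats (for a PyTorch tensor, `scores.tolist()` produces one); `rerank` is a hypothetical helper, and the score values are the illustrative ones from the FlagEmbedding example:

```python
def rerank(passages, scores):
    """Return (passage, score) pairs sorted from most to least relevant."""
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    return sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)

passages = [
    'hi',
    'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear '
    'or simply panda, is a bear species endemic to China.',
]
scores = [-7.37109375, 8.5390625]  # illustrative values, not recomputed here

ranked = rerank(passages, scores)
for passage, score in ranked:
    print(f"{score:.4f}\t{passage}")
```

The same helper works on raw or normalized scores, because normalization does not change their order.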
## Citation
```bibtex
@misc{bge_embedding,
title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
year={2023},
eprint={2309.07597},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## License
FlagEmbedding is licensed under the MIT License. The released models can be used for commercial purposes free of charge.