---
language:
- ko
- en
- zh
license: mit
pipeline_tag: feature-extraction
tags:
- transformers
- sentence-transformers
- text-embeddings-inference
---

# upskyy/ko-reranker

**ko-reranker** is a model obtained by fine-tuning [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) on [Korean data](https://huggingface.co/datasets/upskyy/ko-wiki-reranking).

## Usage

## Using FlagEmbedding

```
pip install -U FlagEmbedding
```

Get relevance scores (higher scores indicate more relevance):

```python
from FlagEmbedding import FlagReranker

# Setting use_fp16=True speeds up computation at the cost of a slight drop in accuracy
reranker = FlagReranker('upskyy/ko-reranker', use_fp16=True)

# Score a single [query, passage] pair
score = reranker.compute_score(['query', 'passage'])
print(score) # -1.861328125

# Pass normalize=True to map the score into the 0-1 range by applying a sigmoid function
score = reranker.compute_score(['query', 'passage'], normalize=True)
print(score) # 0.13454832326359276

# Score several [query, passage] pairs in one call
scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
print(scores) # [-7.37109375, 8.5390625]

# normalize=True applies the same sigmoid mapping to every score in the batch
scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']], normalize=True)
print(scores) # [0.0006287840192903181, 0.9998043646624727]
```
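
The comments above say that `normalize=True` applies a sigmoid function to the raw score. As a quick sanity check, the sketch below applies a plain-Python sigmoid to the raw scores quoted in those comments; the results should agree with the normalized values shown there. The helper name `sigmoid` is illustrative and is not part of FlagEmbedding.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw relevance logit into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


# Raw scores copied from the FlagEmbedding example output.
raw_scores = [-1.861328125, -7.37109375, 8.5390625]

normalized = [sigmoid(s) for s in raw_scores]
for raw, norm in zip(raw_scores, normalized):
    print(f"{raw:>14} -> {norm:.6f}")
```

Because the sigmoid is monotonic, normalizing changes the scale of the scores but not the ranking of the passages.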

## Using Sentence-Transformers

```
pip install -U sentence-transformers
```

Get relevance scores (higher scores indicate more relevance):

```python
from sentence_transformers import SentenceTransformer

# "An economist is forecasting an interest rate cut." / "An investor buys stocks on the stock market."
sentences_1 = ["경제 전문가가 금리 인하에 대한 예측을 하고 있다.", "주식 시장에서 한 투자자가 주식을 매수한다."]
# "An investor buys Bitcoin." / "A new digital asset is listed on a financial exchange."
sentences_2 = ["한 투자자가 비트코인을 매수한다.", "금융 거래소에서 새로운 디지털 자산이 상장된다."]

model = SentenceTransformer('upskyy/ko-reranker')

# L2-normalize the embeddings so that the matrix product below yields cosine similarities
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T

print(similarity)
```
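
The snippet above passes `normalize_embeddings=True` before taking the matrix product. The reason, sketched here with toy 2-D vectors rather than real model output, is that the dot product of L2-normalized vectors equals their cosine similarity. The helpers `l2_normalize` and `dot` are illustrative names, not part of sentence-transformers.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


# Toy vectors standing in for two embeddings.
a = [3.0, 4.0]
b = [4.0, 3.0]

# Cosine similarity computed directly from the definition.
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
# Dot product after normalizing each vector first.
normalized_dot = dot(l2_normalize(a), l2_normalize(b))

print(cosine, normalized_dot)
```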

## Using Hugging Face Transformers

Get relevance scores (higher scores indicate more relevance):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('upskyy/ko-reranker')
model = AutoModelForSequenceClassification.from_pretrained('upskyy/ko-reranker')
model.eval()  # switch to inference mode (disables dropout)

# Each pair is [query, passage]
pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]

with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    # The model returns one logit per pair; flatten to a 1-D tensor of relevance scores
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)
```
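
Relevance scores like these are typically used to reorder candidate passages for a query. The sketch below shows only that final sorting step. It assumes the scores have already been computed and, rather than calling the model, reuses the two values quoted in the FlagEmbedding example for the same pairs. The helper name `rerank` is hypothetical.

```python
def rerank(passages, scores):
    """Return (passage, score) pairs sorted from most to least relevant."""
    return sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)


passages = [
    "hi",
    "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear "
    "or simply panda, is a bear species endemic to China.",
]
# Illustrative scores copied from the FlagEmbedding example output, not computed here.
scores = [-7.37109375, 8.5390625]

ranked = rerank(passages, scores)
for passage, score in ranked:
    print(f"{score:8.3f}  {passage[:40]}")
```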

## Citation

```bibtex
@misc{bge_embedding,
      title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
      author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
      year={2023},
      eprint={2309.07597},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## License

FlagEmbedding is licensed under the MIT License. The released models can be used for commercial purposes free of charge.