SapBERT-Ko-EN
1. Intro
This is SapBERT (Self-alignment pretraining for BERT) built on a Korean language model.
Korean and English terms were aligned using KOSTOM, a Korean-English medical terminology dictionary.
Reference: SapBERT, Original Code
2. SapBERT-KO-EN
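SapBERT is a pretraining method for treating the many synonyms of a medical term as having the same meaning.
SapBERT-KO-EN aligns Korean and English medical terms so that it can process medical records written in a mix of Korean and English.
※ Detailed description and training code: Github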
3. Training
The base model and hyperparameters used for training are as follows.
- Model : klue/bert-base
- Epochs : 1
- Batch Size : 64
- Max Length : 64
- Dropout : 0.1
- Pooler : 'cls'
- Eval Step : 100
- Threshold : 0.8
- Scale Positive Sample : 1
- Scale Negative Sample : 60
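The names Threshold, Scale Positive Sample and Scale Negative Sample resemble the parameters of the Multi-Similarity loss used in the SapBERT paper. Assuming that mapping holds (the model card does not confirm it), the objective could be sketched in NumPy as below. The function name `multi_similarity_loss`, its argument names, and the toy embeddings are hypothetical, and the online hard-pair mining used in the paper is left out.

```python
import numpy as np

def multi_similarity_loss(embeddings, labels,
                          threshold=0.8, scale_pos=1.0, scale_neg=60.0):
    """Simplified Multi-Similarity loss without pair mining (illustrative only)."""
    # Cosine similarity matrix between all pairs in the batch
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T

    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    pos_mask = same & ~np.eye(len(labels), dtype=bool)  # same concept, not self
    neg_mask = ~same                                    # different concept

    # Positives are penalised when their similarity falls below the threshold,
    # negatives when their similarity rises above it.
    pos_term = np.log1p(
        np.sum(np.exp(-scale_pos * (sim - threshold)) * pos_mask, axis=1)
    ) / scale_pos
    neg_term = np.log1p(
        np.sum(np.exp(scale_neg * (sim - threshold)) * neg_mask, axis=1)
    ) / scale_neg
    return float(np.mean(pos_term + neg_term))

# Toy batch: two tight clusters of two vectors each (made-up values)
toy_embeddings = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
aligned_loss = multi_similarity_loss(toy_embeddings, [0, 0, 1, 1])
misaligned_loss = multi_similarity_loss(toy_embeddings, [0, 1, 0, 1])
print(aligned_loss, misaligned_loss)
```

With labels that match the clusters the loss is lower than with labels that cut across them, which is the behaviour a synonym-alignment objective is meant to reward.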
SapBERT-KO-EN can be applied to a specific task by fine-tuning it afterwards.
※ English terms are mostly processed at the alphabet (character) level.
※ Terms that refer to the same disease are given relatively high similarity scores.
import numpy as np
from transformers import AutoModel, AutoTokenizer
model_path = 'snumin44/sap-bert-ko-en'
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
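
# Korean query term meaning "liver cirrhosis"
query = '간경화'

# Candidate terms in Korean and English
targets = [
    'liver cirrhosis',
    '간경변',        # liver cirrhosis (synonym of the query)
    'liver cancer',
    '간암',          # liver cancer
    'brain tumor',
    '뇌종양'         # brain tumor
]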

# Embed the query using the model's pooled [CLS] output
query_feature = tokenizer(query, return_tensors='pt')
query_outputs = model(**query_feature, return_dict=True)
query_embeddings = query_outputs.pooler_output.detach().numpy().squeeze()

def cos_sim(A, B):
    """Cosine similarity between two 1-D vectors."""
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

# Embed each target and compare it with the query
for idx, target in enumerate(targets):
    target_feature = tokenizer(target, return_tensors='pt')
    target_outputs = model(**target_feature, return_dict=True)
    target_embeddings = target_outputs.pooler_output.detach().numpy().squeeze()
    similarity = cos_sim(query_embeddings, target_embeddings)
    print(f"Similarity between query and target {idx}: {similarity:.4f}")

Similarity between query and target 0: 0.7145
Similarity between query and target 1: 0.7186
Similarity between query and target 2: 0.6183
Similarity between query and target 3: 0.6972
Similarity between query and target 4: 0.3929
Similarity between query and target 5: 0.4260
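
For term matching, the scores would usually be turned into a ranking. The following is a minimal sketch that reuses the target list and the similarity values printed above; the variable names `scores` and `ranked` are illustrative and not part of the original example.

```python
import numpy as np

targets = [
    'liver cirrhosis',
    '간경변',
    'liver cancer',
    '간암',
    'brain tumor',
    '뇌종양',
]

# Similarity values copied from the output above (query: '간경화')
scores = np.array([0.7145, 0.7186, 0.6183, 0.6972, 0.3929, 0.4260])

# Sort target indices from most to least similar
order = np.argsort(-scores)
ranked = [(targets[i], float(scores[i])) for i in order]

for term, score in ranked:
    print(f"{score:.4f}  {term}")
```

Both cirrhosis terms, Korean and English, rank above the liver cancer and brain tumor terms, in line with the note that terms referring to the same disease score relatively high.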
Citing
@inproceedings{liu2021self,
title={Self-Alignment Pretraining for Biomedical Entity Representations},
author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
pages={4228--4238},
month = jun,
year={2021}
}