🍊 SapBERT-Ko-EN

1. Intro

SapBERT (Self-alignment pretraining for BERT) built on a Korean language model.
Korean and English terms were aligned using KOSTOM, a Korean-English medical terminology dictionary.
References: SapBERT, Original Code

2. SapBERT-KO-EN

SapBERT is a pretraining method that teaches a model to treat the many synonyms of a medical concept as having the same meaning.
SapBERT-KO-EN aligns Korean and English medical terms so that it can handle medical records written in a mix of Korean and English.
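
As a rough illustration of what aligning terms means here, the sketch below groups synonyms by a shared concept identifier and emits every within-concept pair as a positive training pair. The rows, the concept codes (`C001`, `C002`), and the helper name `build_positive_pairs` are hypothetical; KOSTOM's actual schema and the authors' preprocessing may differ. The `id||term||term` line layout is assumed from the pair format described in the original SapBERT repository.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (concept_id, term) rows for illustration; not taken from KOSTOM.
rows = [
    ("C001", "간경변"),
    ("C001", "간경화"),
    ("C001", "liver cirrhosis"),
    ("C002", "간암"),
    ("C002", "liver cancer"),
]

def build_positive_pairs(rows):
    """Group terms by concept and return one line per synonym pair."""
    terms_by_concept = defaultdict(list)
    for concept_id, term in rows:
        terms_by_concept[concept_id].append(term)

    lines = []
    for concept_id, terms in terms_by_concept.items():
        # Every pair of terms under the same concept is a positive pair,
        # whether the two terms share a language or not.
        for left, right in combinations(terms, 2):
            lines.append(f"{concept_id}||{left}||{right}")
    return lines

pairs = build_positive_pairs(rows)
for line in pairs:
    print(line)
```

Because Korean and English terms sit under the same concept, cross-lingual pairs such as `간경변` and `liver cirrhosis` come out of the same grouping step as monolingual ones.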

※ Detailed description and training code: Github

3. Training

The base model and hyperparameters used for training are as follows.

  • Model : klue/bert-base
  • Epochs : 1
  • Batch Size : 64
  • Max Length : 64
  • Dropout : 0.1
  • Pooler : 'cls'
  • Eval Step : 100
  • Threshold : 0.8
  • Scale Positive Sample : 1
  • Scale Negative Sample : 60
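
The last three entries read like the parameters of the multi-similarity loss that the original SapBERT uses for metric learning. Assuming that mapping (Threshold as the similarity margin, Scale Positive Sample as the positive scale, Scale Negative Sample as the negative scale), the loss for a single anchor term can be sketched in plain Python as below. The function name `multi_similarity_loss` and the toy similarity values are illustrative only, and the hard-pair mining step used in real training is omitted.

```python
import math

def multi_similarity_loss(pos_sims, neg_sims, threshold=0.8,
                          scale_pos=1.0, scale_neg=60.0):
    """Multi-similarity loss for one anchor (assumed hyperparameter mapping).

    pos_sims: cosine similarities between the anchor and its synonyms.
    neg_sims: cosine similarities between the anchor and non-synonyms.
    """
    # Positives are penalized when their similarity falls below the threshold.
    pos_term = math.log1p(
        sum(math.exp(-scale_pos * (s - threshold)) for s in pos_sims)
    ) / scale_pos
    # Negatives are penalized when their similarity rises above the threshold.
    neg_term = math.log1p(
        sum(math.exp(scale_neg * (s - threshold)) for s in neg_sims)
    ) / scale_neg
    return pos_term + neg_term

# Toy values: synonyms close and non-synonyms far, versus the reverse.
well_separated = multi_similarity_loss(pos_sims=[0.95, 0.90], neg_sims=[0.30, 0.20])
confused = multi_similarity_loss(pos_sims=[0.60, 0.55], neg_sims=[0.85, 0.82])
print(f"well separated: {well_separated:.4f}")
print(f"confused:       {confused:.4f}")
```

With a negative scale as large as 60, non-synonyms that score above the threshold are penalized steeply, while those well below it contribute almost nothing to the loss.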

SapBERT-KO-EN can be applied to a specific task through further fine-tuning.

※ English terms are mostly tokenized at the level of individual letters.
※ Terms that refer to the same disease receive relatively high similarity scores.

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

model_path = 'snumin44/sap-bert-ko-en'
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Query: a Korean term for liver cirrhosis.
query = '간경화'

# Candidate terms in English and Korean.
targets = [
    'liver cirrhosis',
    '간경변',  # liver cirrhosis
    'liver cancer',
    '간암',  # liver cancer
    'brain tumor',
    '뇌종양'  # brain tumor
]

def embed(text):
    # Encode one term and return its pooled output as a 1-D NumPy vector.
    features = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**features, return_dict=True)
    return outputs.pooler_output.numpy().squeeze()

def cos_sim(A, B):
    # Cosine similarity between two 1-D vectors.
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

query_embeddings = embed(query)

for idx, target in enumerate(targets):
    target_embeddings = embed(target)
    similarity = cos_sim(query_embeddings, target_embeddings)
    print(f"Similarity between query and target {idx}: {similarity:.4f}")

Output:
Similarity between query and target 0: 0.7145
Similarity between query and target 1: 0.7186
Similarity between query and target 2: 0.6183
Similarity between query and target 3: 0.6972
Similarity between query and target 4: 0.3929
Similarity between query and target 5: 0.4260
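
To make the pattern in these scores easier to read, the short sketch below pairs the reported similarities with their target terms and sorts them in descending order. The numbers are copied from the output above rather than recomputed, so no model download is needed; the variable names are illustrative.

```python
# Scores copied from the reported output, in the same order as the targets.
targets = ['liver cirrhosis', '간경변', 'liver cancer', '간암', 'brain tumor', '뇌종양']
scores = [0.7145, 0.7186, 0.6183, 0.6972, 0.3929, 0.4260]

# Rank targets from most to least similar to the query.
ranked = sorted(zip(targets, scores), key=lambda pair: pair[1], reverse=True)
for rank, (term, score) in enumerate(ranked, start=1):
    print(f"{rank}. {term}: {score:.4f}")
```

The two cirrhosis terms rank first, the liver cancer terms follow, and the brain tumor terms come last, so terms for the same disease score highest in both languages.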

Citing

@inproceedings{liu2021self,
    title={Self-Alignment Pretraining for Biomedical Entity Representations},
    author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
    booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
    pages={4228--4238},
    month = jun,
    year={2021}
}