---
license: mit
language:
- ko
base_model:
- klue/bert-base
pipeline_tag: feature-extraction
tags:
- medical
---

# 🍊 Korean Medical DPR (Dense Passage Retrieval)

## 1. Intro
A Bi-Encoder retrieval model for the **medical domain**.
To handle medical records written in a mix of Korean and English, it uses **SapBERT-KO-EN** as its base model.
Questions are encoded with the Question Encoder, and passages with the Context Encoder.

- Question Encoder : [https://huggingface.co/snumin44/medical-biencoder-ko-bert-question](https://huggingface.co/snumin44/medical-biencoder-ko-bert-question)

(※ This model was trained on the [초거대 AI 헬스케어 질의 응답 데이터](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=71762) (hyperscale AI healthcare question-answering data) from AI Hub.)

## 2. Model

**(1) Self Alignment Pretraining (SAP)**

Korean medical records are written in a **mix of Korean and English**, so the model must also recognize English terms.
Using Multi Similarity Loss, the model was trained so that **terms sharing the same code** have high similarity.
For example, the Korean term 고혈압 질환 (hypertensive disease) and its English synonyms share one code:

```
Ex) C3843080 || 고혈압 질환
    C3843080 || Hypertension
    C3843080 || High Blood Pressure
    C3843080 || HTN
    C3843080 || HBP
```

- SapBERT-KO-EN : [https://huggingface.co/snumin44/sap-bert-ko-en](https://huggingface.co/snumin44/sap-bert-ko-en)
- GitHub : [https://github.com/snumin44/SapBERT-KO-EN](https://github.com/millet04/SapBERT-KO-EN)

**(2) Dense Passage Retrieval (DPR)**

Turning SapBERT-KO-EN into a retrieval model requires additional fine-tuning.
It was fine-tuned with the DPR approach, in which a Bi-Encoder computes the similarity between a query and a passage.
The training set is the original dataset **augmented with mixed Korean-English samples**, built as follows:

```
Ex) Korean disease name: 고혈압
    English disease name: Hypertension
    Query (original): 아버지가 고혈압인데 그게 뭔지 모르겠어. 고혈압이 뭔지 설명좀 해줘.
    Query (augmented): 아버지가 Hypertension 인데 그게 뭔지 모르겠어. Hypertension 이 뭔지 설명좀 해줘.
```

(The original query means: "My father has high blood pressure, but I don't know what that is. Please explain what high blood pressure is.")

- GitHub : [https://github.com/millet04/DPR-KO](https://github.com/millet04/DPR-KO)

## 3. Training

**(1) Self Alignment Pretraining (SAP)**

The base model and hyperparameters used to train SapBERT-KO-EN are listed below.
**KOSTOM**, a medical terminology dictionary that contains both Korean and English medical terms, was used as the training data.

- Model : klue/bert-base
- Dataset : **KOSTOM**
- Epochs : 1
- Batch Size : 64
- Max Length : 64
- Dropout : 0.1
- Pooler : 'cls'
- Eval Step : 100
- Threshold : 0.8
- Scale Positive Sample : 1
- Scale Negative Sample : 60

**(2) Dense Passage Retrieval (DPR)**

The base model and hyperparameters used for fine-tuning are listed below.

- Model : SapBERT-KO-EN(klue/bert-base)
- Dataset : **초거대 AI 헬스케어 질의 응답 데이터 (AI Hub)**
- Epochs : 10
- Batch Size : 64
- Dropout : 0.1
- Pooler : 'cls'

## 4. Example

This model encodes contexts (passages) and must be used together with the Question model.
The example confirms that a question and a passage about the same disease receive a high similarity score.

(※ The `targets` passages in the example code are medical texts generated with ChatGPT.)
(※ Because of the nature of the training data, the model works better on text that is more refined than these examples.)
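The query augmentation described in section 2 (2) can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the preprocessing code actually used for this model: the helper `augment_query` and the one-entry table `KO_TO_EN` are hypothetical, and the real pipeline may select and replace terms differently.

```python
# Hypothetical Korean -> English disease-name table (illustration only).
KO_TO_EN = {"고혈압": "Hypertension"}


def augment_query(query: str, ko_name: str, en_name: str) -> str:
    """Return a mixed Korean-English copy of `query`.

    Every occurrence of the Korean disease name is replaced by the English
    name plus a space, matching the spacing of the augmented query shown
    in section 2.
    """
    return query.replace(ko_name, en_name + " ")


original = "아버지가 고혈압인데 그게 뭔지 모르겠어. 고혈압이 뭔지 설명좀 해줘."

# Build one augmented variant per disease name that occurs in the query.
augmented_samples = [
    augment_query(original, ko, en)
    for ko, en in KO_TO_EN.items()
    if ko in original
]

# The augmented set keeps the original query and adds the mixed variants.
training_queries = [original] + augmented_samples
for q in training_queries:
    print(q)
```

Keeping the original query next to its augmented copy lets the encoder see the same question with both the Korean and the English disease name.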
```python import numpy as np from transformers import AutoModel, AutoTokenizer # Question Model q_model_path = 'snumin44/medical-biencoder-ko-bert-question' q_model = AutoModel.from_pretrained(q_model_path) q_tokenizer = AutoTokenizer.from_pretrained(q_model_path) # Context Model c_model_path = 'snumin44/medical-biencoder-ko-bert-context' c_model = AutoModel.from_pretrained(c_model_path) c_tokenizer = AutoTokenizer.from_pretrained(c_model_path) query = 'high blood pressure ์ฒ˜๋ฐฉ ์‚ฌ๋ก€' targets = [ """๊ณ ํ˜ˆ์•• ์ง„๋‹จ. ํ™˜์ž ์ƒ๋‹ด ๋ฐ ์ƒํ™œ์Šต๊ด€ ๊ต์ • ๊ถŒ๊ณ . ์ €์—ผ์‹, ๊ทœ์น™์ ์ธ ์šด๋™, ๊ธˆ์—ฐ, ๊ธˆ์ฃผ ์ง€์‹œ. ํ™˜์ž ์žฌ๋ฐฉ๋ฌธ. ํ˜ˆ์••: 150/95mmHg. ์•ฝ๋ฌผ์น˜๋ฃŒ ์‹œ์ž‘. Amlodipine 5mg 1์ผ 1ํšŒ ์ฒ˜๋ฐฉ.""", """์‘๊ธ‰์‹ค ๋„์ฐฉ ํ›„ ์œ„ ๋‚ด์‹œ๊ฒฝ ์ง„ํ–‰. ์†Œ๊ฒฌ: Gastric ulcer์—์„œ Forrest IIb ๊ด€์ฐฐ๋จ. ์ถœํ˜ˆ์€ ์†Œ๋Ÿ‰์˜ ์‚ผ์ถœ์„ฑ ์ถœํ˜ˆ ํ˜•ํƒœ. ์ฒ˜์น˜: ์—ํ”ผ๋„คํ”„๋ฆฐ ์ฃผ์‚ฌ๋กœ ์ถœํ˜ˆ ๊ฐ์†Œ ํ™•์ธ. Hemoclip 2๊ฐœ๋กœ ์ถœํ˜ˆ ๋ถ€์œ„ ํด๋ฆฌํ•‘ํ•˜์—ฌ ์ง€ํ˜ˆ ์™„๋ฃŒ.""", """ํ˜ˆ์ค‘ ๋†’์€ ์ง€๋ฐฉ ์ˆ˜์น˜ ๋ฐ ์ง€๋ฐฉ๊ฐ„ ์†Œ๊ฒฌ. ๋‹ค๋ฐœ์„ฑ gallstones ํ™•์ธ. ์ฆ์ƒ ์—†์„ ๊ฒฝ์šฐ ๊ฒฝ๊ณผ ๊ด€์ฐฐ ๊ถŒ์žฅ. 
์šฐ์ธก renal cyst, ์–‘์„ฑ ๊ฐ€๋Šฅ์„ฑ ๋†’์œผ๋ฉฐ ์ถ”๊ฐ€์ ์ธ ์ฒ˜์น˜ ๋ถˆํ•„์š” ํ•จ.""" ] query_feature = q_tokenizer(query, return_tensors='pt') query_outputs = q_model(**query_feature, return_dict=True) query_embeddings = query_outputs.pooler_output.detach().numpy().squeeze() def cos_sim(A, B): return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) for idx, target in enumerate(targets): target_feature = c_tokenizer(target, return_tensors='pt') target_outputs = c_model(**target_feature, return_dict=True) target_embeddings = target_outputs.pooler_output.detach().numpy().squeeze() similarity = cos_sim(query_embeddings, target_embeddings) print(f"Similarity between query and target {idx}: {similarity:.4f}") ``` ``` Similarity between query and target 0: 0.2674 Similarity between query and target 1: 0.0416 Similarity between query and target 2: 0.0476 ``` ## Citing ``` @inproceedings{liu2021self, title={Self-Alignment Pretraining for Biomedical Entity Representations}, author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel}, booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies}, pages={4228--4238}, month = jun, year={2021} } @article{karpukhin2020dense, title={Dense Passage Retrieval for Open-Domain Question Answering}, author={Vladimir Karpukhin, Barlas OฤŸuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih}, journal={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)}, year={2020} } ```
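As a supplement to the example in section 4: when many passages are involved, the context embeddings are typically computed once and then ranked for each incoming query. The following is a minimal sketch of that ranking step only. The helper `rank_passages` is hypothetical, and the vectors are toy values standing in for `pooler_output` embeddings, not real model outputs.

```python
import math
from typing import List, Sequence, Tuple


def cos_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_passages(
    query_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (passage index, similarity) pairs, most similar first."""
    scored = [(i, cos_sim(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors (assumed values, for illustration only).
query_vec = [1.0, 0.0, 0.0]
passage_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # 45 degrees away from the query
]

ranking = rank_passages(query_vec, passage_vecs, top_k=2)
for idx, score in ranking:
    print(f"passage {idx}: {score:.4f}")
```

Because cosine similarity ignores vector length, the passage pointing in the same direction as the query ranks first even though its norm differs.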