SentenceTransformer based on NeuML/pubmedbert-base-embeddings

This is a sentence-transformers model fine-tuned from NeuML/pubmedbert-base-embeddings on the mimic10-hard-negatives dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: NeuML/pubmedbert-base-embeddings
Maximum Sequence Length: 64 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity
Training Dataset:
- mimic10-hard-negatives

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 64, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
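
The Pooling module above has only `pooling_mode_mean_tokens` enabled, so the sentence vector is the average of the token vectors. The following is a minimal pure-Python sketch of that idea, assuming padding positions are excluded through an attention mask; the function name `mean_pool` and the toy 2-dimensional vectors are illustrative and not taken from the library.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    count = max(count, 1)  # avoid division by zero when every position is padding
    return [t / count for t in total]


# Toy input: three token vectors of dimension 2; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

In the real model each token vector has 768 entries and at most 64 tokens contribute, per the settings listed above.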

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("alecocc/icd10-hard-negatives")
# Run inference
sentences = [
    'CAD',
    'Atherosclerotic heart disease of native coronary artery with unspecified angina pectoris',
    'Myopia, bilateral',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
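
Since the model's similarity function is cosine similarity, the scores can also be reproduced by hand from the raw embeddings. The sketch below shows the computation and a simple ranking step in pure Python. The helper names `cosine_similarity` and `rank_candidates` and the 2-dimensional vectors are hypothetical stand-ins; real embeddings from this model would have 768 entries.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Hypothetical stand-ins for the embeddings of a query and two candidate texts.
query_vec = [1.0, 0.0]
candidate_vecs = [[0.0, 1.0], [0.8, 0.6]]

ranking = rank_candidates(query_vec, candidate_vecs)
print(ranking)  # [1, 0]: the second candidate points in a closer direction
```

Applied to the example sentences, one would expect the abbreviation 'CAD' to rank the coronary artery description above 'Myopia, bilateral', although the actual scores depend on the model weights.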

Training Details

Training Dataset

mimic10-hard-negatives

Dataset: mimic10-hard-negatives at ef88fe5
Size: 473,546 training samples
Columns: anchor, positive, negative_1, negative_2, negative_3, negative_4, negative_5, negative_6, negative_7, negative_8, negative_9, and negative_10

Approximate statistics based on the first 1000 samples:

	anchor	positive	negative_1	negative_2	negative_3	negative_4	negative_5	negative_6	negative_7	negative_8	negative_9	negative_10
type	string	string	string	string	string	string	string	string	string	string	string	string
details	min: 3 tokens, mean: 4.53 tokens, max: 14 tokens	min: 3 tokens, mean: 9.67 tokens, max: 40 tokens	min: 3 tokens, mean: 10.19 tokens, max: 40 tokens	min: 3 tokens, mean: 10.49 tokens, max: 40 tokens	min: 3 tokens, mean: 10.8 tokens, max: 40 tokens	min: 3 tokens, mean: 11.1 tokens, max: 40 tokens	min: 3 tokens, mean: 11.64 tokens, max: 38 tokens	min: 3 tokens, mean: 15.14 tokens, max: 37 tokens	min: 3 tokens, mean: 15.58 tokens, max: 40 tokens	min: 4 tokens, mean: 15.1 tokens, max: 40 tokens	min: 3 tokens, mean: 14.96 tokens, max: 37 tokens	min: 3 tokens, mean: 15.35 tokens, max: 38 tokens

Samples:

anchor	positive	negative_1	negative_2	negative_3	negative_4	negative_5	negative_6	negative_7	negative_8	negative_9	negative_10
`Anterior exenteration`	`Malignant neoplasm of bladder neck`	`Malignant neoplasm of unspecified kidney, except renal pelvis`	`Malignant neoplasm of unspecified renal pelvis`	`Malignant neoplasm of left ureter`	`Malignant neoplasm of paraurethral glands`	`Malignant neoplasm of left renal pelvis`	`Unspecified kyphosis, cervical region`	`Unspecified superficial injuries of left back wall of thorax, initial encounter`	`Dome fracture of acetabulum`	`Other fracture of left great toe, initial encounter for open fracture`	`Unspecified fracture of upper end of unspecified radius, subsequent encounter for open fracture type IIIA, IIIB, or IIIC with malunion`
`Atorvastatin`	`Hyperlipidemia, unspecified`	`Other lactose intolerance`	`Lipomatosis, not elsewhere classified`	`Mucopolysaccharidosis, type II`	`Hyperuricemia without signs of inflammatory arthritis and tophaceous disease`	`Volume depletion, unspecified`	`Glaucoma secondary to other eye disorders, unspecified eye`	`Fracture of one rib, left side, subsequent encounter for fracture with routine healing`	`Toxic effect of other tobacco and nicotine, accidental (unintentional), sequela`	`Puncture wound without foreign body of left ring finger with damage to nail`	`Nondisplaced fracture of epiphysis (separation) (upper) of unspecified femur, subsequent encounter for open fracture type IIIA, IIIB, or IIIC with nonunion`
`Urostomy`	`Malignant neoplasm of bladder neck`	`Malignant neoplasm of urinary organ, unspecified`	`Malignant neoplasm of overlapping sites of urinary organs`	`Malignant neoplasm of left ureter`	`Malignant neoplasm of urethra`	`Malignant neoplasm of left renal pelvis`	`Indeterminate leprosy`	`Poisoning by other viral vaccines, accidental (unintentional)`	`Fracture of unspecified metatarsal bone(s), right foot, initial encounter for open fracture`	`Sprain of tarsometatarsal ligament of unspecified foot, subsequent encounter`	`Burn of first degree of multiple sites of left ankle and foot, initial encounter`

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
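
With these parameters, the loss can be read as a cross-entropy over scaled cosine similarities: each anchor should score its own positive higher than every other candidate in the batch. The following pure-Python sketch illustrates that reading. It assumes the candidate pool is all positives in the batch followed by all hard negatives; the function names and toy vectors are illustrative, and this is not the library's implementation.

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mnr_loss(anchors, positives, negatives=(), scale=20.0):
    """Mean cross-entropy where anchor i must pick positive i among all candidates."""
    candidates = list(positives) + list(negatives)
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidate scores.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy batch: two anchors, their positives, and one shared hard negative.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
hard_negatives = [[-1.0, 0.0]]

good = mnr_loss(anchors, positives, hard_negatives)
bad = mnr_loss(anchors, positives[::-1], hard_negatives)  # positives swapped
print(good < bad)  # True
```

If every column of the training rows is used, a batch of 128 rows with 10 negatives each would give every anchor 128 × 11 = 1,408 candidates to choose from, which is why the hard negatives matter as much as the in-batch ones.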

Training Hyperparameters

Non-Default Hyperparameters

per_device_train_batch_size: 128
per_device_eval_batch_size: 128
learning_rate: 2e-05
num_train_epochs: 2
warmup_ratio: 0.1
fp16: True
batch_sampler: no_duplicates
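
These values imply a concrete step count and learning-rate curve. The sketch below works through the arithmetic, assuming a single device and no gradient accumulation (the step counts in the training logs are consistent with that). The `linear_lr` helper is an illustrative restatement of a linear warmup followed by linear decay, not code from the training run.

```python
import math

num_samples = 473_546   # training set size reported in this card
batch_size = 128        # per_device_train_batch_size
epochs = 2              # num_train_epochs
warmup_ratio = 0.1
base_lr = 2e-05

steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_lr(step):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


print(steps_per_epoch, total_steps, warmup_steps)  # 3700 7400 740
```

Under these assumptions the learning rate peaks at 2e-05 at step 740 and reaches zero at step 7,400, the final step in the training logs.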

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: no
prediction_loss_only: True
per_device_train_batch_size: 128
per_device_eval_batch_size: 128
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 2e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 2
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: False
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional
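
The `batch_sampler: no_duplicates` setting matters for a loss that uses in-batch negatives: in the samples shown earlier, 'Anterior exenteration' and 'Urostomy' share the positive 'Malignant neoplasm of bladder neck', so placing both rows in one batch would present a correct match as a negative. The following is a simplified sketch of the idea, greedily filling batches so that no text appears twice. The function name `no_duplicate_batches` is hypothetical and the logic is not the library's sampler.

```python
def no_duplicate_batches(rows, batch_size):
    """Greedily build batches of row indices in which no text appears twice."""
    remaining = list(range(len(rows)))
    while remaining:
        batch, seen, leftover = [], set(), []
        for idx in remaining:
            texts = set(rows[idx])
            if len(batch) < batch_size and not (texts & seen):
                batch.append(idx)
                seen |= texts
            else:
                leftover.append(idx)  # try this row again in a later batch
        yield batch
        remaining = leftover


# (anchor, positive) pairs taken from the sample rows in this card.
rows = [
    ("Anterior exenteration", "Malignant neoplasm of bladder neck"),
    ("Atorvastatin", "Hyperlipidemia, unspecified"),
    ("Urostomy", "Malignant neoplasm of bladder neck"),
]

batches = list(no_duplicate_batches(rows, batch_size=3))
print(batches)  # [[0, 1], [2]]
```

Even with room for three rows, the third row is deferred to its own batch because its positive already appears in the first.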

Training Logs

Epoch	Step	Training Loss
0.0270	100	4.1948
0.0541	200	3.5402
0.0811	300	3.2462
0.1081	400	2.9691
0.1351	500	2.788
0.1622	600	2.5922
0.1892	700	2.5648
0.2162	800	2.4821
0.2432	900	2.47
0.2703	1000	2.3774
0.2973	1100	2.3415
0.3243	1200	2.2428
0.3514	1300	2.2794
0.3784	1400	2.2372
0.4054	1500	2.2265
0.4324	1600	2.2186
0.4595	1700	2.2074
0.4865	1800	2.159
0.5135	1900	2.1903
0.5405	2000	2.1328
0.5676	2100	2.0685
0.5946	2200	2.1249
0.6216	2300	2.1321
0.6486	2400	2.0725
0.6757	2500	2.0913
0.7027	2600	2.0192
0.7297	2700	2.036
0.7568	2800	1.9863
0.7838	2900	2.0411
0.8108	3000	1.9796
0.8378	3100	2.0102
0.8649	3200	1.8652
0.8919	3300	1.0192
0.9189	3400	0.9623
0.9459	3500	0.957
0.9730	3600	0.8579
1.0	3700	0.7984
1.0270	3800	0.6359
1.0541	3900	0.7348
1.0811	4000	0.6356
1.1081	4100	0.6252
1.1351	4200	0.6587
1.1622	4300	0.602
1.1892	4400	0.6803
1.2162	4500	0.6204
1.2432	4600	0.667
1.2703	4700	0.6253
1.2973	4800	0.5375
1.3243	4900	0.6054
1.3514	5000	0.4541
1.3784	5100	0.5334
1.4054	5200	0.6075
1.4324	5300	0.5037
1.4595	5400	0.4825
1.4865	5500	0.5442
1.5135	5600	0.4999
1.5405	5700	0.6521
1.5676	5800	0.5769
1.5946	5900	0.5029
1.6216	6000	0.5787
1.6486	6100	0.5235
1.6757	6200	0.5839
1.7027	6300	0.5339
1.7297	6400	0.5339
1.7568	6500	0.4515
1.7838	6600	0.5648
1.8108	6700	0.4355
1.8378	6800	0.5321
1.8649	6900	0.4778
1.8919	7000	0.4884
1.9189	7100	0.5941
1.9459	7200	0.5489
1.9730	7300	0.444
2.0	7400	0.4964

Framework Versions

Python: 3.10.12
Sentence Transformers: 3.2.1
Transformers: 4.45.2
PyTorch: 2.1.2+cu121
Accelerate: 0.29.0.dev0
Datasets: 2.18.0
Tokenizers: 0.20.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
