Sentence Transformers

This model is a fork of sentence-transformers/all-MiniLM-L6-v2, chosen because its training data and intended use are close to our target dataset and use case. For more details, see the repository of the original pre-trained weights.

Fine-tuning

  • Fine-tune the model using a contrastive objective.
  • Compute the cosine similarity for every possible sentence pair in the batch.
  • Apply a cross-entropy loss that compares these similarities against the true pairs.
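
The steps above can be sketched in plain Python. This is a minimal illustration and not the code from train_script.py: the function names, the toy embeddings, and the `scale` multiplier are assumptions introduced here, and the loss is computed in one direction only (anchor to candidate).

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over the batch; the true pair for anchors[i] is positives[i].

    `scale` is a hypothetical multiplier on the similarities, not a value
    taken from this document.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        # Similarity of this anchor to every candidate sentence in the batch.
        logits = [scale * cosine_similarity(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the candidates.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the true pair (index i) as the target class.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D embeddings, made up for illustration.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]    # each anchor sits next to its own positive
shuffled = [[0.1, 0.9], [0.9, 0.1]]   # positives swapped between anchors

loss_aligned = in_batch_contrastive_loss(anchors, aligned)
loss_shuffled = in_batch_contrastive_loss(anchors, shuffled)
print(loss_aligned, loss_shuffled)
```

Every other sentence in the batch acts as a negative for a given anchor, so a larger batch gives each anchor more negatives to be told apart from.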

Hyperparameters

  • The model was trained for 100k steps with a batch size of 1024 (128 per TPU core).
  • A learning rate warm-up of 500 steps was used.
  • The sequence length was limited to 128 tokens.
  • The AdamW optimizer was used with a learning rate of 2e-5.
  • The full training script is available in this repository: train_script.py.
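
To make these numbers concrete, here is a small sketch of the schedule they imply. The step count, warm-up length, peak learning rate, and batch sizes come from the list above. Two things are assumptions: that the warm-up of 500 is counted in steps, and that the learning rate decays linearly to zero after warm-up. Consult train_script.py for the actual scheduler.

```python
TOTAL_STEPS = 100_000    # from the text
WARMUP_STEPS = 500       # from the text (assumed to be counted in steps)
PEAK_LR = 2e-5           # from the text
GLOBAL_BATCH = 1024      # from the text
PER_CORE_BATCH = 128     # from the text

def learning_rate(step):
    """Linear warm-up to PEAK_LR, then linear decay to zero.

    The decay shape is an assumption made for illustration.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)

# Number of TPU cores implied by the global and per-core batch sizes.
num_cores = GLOBAL_BATCH // PER_CORE_BATCH
# Training examples consumed over the whole run.
examples_seen = TOTAL_STEPS * GLOBAL_BATCH

for step in (0, 250, 500, 50_000, 100_000):
    print(step, learning_rate(step))
print(num_cores, examples_seen)
```

The batch arithmetic implies 8 TPU cores and roughly 102 million training examples over the full run.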

Performance

| Model Name | Performance Sentence Embeddings (14 Datasets) | Performance Semantic Search (6 Datasets) | Avg. Performance | Speed | Model Size |
|---|---|---|---|---|---|
| all-mpnet-base-v2 | 69.57 | 57.02 | 63.30 | 2800 | 420 MB |
| multi-qa-mpnet-base-dot-v1 | 66.76 | 57.60 | 62.18 | 2800 | 420 MB |
| all-distilroberta-v1 | 68.73 | 50.94 | 59.84 | 4000 | 290 MB |
| all-MiniLM-L12-v2 | 68.70 | 50.82 | 59.76 | 7500 | 120 MB |
| multi-qa-distilbert-cos-v1 | 65.98 | 52.83 | 59.41 | 4000 | 250 MB |
| all-MiniLM-L6-v2 (This model) | 68.06 | 49.54 | 58.80 | 14200 | 80 MB |
| multi-qa-MiniLM-L6-cos-v1 | 64.33 | 51.83 | 58.08 | 14200 | 80 MB |
| paraphrase-multilingual-mpnet-base-v2 | 65.83 | 41.68 | 53.75 | 2500 | 970 MB |
| paraphrase-albert-small-v2 | 64.46 | 40.04 | 52.25 | 5000 | 43 MB |
| paraphrase-multilingual-MiniLM-L12-v2 | 64.25 | 39.19 | 51.72 | 7500 | 420 MB |
| paraphrase-MiniLM-L3-v2 | 62.29 | 39.19 | 50.74 | 19000 | 61 MB |
| distiluse-base-multilingual-cased-v1 | 61.30 | 29.87 | 45.59 | 4000 | 480 MB |
| distiluse-base-multilingual-cased-v2 | 60.18 | 27.35 | 43.77 | 4000 | 480 MB |
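
The Avg. Performance column appears to be the plain mean of the sentence-embedding and semantic-search scores. The sketch below checks this for three rows using the figures listed above. The helper name `average_score` and the half-up rounding are assumptions made for illustration; the published averages were presumably computed from unrounded scores, so another row may differ in the last digit (for paraphrase-multilingual-mpnet-base-v2, the mean of the listed scores is 53.755 while the table reports 53.75).

```python
from decimal import Decimal, ROUND_HALF_UP

def average_score(embedding_score, search_score):
    """Mean of the two scores, rounded half-up to two decimals."""
    mean = (Decimal(embedding_score) + Decimal(search_score)) / 2
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# (sentence embeddings, semantic search, reported average), copied from the table.
rows = {
    "all-mpnet-base-v2": ("69.57", "57.02", "63.30"),
    "all-MiniLM-L6-v2": ("68.06", "49.54", "58.80"),
    "paraphrase-MiniLM-L3-v2": ("62.29", "39.19", "50.74"),
}

for name, (emb, search, reported) in rows.items():
    print(name, average_score(emb, search), reported)
```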

Datasets

| Dataset | Paper | Number of training tuples |
|---|---|---|
| Reddit comments (2015-2018) | paper | 726,484,430 |
| S2ORC Citation pairs (Abstracts) | paper | 116,288,806 |
| WikiAnswers Duplicate question pairs | paper | 77,427,422 |
| PAQ (Question, Answer) pairs | paper | 64,371,441 |
| S2ORC Citation pairs (Titles) | paper | 52,603,982 |
| S2ORC (Title, Abstract) | paper | 41,769,185 |
| Stack Exchange (Title, Body) pairs | - | 25,316,456 |
| Stack Exchange (Title+Body, Answer) pairs | - | 21,396,559 |
| Stack Exchange (Title, Answer) pairs | - | 21,396,559 |
| MS MARCO triplets | paper | 9,144,553 |
| GOOAQ: Open Question Answering with Diverse Answer Types | paper | 3,012,496 |
| Yahoo Answers (Title, Answer) | paper | 1,198,260 |
| Code Search | - | 1,151,414 |
| COCO Image captions | paper | 828,395 |
| SPECTER citation triplets | paper | 684,100 |
| Yahoo Answers (Question, Answer) | paper | 681,164 |
| Yahoo Answers (Title, Question) | paper | 659,896 |
| SearchQA | paper | 582,261 |
| Eli5 | paper | 325,475 |
| Flickr 30k | paper | 317,695 |
| Stack Exchange Duplicate questions (titles) | | 304,525 |
| AllNLI (SNLI and MultiNLI) | paper SNLI, paper MultiNLI | 277,230 |
| Stack Exchange Duplicate questions (bodies) | | 250,519 |
| Stack Exchange Duplicate questions (titles+bodies) | | 250,460 |
| Sentence Compression | paper | 180,000 |
| Wikihow | paper | 128,542 |
| Altlex | paper | 112,696 |
| Quora Question Triplets | - | 103,663 |
| Simple Wikipedia | paper | 102,225 |
| Natural Questions (NQ) | paper | 100,231 |
| SQuAD2.0 | paper | 87,599 |
| TriviaQA | - | 73,346 |
| Total | | 1,170,060,424 |
