metadata
language: []
library_name: sentence-transformers
tags:
  - mteb
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - dataset_size:100K<n<1M
  - loss:AnglELoss
base_model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
widget:
  - source_sentence: 有些人在路上溜达。
    sentences:
      - Folk går
      - Otururken gitar çalan adam.
      - ארה"ב קבעה שסוריה השתמשה בנשק כימי
  - source_sentence: 緬甸以前稱為緬甸。
    sentences:
      - 缅甸以前叫缅甸。
      - This is very contradictory.
      -  남자가 아기를 안고 의자에 앉아 잠들어 있다.
  - source_sentence: אדם כותב.
    sentences:
      - האדם כותב.
      - questa non è una risposta.
      - 7 שוטרים נהרגו ו-4 שוטרים נפצעו.
  - source_sentence: הם מפחדים.
    sentences:
      - liên quan đến rủi ro đáng kể;
      - A man is playing a guitar.
      - A man is playing a piano.
  - source_sentence: 一个女人正在洗澡。
    sentences:
      - A woman is taking a bath.
      - En jente børster håret sitt
      - אדם מחלק תפוח אדמה.
pipeline_tag: sentence-similarity

State-of-the-Art Results Comparison (MTEB STS Multilingual Leaderboard)

| Dataset | State-of-the-art (Multi) | STSb-XLM-RoBERTa-base | STS Multilingual MPNet base v2 |
| --- | --- | --- | --- |
| Average | 73.17 | 71.68 | **73.89** |
| STS17 (ar-ar) | **81.87** | 80.43 | 81.24 |
| STS17 (en-ar) | **81.22** | 76.3 | 77.03 |
| STS17 (en-de) | 87.3 | 91.06 | **91.09** |
| STS17 (en-tr) | 77.18 | **80.74** | 79.87 |
| STS17 (es-en) | **88.24** | 83.09 | 85.53 |
| STS17 (es-es) | **88.25** | 84.16 | 87.27 |
| STS17 (fr-en) | 88.06 | **91.33** | 90.68 |
| STS17 (it-en) | 89.68 | **92.87** | 92.47 |
| STS17 (ko-ko) | 83.69 | **97.67** | 97.66 |
| STS17 (nl-en) | 88.25 | **92.13** | 91.15 |
| STS22 (ar) | 58.67 | 58.67 | **62.66** |
| STS22 (de) | **60.12** | 52.17 | 57.74 |
| STS22 (de-en) | **60.92** | 58.5 | 57.5 |
| STS22 (de-fr) | **67.79** | 51.28 | 57.99 |
| STS22 (de-pl) | **58.69** | 44.56 | 44.22 |
| STS22 (es) | **68.57** | 63.68 | 66.21 |
| STS22 (es-en) | **78.8** | 70.65 | 75.18 |
| STS22 (es-it) | **75.04** | 60.88 | 66.25 |
| STS22 (fr) | **83.75** | 76.46 | 78.76 |
| STS22 (fr-pl) | **84.52** | **84.52** | **84.52** |
| STS22 (it) | **79.28** | 66.73 | 68.47 |
| STS22 (pl) | 42.08 | 41.18 | **43.36** |
| STS22 (pl-en) | **77.5** | 64.35 | 75.11 |
| STS22 (ru) | **61.71** | 58.59 | 58.67 |
| STS22 (tr) | **68.72** | 57.52 | 63.84 |
| STS22 (zh-en) | **71.88** | 60.69 | 65.37 |
| STSb | 89.86 | 95.05 | **95.15** |

Bold indicates the best result in each row.

SentenceTransformer based on sentence-transformers/paraphrase-multilingual-mpnet-base-v2

This is a sentence-transformers model finetuned from sentence-transformers/paraphrase-multilingual-mpnet-base-v2. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
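
The Pooling module above enables only `pooling_mode_mean_tokens`, so each sentence vector is an average of the token embeddings produced by the Transformer module. The following is a minimal sketch of that averaging in plain Python, assuming padding positions are excluded through the attention mask (the usual way mean pooling is implemented). The function name `mean_pool` and the toy values are illustrative and not part of the library.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose attention-mask entry is 1."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input: three real tokens plus one padding token, 4 dimensions instead of 768.
tokens = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
    [9.0, 9.0, 9.0, 9.0],  # padding position, ignored because its mask entry is 0
]
mask = [1, 1, 1, 0]

sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
```

In the real model the same operation runs over at most 128 token positions (`max_seq_length`) and 768 dimensions (`word_embedding_dimension`).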

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Gameselo/STS-multilingual-mpnet-base-v2")
# Run inference
sentences = [
    '一个女人正在洗澡。',
    'A woman is taking a bath.',
    'En jente børster håret sitt',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
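
To make the `[3, 3]` similarity matrix concrete, here is a dependency-free sketch of what such a matrix contains. It assumes cosine similarity, which I believe is the library default when no other similarity function is configured; this card does not state the function explicitly. The toy vectors and helper names (`cosine`, `similarity_matrix`) are hypothetical stand-ins for real 768-dimensional embeddings.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(embeddings: List[List[float]]) -> List[List[float]]:
    """Pairwise cosine scores; entry [i][j] compares sentence i with sentence j."""
    return [[cosine(u, v) for v in embeddings] for u in embeddings]


toy_embeddings = [
    [1.0, 0.0, 1.0],  # stands in for the source sentence
    [2.0, 0.0, 2.0],  # same direction as the source, so the score is close to 1
    [0.0, 1.0, 0.0],  # orthogonal to the source, so the score is 0
]
scores = similarity_matrix(toy_embeddings)

# Rank the candidates (rows 1 and 2) against the source sentence (row 0).
best = max(range(1, len(toy_embeddings)), key=lambda j: scores[0][j])
print(f"closest candidate index: {best}")
```

Row 0 of the matrix is what a semantic-search use case would read: the scores of every other sentence against the source sentence.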

Evaluation

Metrics

Semantic Similarity

Metric Value
pearson_cosine 0.9551
spearman_cosine 0.9593
pearson_manhattan 0.927
spearman_manhattan 0.9383
pearson_euclidean 0.9278
spearman_euclidean 0.9394
pearson_dot 0.876
spearman_dot 0.8865
pearson_max 0.9551
spearman_max 0.9593
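
The metric names combine a correlation type (Pearson or Spearman) with the scoring function applied to the embedding pairs (cosine, Manhattan, Euclidean, dot). As a rough illustration, the sketch below computes both correlations between gold labels and predicted scores in plain Python. The score values are invented for the example and the helper names are hypothetical; this is not the evaluator used to produce the table.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Linear correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Rank correlation: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


gold = [0.0, 1.0, 2.5, 4.0, 5.0]            # human similarity labels
predicted = [0.10, 0.20, 0.35, 0.90, 0.95]  # invented cosine scores

print(f"pearson:  {pearson(gold, predicted):.4f}")
print(f"spearman: {spearman(gold, predicted):.4f}")
```

Spearman only looks at the ordering of the scores, which is why it can reach its maximum here even though the predicted scores are not a linear function of the labels.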

Evaluation results vs SOTA results

Metric Value
pearson_cosine 0.948
spearman_cosine 0.9515
pearson_manhattan 0.9252
spearman_manhattan 0.9352
pearson_euclidean 0.9258
spearman_euclidean 0.9364
pearson_dot 0.8443
spearman_dot 0.8435
pearson_max 0.948
spearman_max 0.9515

Training Details

Training Dataset

Unnamed Dataset

  • Size: 226,547 training samples
  • Columns: sentence_0, sentence_1, and label
  • Approximate statistics based on the first 1000 samples:

    |         | sentence_0 | sentence_1 | label |
    | --- | --- | --- | --- |
    | type    | string | string | float |
    | details | min: 3 tokens, mean: 20.05 tokens, max: 128 tokens | min: 4 tokens, mean: 19.94 tokens, max: 128 tokens | min: 0.0, mean: 1.92, max: 398.6 |

  • Samples:

    | sentence_0 | sentence_1 | label |
    | --- | --- | --- |
    | Bir kadın makineye dikiş dikiyor. | Bir kadın biraz et ekiyor. | 0.12 |
    | Snowden 'gegeven vluchtelingendocument door Ecuador'. | Snowden staat op het punt om uit Moskou te vliegen | 0.24000000953674316 |
    | Czarny pies idzie mostem przez wodę | Czarny pies nie idzie mostem przez wodę | 0.74000000954 |

  • Loss: AnglELoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "pairwise_angle_sim"
    }
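
As far as I understand the Sentence Transformers implementation, AnglELoss reuses the pairwise ranking objective of CoSENTLoss and swaps in an angle-based similarity (`pairwise_angle_sim`). The sketch below illustrates only that ranking objective with `scale = 20.0`, taking similarity scores as given; the angle similarity itself is not reproduced. The function name `ranking_loss` and the score values are hypothetical, and the labels are borrowed from the samples above.

```python
import math
from typing import Sequence


def ranking_loss(scores: Sequence[float], labels: Sequence[float], scale: float = 20.0) -> float:
    """log(1 + sum of exp(scale * (s_i - s_j)) over all pairs with label_i < label_j).

    A pair contributes a large term when the example with the lower label
    received the higher similarity score.
    """
    terms = [0.0]  # exp(0) = 1 supplies the "1 +" inside the logarithm
    for s_i, y_i in zip(scores, labels):
        for s_j, y_j in zip(scores, labels):
            if y_i < y_j:
                terms.append(scale * (s_i - s_j))
    peak = max(terms)  # subtract the maximum for a numerically stable log-sum-exp
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


labels = [0.12, 0.24, 0.74]

well_ordered = ranking_loss([0.1, 0.3, 0.8], labels)  # scores follow the label order
mis_ordered = ranking_loss([0.8, 0.3, 0.1], labels)   # scores reverse the label order

print(f"well ordered: {well_ordered:.4f}")
print(f"mis-ordered:  {mis_ordered:.4f}")
```

The objective depends only on the relative order of labels within a batch, not on their absolute scale.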
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 256
  • per_device_eval_batch_size: 256
  • num_train_epochs: 10
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • prediction_loss_only: True
  • per_device_train_batch_size: 256
  • per_device_eval_batch_size: 256
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 10
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin
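
With `lr_scheduler_type: linear`, `learning_rate: 5e-05`, and `warmup_steps: 0`, the learning rate should decay in a straight line from 5e-05 to zero over the run. A minimal sketch of that schedule follows, assuming the 8850 total steps reported in the Training Logs table; the function `linear_lr` is a hypothetical helper and not the Transformers scheduler itself.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 5e-05, warmup_steps: int = 0) -> float:
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 8850  # final step in the Training Logs table

for step in (0, 4425, 8850):
    print(f"step {step:>4}: lr = {linear_lr(step, TOTAL_STEPS):.2e}")
```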

Training Logs

Epoch Step Training Loss sts-dev_spearman_cosine sts-test_spearman_cosine
0.5650 500 10.9426 - -
1.0 885 - 0.9202 -
1.1299 1000 9.7184 - -
1.6949 1500 9.5348 - -
2.0 1770 - 0.9400 -
2.2599 2000 9.4412 - -
2.8249 2500 9.3097 - -
3.0 2655 - 0.9489 -
3.3898 3000 9.2357 - -
3.9548 3500 9.1594 - -
4.0 3540 - 0.9528 -
4.5198 4000 9.0963 - -
5.0 4425 - 0.9553 -
5.0847 4500 9.0382 - -
5.6497 5000 8.9837 - -
6.0 5310 - 0.9567 -
6.2147 5500 8.9403 - -
6.7797 6000 8.8841 - -
7.0 6195 - 0.9581 -
7.3446 6500 8.8513 - -
7.9096 7000 8.81 - -
8.0 7080 - 0.9582 -
8.4746 7500 8.8069 - -
9.0 7965 - 0.9589 -
9.0395 8000 8.7616 - -
9.6045 8500 8.7521 - -
10.0 8850 - 0.9593 0.6266
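
The Epoch and Step columns can be cross-checked against the dataset size and batch size reported earlier. The short calculation below assumes a single device, `gradient_accumulation_steps: 1`, and `dataloader_drop_last: False` (so the last partial batch still counts as a step); the variable names are illustrative.

```python
import math

train_samples = 226_547  # "Size" in the Training Dataset section
batch_size = 256         # per_device_train_batch_size
epochs = 10              # num_train_epochs

# The final partial batch is kept, so round up.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Fractional epoch reached at the first logged training step.
epoch_at_step_500 = round(500 / steps_per_epoch, 4)

print(steps_per_epoch, total_steps, epoch_at_step_500)
```

Under these assumptions the numbers line up with the log: one epoch ends at step 885 and the run ends at step 8850.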

Framework Versions

  • Python: 3.9.7
  • Sentence Transformers: 3.0.0
  • Transformers: 4.40.1
  • PyTorch: 2.3.0+cu121
  • Accelerate: 0.29.3
  • Datasets: 2.19.0
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

AnglELoss

@misc{li2023angleoptimized,
    title={AnglE-optimized Text Embeddings}, 
    author={Xianming Li and Jing Li},
    year={2023},
    eprint={2309.12871},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}