SentenceTransformer based on Alibaba-NLP/gte-multilingual-base

This is a sentence-transformers model finetuned from Alibaba-NLP/gte-multilingual-base. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Alibaba-NLP/gte-multilingual-base
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

  • Documentation: Sentence Transformers Documentation (https://sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
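
Because the pipeline ends with a Normalize() module, the embeddings are L2-normalized, so cosine similarity reduces to a plain dot product. A minimal sketch with random stand-in vectors (not real model outputs):

import numpy as np

# For unit-length vectors, cosine similarity and the dot product coincide,
# which is why the model can pair Normalize() with cosine similarity.
a = np.random.randn(768); a /= np.linalg.norm(a)  # stand-in for an embedding
b = np.random.randn(768); b /= np.linalg.norm(b)
print(float(a @ b))  # == cosine similarity of a and b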

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer(
    "am-azadi/gte-multilingual-base_Fine_Tuned_1e",
    trust_remote_code=True,  # the gte (NewModel) architecture ships custom modeling code
)
# Run inference
sentences = [
    "ليس داعشياً من بيده المسدس ..انه جندي فرنسي ينفذ اعدامات بحق مواطنين عزل في الجزائر !!! لم يكن حينها لا تنظيم قاعدة ولا دولة اسلامية ولا نصرة ليلصقوا بهم منفردين تهمة الارهاب !! انتم ام واب واخ وابن وجد الارهاب ..  Not Daashaa of the pistol in his hand .. he's a French soldier executions carried out against unarmed civilians in Algeria !!! If not then it does not regulate not base an Islamic state nor a victory for Alsqoa their individual terrorism charge !! You are a mother and father and brother and the son of terror found ..  Non Daashaa du pistolet dans sa main .. Il est un soldat français exécutions menées contre des civils non armés en Algérie !!! Si non, alors il ne réglemente pas pas fonder un Etat islamique, ni une victoire pour Alsqoa leur charge individuelle du terrorisme !! Vous êtes une mère et père et le frère et le fils de la terreur trouvé .. # occupant",
    'Massacre perpétré par des soldats français en Algérie',
    'Video Of Attack On UP Minister Shrikant Sharma',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
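
The same API covers a simple semantic-search pattern: encode one query and a small corpus, then rank candidates with model.similarity. A sketch reusing the sentences above (the English query text is illustrative):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "am-azadi/gte-multilingual-base_Fine_Tuned_1e",
    trust_remote_code=True,
)
# Rank candidate fact-check claims against a single query.
query_emb = model.encode(["Massacre committed by French soldiers in Algeria"])
corpus_emb = model.encode([
    "Massacre perpétré par des soldats français en Algérie",
    "Video Of Attack On UP Minister Shrikant Sharma",
])
scores = model.similarity(query_emb, corpus_emb)  # shape: [1, 2]
print(int(scores.argmax()))  # index of the best match (expected: 0)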

Training Details

Training Dataset

Unnamed Dataset

  • Size: 25,743 training samples
  • Columns: sentence_0, sentence_1, and label
  • Approximate statistics based on the first 1000 samples:
    • sentence_0: string; min 2 tokens, mean 140.38 tokens, max 2514 tokens
    • sentence_1: string; min 5 tokens, mean 20.49 tokens, max 141 tokens
    • label: float; min 1.0, mean 1.0, max 1.0
  • Samples:
    • sentence_0: Olhem aí a mineradora da Noruega destruindo o meio ambiente na Amazônia. Lula vendeu o solo para a Noruega em documento secreto. Ela arrecada 2 bilhoes ao ano e devolve 180 milhoes para consertar o estrago que ela mesmo faz na Amazônia.
      sentence_1: O ex-presidente Lula vendeu o solo da Amazônia para uma empresa norueguesa
      label: 1.0
    • sentence_0: EL CONGRESO DANIE Cometió una burrada Al aprobar en primera votación con 113 votos a favor, 5 en contra y una abstención, que la vacuna contra el coronavirus sea de manera OBLIGATORIA para todos Que les pasa a estos genios de la política, acaso no saben que están violando leyes universales de Derechos Humanos¿Qué les pasa a estos congresistas?. . ¿ Acaso desconocen y pisotean las leyes internacionales que respaldan los Derechos Humanos Universales ???. . Absolutamente nadie puede ser obligado a vacunarse. . Igualmente, ningún procedimiento médico puede hacerse sin el consentimiento del paciente. . No lo digo yo, lo dice la UNESCO,la Organización de las Naciones Unidas para la Educación, la Ciencia y la Cultura.... Que en sus normativas explican lo siguiente : . SOLO UNO MISMO TIENE EL CONTROL DE SU PROPIO CUERPO, nadie tiene el control de nuestro cuerpo más que uno mismo, nadie puede intervenir en nuestro cuerpo bajo ninguna circunstancia sin nuestro consentimiento. . Legalmente bajo t...
      sentence_1: En Perú el Congreso aprobó que la vacuna contra el covid-19 sea obligatoria
      label: 1.0
    • sentence_0: Why changes to Legislation is so difficult. Debating PTSD in Emergency Services Debating Mental Health Stigma Debating Workers Compensation Debating Cancer Legislation for Firefighters Debating MP's Pay Debating PFAS Contamination Debating Suicide Figures in Australia Debating MP's AllowancesThis tells us everything we need to know about this Government’s priorities.
      sentence_1: Accurate description of photos showing the difference in attendance in various parliamentary sessions in Australia
      label: 1.0
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
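
For reference, MultipleNegativesRankingLoss treats the other in-batch positives as negatives: it scales the anchor-to-positive cosine-similarity matrix by scale and applies cross-entropy with the true pairs on the diagonal. A minimal re-derivation of the computation (a sketch, not the library's exact internals):

import torch
import torch.nn.functional as F

def mnr_loss(anchors: torch.Tensor, positives: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    # scores[i, j] = cos_sim(anchor_i, positive_j); true pairs sit on the diagonal.
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    scores = a @ p.T * scale
    labels = torch.arange(scores.size(0))  # anchor i should rank positive i first
    return F.cross_entropy(scores, labels)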
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 1
  • per_device_eval_batch_size: 1
  • num_train_epochs: 1
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 1
  • per_device_eval_batch_size: 1
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Training Logs

Epoch Step Training Loss
0.0194 500 0.0
0.0388 1000 0.0
0.0583 1500 0.0
0.0777 2000 0.0
0.0971 2500 0.0
0.1165 3000 0.0
0.1360 3500 0.0
0.1554 4000 0.0
0.1748 4500 0.0
0.1942 5000 0.0
0.2137 5500 0.0
0.2331 6000 0.0
0.2525 6500 0.0
0.2719 7000 0.0
0.2913 7500 0.0
0.3108 8000 0.0
0.3302 8500 0.0
0.3496 9000 0.0
0.3690 9500 0.0
0.3885 10000 0.0
0.4079 10500 0.0
0.4273 11000 0.0
0.4467 11500 0.0
0.4661 12000 0.0
0.4856 12500 0.0
0.5050 13000 0.0
0.5244 13500 0.0
0.5438 14000 0.0
0.5633 14500 0.0
0.5827 15000 0.0
0.6021 15500 0.0
0.6215 16000 0.0
0.6410 16500 0.0
0.6604 17000 0.0
0.6798 17500 0.0
0.6992 18000 0.0
0.7186 18500 0.0
0.7381 19000 0.0
0.7575 19500 0.0
0.7769 20000 0.0
0.7963 20500 0.0
0.8158 21000 0.0
0.8352 21500 0.0
0.8546 22000 0.0
0.8740 22500 0.0
0.8934 23000 0.0
0.9129 23500 0.0
0.9323 24000 0.0
0.9517 24500 0.0
0.9711 25000 0.0
0.9906 25500 0.0
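
The uniformly zero loss follows from per_device_train_batch_size: 1. With a single pair per batch, MultipleNegativesRankingLoss has no in-batch negatives, so the cross-entropy is computed over a 1x1 score matrix and is identically zero (with zero gradient). Continuing the mnr_loss sketch from the loss section:

import torch

# A 1x1 score matrix gives softmax probability 1 to the only candidate,
# so the loss (and its gradient) is exactly 0 regardless of the inputs.
print(mnr_loss(torch.randn(1, 768), torch.randn(1, 768)))  # tensor(0.)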

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 3.4.1
  • Transformers: 4.48.3
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.3.1
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}