metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:21769
  - loss:MultipleNegativesRankingLoss
base_model: am-azadi/bilingual-embedding-large_Fine_Tuned_2e
widget:
  - source_sentence: >-
      Amen.. This Quran was found at the bottom of the sea Has become a rock but
      still intact subhanallah, hopefully those who like it, comment amen and
      share this post sincerely the sustenance tomorrow morning will be abundant
      from the opposite direction unexpected.amen  اهيه
    sentences:
      - >-
        Mexico deserves an Oscar for the coffin dance The video of uniformed men
        doing "the coffin dance" was recorded in Colombia, not in Mexico
      - >-
        The Koran was found at the bottom of the sea already turned into a rock
        but still intact This is a dictionary covered in crystal and is a work
        of art by an American artist
      - >-
        Video purported to be a video celebrating the inauguration of Hamas' new
        office in the Indian state of Kerala False, this claim is a video
        celebrating the inauguration of the Hamas office in India
  - source_sentence: ' P Stay alert ! 6710 A Japanese man killed his friend just because he didn give him 6x scope in PUBG TAG A PUBG LOVER'
    sentences:
      - >-
        Japanese man killed friend over video game Japanese man killed his
        friend in row over video game?
      - >-
        This photo shows the Glastonbury festival after Greta Thunberg's
        participation in 2022 This Glastonbury festival photo is from 2015, not
        2022 after Greta Thunberg's speech
      - >-
        Footage of damaged building was shot in Russia in 2018 Footage shows
        Ukraine in 2022, not Russia in 2018
  - source_sentence: >-
      This is Manoj Tiwari, MP - North East Delhi I am busy bursting
      firecrackers, after bursting firecrackers all night, I wake up in the
      morning and say, "Today my eyes are burning in Delhi". Manoj Tiwari  Today
      my eyes are burning in Delhi, and yours? ,
    sentences:
      - >-
        Images show recent unrest and brutality in Uganda None of these images
        are related to Uganda’s ongoing political troubles
      - >-
        The photo shows Indian politician Manoj Tiwari lighting fireworks in
        Delhi during smog crisis. This image of an Indian lawmaker lighting a
        firecracker has circulated in reports since 2014
      - >-
        World Economic Forum tweet asks if age of consent should be lowered to
        13 Fabricated World Economic Forum tweet about 'lowering age of consent'
        misleads online
  - source_sentence: ' : He Yunshi was arrested, as expected, but better than expected even faster. . . 6-1 LICEN Fang Bomei BOOT UML'
    sentences:
      - >-
        In Chile they have just expropriated pensions It is not true that in
        Chile “the pensions have just been expropriated”
      - >-
        Four British Airways airline pilots have died from the covid-19 vaccine
        British Airways ruled out link between pilot deaths and vaccinations
      - >-
        Hong Kong Pro-democracy artist Denise Ho arrested in September 2021 Old
        photos of Hong Kong pro-democracy activist shared in false 'news' of her
        arrest
  - source_sentence: >-
      Uuuuu mepa that they killed the real bald guy EXCLUSIVE What are you doing
      bald, go getting into it jonca that the 12 wants to take pictures with you
      at any time 14:04 ✓ they found 2 contact cards for this number re add them
      to your contacts? T SEE CONTACT CARDS CELL PHONES TURNED OFF THEY LOOK FOR
      THEM EVERYWHERE THE "12" IS LOOKING FOR THEM "serne HD MRASSIA LEGAL: 11
      2159 6256 FOR POLICE COMPLAINTS: 11 2159 6256 HERE FOR POLICE COMPLAINTS:
      11-
    sentences:
      - >-
        The elected mayor of Medellín does not like ESMAD. WHY WILL IT BE? The
        original video shows Daniel Quintero in a demonstration against violence
        in Bogotá
      - >-
        Warning in Paris about stroke in children in post-covid vaccine era The
        stroke campaign in France is not about vaccinating children against
        covid-19
      - >-
        They find Diego Molina murdered in his apartment, the skinny from the
        funeral home who took photos with Diego Armando Maradona The images of a
        lacerated body are not of the person who was photographed with the
        corpse of Maradona
pipeline_tag: sentence-similarity
library_name: sentence-transformers

SentenceTransformer based on am-azadi/bilingual-embedding-large_Fine_Tuned_2e

This is a sentence-transformers model finetuned from am-azadi/bilingual-embedding-large_Fine_Tuned_2e. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

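Model Description

  • Model Type: Sentence Transformer
  • Base model: am-azadi/bilingual-embedding-large_Fine_Tuned_2e
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions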

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BilingualModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
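The Pooling and Normalize stages listed above can be illustrated in plain Python. This is a minimal sketch, not the library's implementation: the helper names `mean_pool` and `l2_normalize` are hypothetical, and the toy vectors have 3 dimensions instead of 1024. It assumes the transformer has already produced one vector per token plus an attention mask that marks padding.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    return [value / max(count, 1) for value in total]

def l2_normalize(vector):
    """Scale a vector to unit length, mirroring what a Normalize step does."""
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector] if norm > 0 else vector

# Toy input: three token vectors of dimension 3; the last one is padding.
tokens = [[1.0, 2.0, 2.0], [3.0, 2.0, 6.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)           # mean of the two unmasked tokens
sentence_embedding = l2_normalize(pooled)  # unit-length sentence vector
print(pooled, sentence_embedding)
```

Because the output has unit length, a dot product between two sentence embeddings equals their cosine similarity.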

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub (replace the placeholder below with this model's repository ID)
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'Uuuuu mepa that they killed the real bald guy EXCLUSIVE What are you doing bald, go getting into it jonca that the 12 wants to take pictures with you at any time 14:04 ✓ they found 2 contact cards for this number re add them to your contacts? T SEE CONTACT CARDS CELL PHONES TURNED OFF THEY LOOK FOR THEM EVERYWHERE THE "12" IS LOOKING FOR THEM "serne HD MRASSIA LEGAL: 11 2159 6256 FOR POLICE COMPLAINTS: 11 2159 6256 HERE FOR POLICE COMPLAINTS: 11-',
    'They find Diego Molina murdered in his apartment, the skinny from the funeral home who took photos with Diego Armando Maradona The images of a lacerated body are not of the person who was photographed with the corpse of Maradona',
    'The elected mayor of Medellín does not like ESMAD. WHY WILL IT BE? The original video shows Daniel Quintero in a demonstration against violence in Bogotá',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
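A common next step is to rank candidate fact-checks against a post by similarity score. The sketch below shows that ranking with toy 2-dimensional unit vectors standing in for the 1024-dimensional outputs of `model.encode`. The helper `rank_candidates` is hypothetical and not part of Sentence Transformers. It assumes the embeddings are unit length, as the model's final Normalize() stage suggests, so a dot product acts as cosine similarity.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def rank_candidates(query_vec, candidate_vecs):
    """Return (index, score) pairs sorted by descending similarity."""
    scores = [(i, dot(query_vec, vec)) for i, vec in enumerate(candidate_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy unit-length vectors (stand-ins for real model outputs).
post = [0.6, 0.8]
fact_checks = [[0.8, 0.6], [0.0, 1.0], [1.0, 0.0]]

ranking = rank_candidates(post, fact_checks)
best_index, best_score = ranking[0]
print(ranking)
```

With real embeddings, `post` would be one row of `model.encode(...)` and `fact_checks` the rows for the candidate texts.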

Training Details

Training Dataset

Unnamed Dataset

  • Size: 21,769 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 1000 samples:
    |         | sentence_0 | sentence_1 |
    |---------|------------|------------|
    | type    | string     | string     |
    | details | min: 4 tokens; mean: 122.97 tokens; max: 512 tokens | min: 17 tokens; mean: 38.24 tokens; max: 109 tokens |
  • Samples:
    | sentence_0 | sentence_1 |
    |------------|------------|
    | NEW HANDLING OF ALERT While the achieves 6,101,968 votes (i.e. 26.8%), the Ministry of the Interior only gives it 5,836,202 votes (i.e. 25.7%) to artificially make 's party appear in the lead . Hello Council of State? | The Ministry of the Interior manipulated the results of the legislative elections Legislative: why are the results of the 1st round contested by the Nupes? |
    | <3<3... Civil Registry Offices in Brazil: The only source that does not lie, as it issues all death certificates daily, for all reasons. This source cannot be disputed by anyone. Only they can say for sure, how many people die each day, and the reason for death. The rest is fake news. Via Jose Mendes Junior Updating... Deaths in Brazil: July 2019 - 119,390 (without pandemic) July 2020 - 113,475 (with pandemic) Source: transparencia.registrocivil.org.br... Now what are they going to say???? | More deaths were recorded in Brazil in July 2019, before the pandemic, than in July 2020, during the new coronavirus pandemic. Publications use partial data on deaths recorded in July 2020 |
    | Zimbabwe Police are taking disciplinary action with a church that refused to take closure instructions to prevent the spread of Coronavirus. | Worshipers beaten in Zimbabwe for failing to comply with coronavirus assembly ban No, worshipers have not been beaten by police in Zimbabwe for gathering during the coronavirus outbreak |
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
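The loss above treats each (sentence_0, sentence_1) pair as a match and uses the other sentence_1 entries in the batch as negatives. The following is a minimal pure-Python sketch of that idea, not the library's code: the function names `cos_sim` and `mnr_loss` and the 2-dimensional toy vectors are illustrative assumptions. It uses the card's settings of cosine similarity and a scale of 20.0.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for
    anchors[i] and every other positive in the batch acts as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, pos) for pos in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two pairs, matching the card's batch size of 2.
anchors = [[1.0, 0.0], [0.0, 1.0]]
well_matched = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_good = mnr_loss(anchors, well_matched)
loss_bad = mnr_loss(anchors, swapped)
print(loss_good, loss_bad)
```

With a batch size of 2 on a single device, each anchor would see only one in-batch negative, which may help explain why the logged loss values are small.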
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 2
  • num_train_epochs: 1
  • multi_dataset_batch_sampler: round_robin
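These settings determine how many optimizer steps make up the single training epoch. The arithmetic below is a sketch under two assumptions that the card does not state outright: training ran on one device, and the last partial batch was kept (dataloader_drop_last is False in the full list). The resulting step count agrees with the epoch fractions reported in the training logs.

```python
import math

num_samples = 21_769   # training set size from the card
batch_size = 2         # per_device_train_batch_size
num_devices = 1        # assumption: single-device training
grad_accum = 1         # gradient_accumulation_steps from the full list

# The final, smaller batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(num_samples / (batch_size * num_devices * grad_accum))
print(steps_per_epoch)  # 10885

def epoch_at(step):
    """Fraction of the epoch completed after a given optimizer step."""
    return round(step / steps_per_epoch, 4)

print(epoch_at(500), epoch_at(10500))  # 0.0459 0.9646
```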

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 2
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin
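The combination of learning_rate 5e-05, lr_scheduler_type linear, and zero warmup implies a learning rate that decays in a straight line to zero over training. The sketch below reproduces that shape in plain Python. The function `linear_lr` is a hypothetical helper, and the total of 10,885 steps is an assumption derived from 21,769 samples at batch size 2 on a single device.

```python
def linear_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 10_885  # assumption: ceil(21,769 / 2) steps on one device

for step in (0, 5_000, 10_885):
    print(step, linear_lr(step, total_steps))
```

Under these assumptions the rate starts at the full 5e-05 and reaches zero at the final step.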

Training Logs

Epoch Step Training Loss
0.0459 500 0.0148
0.0919 1000 0.0066
0.1378 1500 0.0245
0.1837 2000 0.0184
0.2297 2500 0.0174
0.2756 3000 0.0053
0.3215 3500 0.025
0.3675 4000 0.0105
0.4134 4500 0.0054
0.4593 5000 0.0076
0.5053 5500 0.0085
0.5512 6000 0.0104
0.5972 6500 0.0208
0.6431 7000 0.0072
0.6890 7500 0.0084
0.7350 8000 0.0053
0.7809 8500 0.0052
0.8268 9000 0.0064
0.8728 9500 0.0074
0.9187 10000 0.0083
0.9646 10500 0.008
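The logged loss fluctuates from one checkpoint to the next, so comparing averages over several rows gives a steadier picture than any single value. The sketch below copies the (step, loss) pairs from the table and compares the first seven rows with the last seven. The window size of seven is an arbitrary choice for illustration.

```python
# (step, training loss) pairs copied from the log table.
logs = [
    (500, 0.0148), (1000, 0.0066), (1500, 0.0245), (2000, 0.0184),
    (2500, 0.0174), (3000, 0.0053), (3500, 0.025), (4000, 0.0105),
    (4500, 0.0054), (5000, 0.0076), (5500, 0.0085), (6000, 0.0104),
    (6500, 0.0208), (7000, 0.0072), (7500, 0.0084), (8000, 0.0053),
    (8500, 0.0052), (9000, 0.0064), (9500, 0.0074), (10000, 0.0083),
    (10500, 0.008),
]

def mean_loss(rows):
    """Average the loss column over a slice of the log."""
    return sum(loss for _, loss in rows) / len(rows)

early = mean_loss(logs[:7])   # steps 500 to 3500
late = mean_loss(logs[-7:])   # steps 7500 to 10500
print(round(early, 4), round(late, 4))
```

On these numbers the early window averages about 0.016 and the late window about 0.007. There is no evaluation set in this run (eval_strategy is no), so this reflects training loss only.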

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 3.4.1
  • Transformers: 4.48.3
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.3.2
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}