SentenceTransformer based on hon9kon9ize/bert-large-cantonese-nli

This is a sentence-transformers model finetuned from hon9kon9ize/bert-large-cantonese-nli on the yue-stsb, stsb and C-MTEB/STSB datasets. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: hon9kon9ize/bert-large-cantonese-nli
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
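
As a reminder of what the similarity function listed above computes, here is a minimal pure-Python sketch of cosine similarity. The toy 3-dimensional vectors and the helper name `cosine_similarity` are illustrative assumptions; in practice the model's own `similarity` method would be used on its 1024-dimensional embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors standing in for 1024-dimensional embeddings.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, so the similarity is close to 1
c = [-3.0, 0.0, 1.0]  # orthogonal to a, so the similarity is close to 0

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

Because the score depends only on direction, scaling an embedding does not change its similarity to other embeddings.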

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
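
The pooling configuration above enables only `pooling_mode_mean_tokens`, meaning the sentence embedding is the average of the token embeddings. The following is a rough pure-Python sketch of masked mean pooling, not the library's implementation; the function name `mean_pool` and the 2-dimensional toy vectors are assumptions made for illustration.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention-mask entry is 1.

    token_embeddings: list of per-token vectors (lists of floats)
    attention_mask:   list of 0/1 flags, 0 marking padding tokens
    """
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    if not kept:
        raise ValueError("at least one token must be unmasked")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Two real tokens followed by one padding token (toy 2-dimensional vectors).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector is ignored, so padding a batch to a common length does not shift the resulting sentence embedding.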

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("hon9kon9ize/bert-large-cantonese-sts")
# Run inference
sentences = [
    '一個細路女同一個細路仔喺度睇書。',
    '一個大啲嘅小朋友玩緊公仔,望住窗外。',
    '有個男人彈緊結他。',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])

Evaluation

Metrics

Semantic Similarity

| Metric          | sts-dev | sts-test |
|-----------------|---------|----------|
| pearson_cosine  | 0.7983  | 0.7638   |
| spearman_cosine | 0.7996  | 0.7605   |
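
For readers unfamiliar with the two metrics, the sketch below shows how Pearson and Spearman correlations between gold scores and predicted cosine similarities can be computed. It is a pure-Python illustration rather than the evaluator used to produce the numbers above, and the `gold` and `predicted` values are made up for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up gold scores and predicted cosine similarities for five pairs.
gold = [0.0, 0.2, 0.5, 0.8, 1.0]
predicted = [0.10, 0.05, 0.40, 0.95, 0.90]

print(round(pearson(gold, predicted), 4))
print(round(spearman(gold, predicted), 4))
```

Spearman only looks at the ordering of the scores, so it rewards a model that ranks pairs correctly even when its raw similarities are not on the same scale as the gold labels.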

Training Details

Training Dataset

yue-stsb

  • Dataset: yue-stsb at 40cea5d

  • Size: 5,749 training samples

  • Columns: sentence1, sentence2, and score

  • Approximate statistics based on the first 1000 samples:

    |         | sentence1 | sentence2 | score |
    |---------|-----------|-----------|-------|
    | type    | string | string | float |
    | details | min: 7 tokens; mean: 12.24 tokens; max: 40 tokens | min: 7 tokens; mean: 12.21 tokens; max: 30 tokens | min: 0.0; mean: 0.45; max: 1.0 |
  • Samples:

    | sentence1 | sentence2 | score |
    |-----------|-----------|-------|
    | 架飛機正準備起飛。 | 一架飛機正準備起飛。 | 1.0 |
    | 有個男人吹緊一支好大嘅笛。 | 有個男人吹緊笛。 | 0.76 |
    | 有個男人喺批薩上面灑碎芝士。 | 有個男人將磨碎嘅芝士灑落一塊未焗嘅批薩上面。 | 0.76 |
  • Loss: CosineSimilarityLoss with these parameters:

    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    
  • Size: 16,729 training samples

  • Columns: sentence1, sentence2, and score

  • Approximate statistics based on the first 1000 samples:

    |         | sentence1 | sentence2 | score |
    |---------|-----------|-----------|-------|
    | type    | string | string | float |
    | details | min: 5 tokens; mean: 20.29 tokens; max: 74 tokens | min: 6 tokens; mean: 20.36 tokens; max: 76 tokens | min: 0.0; mean: 0.52; max: 1.0 |
  • Samples:

    | sentence1 | sentence2 | score |
    |-----------|-----------|-------|
    | 奧巴馬登記咗參加奧巴馬醫保。 | 美國人爭住喺限期前登記參加奧巴馬醫保計劃, | 0.24 |
    | Search ends for missing asylum-seekers | Search narrowed for missing man | 0.28 |
    | 檢察官喺五月突然轉軚,要求公開驗屍報告,因為有利於辯方嘅康納·彼得森驗屍報告部分內容已經洩露畀媒體。 | 佢哋要求公開驗屍報告,因為彼得森腹中胎兒嘅驗屍報告中,對辯方有利嘅部分已經洩露俾傳媒。 | 0.8 |
  • Loss: CosineSimilarityLoss with these parameters:

    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
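
The loss configuration above pairs cosine similarity with `MSELoss`: the cosine similarity of the two sentence embeddings is compared with the gold score, and the squared error is averaged over the batch. The sketch below illustrates that computation in pure Python under simplifying assumptions (toy 2-dimensional embeddings, hypothetical helper names, no gradients); it is not the library's code.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosine_similarity_mse(pairs, gold_scores):
    """Mean squared error between cosine(u, v) and the gold score per pair."""
    errors = [(cosine(u, v) - s) ** 2 for (u, v), s in zip(pairs, gold_scores)]
    return sum(errors) / len(errors)

# Toy batch of two embedding pairs (made-up values).
batch = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction, cosine = 1
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal, cosine = 0
]
gold = [1.0, 0.5]

loss = cosine_similarity_mse(batch, gold)
print(loss)  # 0.125
```

The first pair contributes no error, while the second contributes (0 - 0.5)^2 = 0.25, giving a batch mean of 0.125.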
    

Evaluation Dataset

Unnamed Dataset

  • Size: 4,458 evaluation samples
  • Columns: sentence1, sentence2, and score
  • Approximate statistics based on the first 1000 samples:
    |         | sentence1 | sentence2 | score |
    |---------|-----------|-----------|-------|
    | type    | string | string | float |
    | details | min: 8 tokens; mean: 19.76 tokens; max: 53 tokens | min: 7 tokens; mean: 19.65 tokens; max: 53 tokens | min: 0.0; mean: 0.42; max: 1.0 |
  • Samples:
    | sentence1 | sentence2 | score |
    |-----------|-----------|-------|
    | 有個戴住安全帽嘅男人喺度跳舞。 | 有個戴住安全帽嘅男人喺度跳舞。 | 1.0 |
    | 一個細路仔騎緊馬。 | 個細路仔騎緊匹馬。 | 0.95 |
    | 有個男人餵老鼠畀條蛇食。 | 個男人餵咗隻老鼠畀條蛇食。 | 1.0 |
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • num_train_epochs: 4
  • warmup_ratio: 0.1
  • bf16: True

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
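
To make the interaction of `learning_rate: 5e-05`, `lr_scheduler_type: linear` and `warmup_ratio: 0.1` concrete, here is a small sketch of a linear warmup followed by linear decay. It follows the general shape of such a schedule rather than reproducing the Transformers implementation, and it assumes 524 total optimizer steps, the figure reported in the training logs; the function name `linear_warmup_lr` is hypothetical.

```python
import math

def linear_warmup_lr(step, total_steps, base_lr=5e-05, warmup_ratio=0.1):
    """Learning rate at a given optimizer step.

    Rises linearly from 0 to base_lr during warmup, then decays
    linearly back to 0 at total_steps.
    """
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

TOTAL_STEPS = 524  # assumption: taken from the final row of the training logs

for step in (0, 26, 53, 300, 524):
    print(step, linear_warmup_lr(step, TOTAL_STEPS))
```

Under these assumptions the warmup lasts 53 steps, so the peak learning rate is reached well before the first logged evaluation at step 100.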

Training Logs

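| Epoch  | Step | Training Loss | Validation Loss | sts-dev_spearman_cosine | sts-test_spearman_cosine |
|--------|------|---------------|-----------------|-------------------------|--------------------------|
| 0.7634 | 100  | 0.0549        | 0.0403          | 0.7895                  | -                        |
| 1.5267 | 200  | 0.027         | 0.0368          | 0.7941                  | -                        |
| 2.2901 | 300  | 0.0187        | 0.0349          | 0.7968                  | -                        |
| 3.0534 | 400  | 0.0119        | 0.0354          | 0.8004                  | -                        |
| 3.8168 | 500  | 0.0076        | 0.0359          | 0.7996                  | -                        |
| 4.0    | 524  | -             | -               | -                       | 0.7605                   |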
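
The epoch and step columns can be cross-checked with a little arithmetic. The sketch below assumes a single device and that the step count is driven by the 16,729-sample training set with a batch size of 128 (with `gradient_accumulation_steps: 1` and `dataloader_drop_last: False` as listed in the hyperparameters). The card does not state this explicitly, so treat it as an inference that happens to match the logged values.

```python
import math

train_samples = 16_729  # size reported for the larger training set
batch_size = 128        # per_device_train_batch_size
epochs = 4              # num_train_epochs

# The last partial batch is kept, so round up.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

def epoch_at(step):
    """Fractional epoch reached after a given number of optimizer steps."""
    return round(step / steps_per_epoch, 4)

print(steps_per_epoch)  # 131
print(total_steps)      # 524
print(epoch_at(100))    # 0.7634
print(epoch_at(500))    # 3.8168
```

Both the total of 524 steps and the fractional epochs agree with the log rows, which suggests the assumption is at least consistent with how the run was configured.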

Framework Versions

  • Python: 3.11.2
  • Sentence Transformers: 3.3.1
  • Transformers: 4.46.1
  • PyTorch: 2.4.0+cu121
  • Accelerate: 1.0.1
  • Datasets: 3.1.0
  • Tokenizers: 0.20.3

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}