bilingual-embedding-large

bilingual-embedding-large is a sentence-embedding model for two languages: French and English. It is built on XLM-RoBERTa, a pre-trained multilingual language model, and encodes English and French sentences into a 1024-dimensional vector space, supporting applications from semantic search to text clustering. The embeddings are intended to capture both the lexical and the contextual meaning of each sentence.

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BilingualModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
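
The last two modules above (mean-token pooling followed by normalization) can be illustrated with a minimal NumPy sketch. It is not the library's implementation: the function name `mean_pool_and_normalize` and the random stand-in token embeddings are assumptions made for illustration only.

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Average the token vectors of each sentence, ignoring padding, then L2-normalize."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                     # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                     # avoid division by zero
    pooled = summed / counts                                           # mean over real tokens
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)                        # unit-length vectors

# Toy input: 2 sentences, 4 token positions, 1024 dimensions.
# The random values are stand-ins for real encoder output.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(2, 4, 1024))
attention_mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])  # 0 marks padding

sentence_embeddings = mean_pool_and_normalize(token_embeddings, attention_mask)
print(sentence_embeddings.shape)
```

Because every output vector has unit length, the dot product of two sentence embeddings equals their cosine similarity.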

Training and Fine-tuning process

Stage 1: NLI Training

Dataset: SNLI + XNLI (English and French)
Method: Training with Multiple Negatives Ranking Loss. This stage focused on improving the model's ability to discern and rank nuanced differences in sentence semantics.
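
To make the idea behind this loss concrete, here is a rough NumPy sketch, assuming each anchor is paired with one positive and every other positive in the batch serves as a negative. The function name, the scale value of 20, and the toy vectors are illustrative assumptions, not the training code used for this model.

```python
import numpy as np

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over in-batch similarities: row i should rank positive i first."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)                              # (batch, batch), diagonal = true pairs
    scores = scores - scores.max(axis=1, keepdims=True)     # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy batch: positives are slightly perturbed copies of their anchors.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
positives = anchors + 0.01 * rng.normal(size=(4, 8))

loss_aligned = multiple_negatives_ranking_loss(anchors, positives)
loss_shuffled = multiple_negatives_ranking_loss(anchors, positives[::-1])  # pairs mismatched
print(loss_aligned, loss_shuffled)
```

The loss is low when each anchor is most similar to its own positive and rises when the pairs are mismatched, which is the ranking behaviour this stage trains for.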

Stage 3: Continued Fine-tuning for Semantic Textual Similarity on STS Benchmark

Dataset: STSB (French and English)
Method: Fine-tuning specifically for the semantic textual similarity benchmark using Siamese BERT-Networks configured with the 'sentence-transformers' library.
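
The card does not name the loss used at this stage. One common objective for STS fine-tuning of siamese networks regresses the cosine similarity of the two sentence embeddings onto the gold score; the sketch below assumes that objective and assumes gold scores on a 0 to 5 scale. The function name and toy vectors are hypothetical.

```python
import numpy as np

def cosine_similarity_loss(emb_a, emb_b, gold_scores, max_score=5.0):
    """Mean squared error between cosine similarity and the rescaled gold score."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cosine = np.sum(a * b, axis=1)                                    # one similarity per pair
    targets = np.asarray(gold_scores, dtype=np.float64) / max_score  # map 0..5 onto 0..1
    return float(np.mean((cosine - targets) ** 2))

# Toy pairs: the first pair points in the same direction, the second is orthogonal.
emb_a = np.array([[1.0, 0.0], [1.0, 0.0]])
emb_b = np.array([[2.0, 0.0], [0.0, 3.0]])

loss_consistent = cosine_similarity_loss(emb_a, emb_b, [5.0, 0.0])    # gold agrees with geometry
loss_inconsistent = cosine_similarity_loss(emb_a, emb_b, [0.0, 5.0])  # gold contradicts geometry
print(loss_consistent, loss_inconsistent)
```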

Stage 4: Advanced Augmentation Fine-tuning

Dataset: STSB, with silver samples generated from the gold samples
Method: Employed an advanced strategy using Augmented SBERT with Pair Sampling Strategies, integrating both Cross-Encoder and Bi-Encoder models. This stage further refined the embeddings by enriching the training data dynamically, enhancing the model's robustness and accuracy.
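
The data flow of this stage can be sketched as follows, under loud assumptions: `cross_encoder_score` is a hypothetical stand-in (plain token overlap) for a trained Cross-Encoder, and the pair sampling simply enumerates every unseen sentence combination, which is a simplification of the sampling strategies described in the Augmented SBERT paper. The example sentences and scores are invented for illustration.

```python
from itertools import combinations

# Gold pairs: (sentence_a, sentence_b, human similarity score in [0, 1]). Invented examples.
gold = [
    ("A man is eating.", "A man eats food.", 0.9),
    ("A dog runs.", "A plane is landing.", 0.0),
]

def cross_encoder_score(a, b):
    # HYPOTHETICAL stand-in for a trained Cross-Encoder:
    # Jaccard token overlap in [0, 1], used only so the sketch runs.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def build_silver(gold_pairs, scorer):
    """Recombine gold sentences into new pairs and label them with the scorer."""
    sentences = sorted({s for a, b, _ in gold_pairs for s in (a, b)})
    seen = {frozenset((a, b)) for a, b, _ in gold_pairs}
    silver = []
    for a, b in combinations(sentences, 2):
        if frozenset((a, b)) not in seen:       # keep only pairs absent from gold
            silver.append((a, b, scorer(a, b)))
    return silver

silver = build_silver(gold, cross_encoder_score)
train_set = gold + silver   # the Bi-Encoder would then be fine-tuned on gold + silver
print(len(gold), len(silver), len(train_set))
```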

Usage:

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer

sentences = ["Paris est une capitale de la France", "Paris is a capital of France"]

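# trust_remote_code=True is needed because the model relies on custom code
model = SentenceTransformer('Lajavaness/bilingual-embedding-large', trust_remote_code=True)

# Encode both sentences into 1024-dimensional vectors
embeddings = model.encode(sentences)
print(embeddings)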
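
A typical next step is to compare the resulting vectors. The sketch below computes pairwise cosine similarity with NumPy; the helper name `cosine_similarity_matrix` and the small toy vectors are assumptions for illustration, and in practice the input would be the array returned by `model.encode(sentences)`.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Return the matrix of pairwise cosine similarities between row vectors."""
    e = np.asarray(embeddings, dtype=np.float64)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)  # no-op if rows are already unit length
    return e @ e.T

# Toy stand-in vectors; replace with the output of model.encode(sentences).
toy_embeddings = np.array([[0.6, 0.8], [0.8, 0.6], [-0.6, -0.8]])
similarities = cosine_similarity_matrix(toy_embeddings)
print(np.round(similarities, 2))
```

Values close to 1 indicate sentences with similar meaning, so a French sentence and its English translation would be expected to score high.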

Evaluation

TODO

Citation

@article{conneau2019unsupervised,
  title={Unsupervised cross-lingual representation learning at scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}

@article{reimers2019sentence,
   title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
   author={Reimers, Nils and Gurevych, Iryna},
   journal={arXiv preprint arXiv:1908.10084},
   year={2019}
}

@article{thakur2020augmented,
  title={Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks},
  author={Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna},
  journal={arXiv e-prints},
  pages={arXiv--2010},
  year={2020}
}

Model size: 560M params (Safetensors)
Tensor type: F32

Evaluation results

All results are self-reported on the test set.

Dataset                      Metric           Value
MTEB AlloProfClusteringP2P   v_measure        65.300
MTEB AlloProfClusteringS2S   v_measure        55.368
MTEB AlloprofReranking       map              73.631
MTEB AlloprofReranking       mrr              74.697
MTEB AlloprofReranking       nAUC_map_diff1   56.611
MTEB AlloprofReranking       nAUC_map_max     21.353
MTEB AlloprofReranking       nAUC_mrr_diff1   55.983
MTEB AlloprofReranking       nAUC_mrr_max     22.297

v_measures on MTEB AlloProfClusteringP2P: 0.632560011824588, 0.6345771823814063, 0.6333686484625257, 0.6508206816667124, 0.6378451181543632
v_measures on MTEB AlloProfClusteringS2S: 0.5262468095085737, 0.586151012721014, 0.5192907959178751, 0.5610730679809162, 0.6360060059791816