# Model Description
This model is a fine-tuned version of sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 for sentence similarity tasks. It was trained on the mteb/stsbenchmark-sts dataset to score the similarity between sentence pairs.
- **Model Type:** Sequence Classification (Regression)
- **Pre-trained Model:** sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- **Fine-Tuning Dataset:** mteb/stsbenchmark-sts
- **Task:** Sentence similarity (regression)

# Training Details

- **Training Objective:** Predict the similarity score between pairs of sentences.
- **Training Data:** mteb/stsbenchmark-sts, which contains sentence pairs with similarity scores.
- **Number of Labels:** 1 (regression)
- **Epochs:** 2
- **Batch Size:** 8
- **Learning Rate:** 2e-5
- **Weight Decay:** 0.01

# Evaluation

The model was evaluated using Pearson correlation on the validation set of the mteb/stsbenchmark-sts dataset. Results indicate how well the model predicts similarity scores between sentence pairs.
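For reference, here is a minimal sketch of how a setup like this could be reproduced with the Hugging Face `Trainer`. The dataset column names (`sentence1`, `sentence2`, `score`) and the normalization of gold scores to [0, 1] are assumptions, not taken from the original training code:

```python
from datasets import load_dataset
from scipy.stats import pearsonr
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)

dataset = load_dataset("mteb/stsbenchmark-sts")

def preprocess(batch):
    enc = tokenizer(batch["sentence1"], batch["sentence2"],
                    truncation=True, padding="max_length", max_length=128)
    # Assumption: gold scores (0-5) are normalized to [0, 1] for regression
    enc["labels"] = [s / 5.0 for s in batch["score"]]
    return enc

tokenized = dataset.map(preprocess, batched=True)

def compute_metrics(eval_pred):
    # Pearson correlation between predicted and gold similarity scores
    predictions, labels = eval_pred
    return {"pearson": pearsonr(predictions.squeeze(), labels)[0]}

args = TrainingArguments(
    output_dir="./paraphraser_model",
    num_train_epochs=2,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```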
# Usage
To use this model for sentence similarity, follow these steps:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("./paraphraser_model")
tokenizer = AutoTokenizer.from_pretrained("./paraphraser_model")

# Tokenize a sentence pair
sentences = ["The quick brown fox jumps over the lazy dog.",
             "A fast dark-colored fox leaps over a sleeping dog."]
encoded_input = tokenizer(sentences[0], sentences[1], return_tensors="pt",
                          truncation=True, padding="max_length", max_length=128)
```
## Compute Similarity Score
```python
import torch

# Perform inference without tracking gradients
with torch.no_grad():
    model_output = model(**encoded_input)

# Map the raw regression logit to a score with a sigmoid
logits = model_output.logits
similarity_score = torch.sigmoid(logits).item()
print(f"Similarity score between the two sentences: {similarity_score}")
```
## Mean Pooling Function
If using the model to generate sentence embeddings, load the checkpoint with `AutoModel` rather than `AutoModelForSequenceClassification` (so that the first element of the model output contains token embeddings instead of logits), and apply the following mean pooling function:
```python
import torch

def mean_pooling(model_output, attention_mask):
    # First element of model_output contains the token embeddings
    token_embeddings = model_output[0]
    # Expand the attention mask so padding tokens are excluded from the average
    input_mask_expanded = attention_mask.unsqueeze(-1).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, dim=1)
    sum_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)
    return sum_embeddings / sum_mask
```
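A minimal embedding sketch, assuming the fine-tuned checkpoint in `./paraphraser_model` can also be loaded as a bare encoder via `AutoModel` (the classification head is simply ignored); it reuses `mean_pooling` from above:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Load the checkpoint as a bare encoder; the classifier weights go unused.
tokenizer = AutoTokenizer.from_pretrained("./paraphraser_model")
encoder = AutoModel.from_pretrained("./paraphraser_model")

sentences = ["The quick brown fox jumps over the lazy dog.",
             "A fast dark-colored fox leaps over a sleeping dog."]
encoded = tokenizer(sentences, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    output = encoder(**encoded)

# Pool token embeddings into one vector per sentence
embeddings = mean_pooling(output, encoded["attention_mask"])

# Cosine similarity between the two sentence embeddings
cosine = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"Cosine similarity: {cosine.item():.4f}")
```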
# Limitations
- **Domain Specificity:** The model is fine-tuned on the mteb/stsbenchmark-sts dataset and may perform differently on other types of text or datasets.
- **Biases:** As with any model trained on human language data, it may inherit and reflect biases present in the training data.
# Future Work
Potential improvements include fine-tuning on additional datasets, experimenting with different architectures or hyperparameters, and incorporating additional training techniques to improve performance and robustness.
# Citation
If you use this model in your research, please cite it as follows:
```bibtex
@inproceedings{your_paper,
  title={Fine-Tuned Paraphrase-Multilingual-MiniLM-L12-v2 for Sentence Similarity},
  author={Your Name},
  year={2024},
  publisher={Your Institution}
}
```
# License
This model is licensed under the MIT License. See the LICENSE file for more information.