Model Card for Fine-Tuned BERT for Paraphrase Detection
Model Description
This is a fine-tuned version of BERT-base for paraphrase detection, trained on four benchmark datasets: MRPC, QQP, PAWS-X, and PIT. The model is designed for applications such as duplicate content detection, question answering, and semantic similarity analysis. It offers strong recall capabilities, making it effective in identifying paraphrases even in complex sentence structures.
- Developed by: Viswadarshan R R
- Model Type: Transformer-based Sentence Pair Classifier
- Language: English
- Finetuned from: bert-base-cased
Model Sources
- Repository: Hugging Face Model Hub
- Research Paper: Comparative Insights into Modern Architectures for Paraphrase Detection (Accepted at ICCIDS 2025)
- Demo: (To be added upon deployment)
Uses
Direct Use
- Identifying duplicate questions in customer support and FAQs.
- Improving semantic search in retrieval-based systems.
- Enhancing document deduplication and text similarity applications.
Downstream Use
This model can be further fine-tuned on domain-specific paraphrase datasets for industries such as healthcare, legal, and finance.
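For illustration, here is a minimal sketch of such downstream fine-tuning with the Hugging Face Trainer. The file name domain_pairs.csv, its sentence1/sentence2/label columns, and the epoch count are hypothetical placeholders; the learning rate and batch size mirror the values reported under Training Hyperparameters below.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, TrainingArguments, Trainer)

model_path = "viswadarshan06/pd-bert"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Hypothetical domain-specific paraphrase data: sentence1, sentence2, label (0/1).
dataset = load_dataset("csv", data_files={"train": "domain_pairs.csv"})

def tokenize(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="pd-bert-domain",
    learning_rate=2e-5,              # matches the reported learning rate
    per_device_train_batch_size=16,  # matches the reported batch size
    num_train_epochs=3,              # illustrative value
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```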
Out-of-Scope Use
- The model is monolingual, trained only on English datasets; multilingual use would require additional fine-tuning.
- May struggle with idiomatic expressions or complex figurative language.
Bias, Risks, and Limitations
Known Limitations
- Higher recall but lower precision: The model tends to over-identify paraphrases, leading to increased false positives.
- Contextual ambiguity: May misinterpret sentences that require deep contextual reasoning.
Recommendations
Users can mitigate the false-positive rate through post-processing or by tuning the confidence threshold applied to the model's output probabilities (see the sketch after the quick-start example below).
How to Get Started with the Model
To use the model, install transformers and load the fine-tuned model as follows:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model
model_path = "viswadarshan06/pd-bert"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

# Encode a sentence pair
inputs = tokenizer("The car is fast.", "The vehicle moves quickly.",
                   return_tensors="pt", padding=True, truncation=True)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predicted_class = logits.argmax().item()
print("Paraphrase" if predicted_class == 1 else "Not a Paraphrase")
```
Training Details
This model was trained using a combination of four datasets:
- MRPC: News-based paraphrases.
- QQP: Duplicate question detection.
- PAWS-X: Adversarial paraphrases for robustness testing.
- PIT: Short-text paraphrase dataset.
Training Procedure
- Tokenizer: BERT Tokenizer
- Batch Size: 16
- Optimizer: AdamW
- Loss Function: Cross-entropy
Training Hyperparameters
- Learning Rate: 2e-5
- Sequence Length:
  - MRPC: 256
  - QQP: 336
  - PIT: 64
  - PAWS-X: 256
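To make the configuration above concrete, here is a minimal sketch of a single training step using the stated optimizer, loss, learning rate, and per-dataset maximum lengths. The toy batch, the helper name, and the two-label head are illustrative assumptions, not the released training code.

```python
import torch
from torch.nn import CrossEntropyLoss
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Per-dataset maximum sequence lengths (from the hyperparameters above).
MAX_LENGTHS = {"mrpc": 256, "qqp": 336, "pit": 64, "paws-x": 256}

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5)
loss_fn = CrossEntropyLoss()

def training_step(sents_a, sents_b, labels, dataset="mrpc"):
    """One optimization step on a batch of sentence pairs (batch size 16 in the actual runs)."""
    batch = tokenizer(sents_a, sents_b, padding="max_length", truncation=True,
                      max_length=MAX_LENGTHS[dataset], return_tensors="pt")
    loss = loss_fn(model(**batch).logits, torch.tensor(labels))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Toy single-pair batch for illustration.
print(training_step(["The car is fast."], ["The vehicle moves quickly."], [1]))
```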
Speeds, Sizes, Times
- GPU Used: NVIDIA A100
- Total Training Time: ~6 hours
- Compute Units Used: 80
Testing Data, Factors & Metrics
Testing Data
The model was evaluated on the validation and test splits of MRPC, QQP, PAWS-X, and PIT using the following metrics:
- Accuracy
- Precision
- Recall
- F1-Score
- Runtime
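These are standard classification metrics; as a small sketch, they can be reproduced with scikit-learn given gold and predicted labels for a split. The y_true/y_pred arrays below are placeholders, not the actual evaluation data.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Placeholder labels: 1 = paraphrase, 0 = not a paraphrase.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-Score :", f1_score(y_true, y_pred))
```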
Results
BERT Model Evaluation Metrics
| Model | Dataset | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | Runtime (sec) |
|---|---|---|---|---|---|---|
| BERT | MRPC Validation | 88.24 | 88.37 | 95.34 | 91.72 | 1.41 |
| BERT | MRPC Test | 84.87 | 85.84 | 92.50 | 89.04 | 5.77 |
| BERT | QQP Validation | 87.92 | 81.44 | 86.86 | 84.06 | 43.24 |
| BERT | QQP Test | 88.14 | 82.49 | 86.56 | 84.47 | 43.51 |
| BERT | PAWS-X Validation | 91.90 | 87.57 | 94.67 | 90.98 | 6.73 |
| BERT | PAWS-X Test | 92.60 | 88.69 | 95.92 | 92.16 | 6.82 |
| BERT | PIT Validation | 77.38 | 72.41 | 58.57 | 64.76 | 4.34 |
| BERT | PIT Test | 86.16 | 64.11 | 76.57 | 69.79 | 0.98 |
Summary
This BERT-based Paraphrase Detection Model demonstrates strong recall capabilities, making it highly effective at identifying paraphrases across varied linguistic structures. While it tends to overpredict paraphrases, it remains a strong baseline for semantic similarity tasks and can be fine-tuned further for domain-specific applications.
Citation
If you use this model, please cite:
@inproceedings{viswadarshan2025paraphrase,
title={Comparative Insights into Modern Architectures for Paraphrase Detection},
author={Viswadarshan R R and Viswaa Selvam S and Felcia Lilian J and Mahalakshmi S},
booktitle={International Conference on Computational Intelligence, Data Science, and Security (ICCIDS)},
year={2025},
publisher={IFIP AICT Series by Springer}
}
Model Card Contact
📧 Email: [email protected]
🔗 GitHub: Viswadarshan R R