	---
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	- cybersecurity
	- cyber
	pipeline_tag: sentence-similarity
	library_name: sentence-transformers
	license: mit
	base_model:
	- google/canine-c
	metrics:
	- recall
	- accuracy
	- precision
	- f1
	datasets:
	- Anvilogic/Embedder-Typosquat-Training-Dataset
	---

	# SentenceTransformer

	This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned for typosquat detection. Given a domain and a typosquat of that domain,
	the embedding vectors the model produces for the two domains should be highly similar under standard similarity metrics (e.g. cosine similarity).
	The model lets you search your own custom lists of domains to detect phishing or spearphishing against your organization's specific websites
	by retrieving the likely typosquat targets of a suspicious domain. These targets can then be fed to our detection models, also included in this
	collection.

	Two observations from training this model:
	* Off-the-shelf embedding models like MPNet artificially inflate the similarity between domains that are conceptually related. For example, `android.com` and `google.com` are connected since Google owns Android, but the two domains are dissimilar in the context of typosquatting.
	* For typosquatting detection, character-level tokenization significantly outperforms classical tokenization methods.
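
To make the second observation concrete, the sketch below compares a domain and a typosquat at the character level. CANINE operates on Unicode code points rather than a subword vocabulary, so a one-character substitution changes exactly one input position. The helper names `to_code_points` and `differing_positions` and the example typosquat are illustrative assumptions, not part of any library, and the special tokens added by the real CANINE tokenizer are omitted.

```python
def to_code_points(domain: str) -> list[int]:
    """Map each character to its Unicode code point (simplified character-level view)."""
    return [ord(ch) for ch in domain]


def differing_positions(a: str, b: str) -> list[int]:
    """Return the indices at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("this toy comparison assumes equal-length strings")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


legit = "google.com"
squat = "g00gle.com"  # hypothetical typosquat: each 'o' replaced by '0'

# Only the substituted characters differ; every other position is identical.
print(to_code_points(legit))
print(to_code_points(squat))
print(differing_positions(legit, squat))
```

A subword tokenizer may split the two strings into quite different token sequences, whereas the character-level view keeps the edit local.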

	## Model Details

	### Model Description

	- Developed by: Anvilogic
	- Model Type: Sentence Transformer
	- Maximum Sequence Length: 2048 tokens
	- Output Dimensionality: 768 dimensions
	- Similarity Function: Cosine Similarity
	- Finetuned from model: [CANINE-c](https://huggingface.co/google/canine-c)
	- Language(s) (NLP): Multilingual
	- License: MIT

	### Full Model Architecture

	```
	SentenceTransformer(
	  (0): Transformer({'max_seq_length': 2048, 'do_lower_case': False}) with Transformer model: CanineModel
	  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
	  (2): Normalize()
	)
	```
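
Because the final `Normalize()` module scales each embedding to unit length, the dot product of two embeddings equals their cosine similarity. The sketch below illustrates this with toy 3-dimensional vectors standing in for the 768-dimensional embeddings. The vector values and helper names are made up for illustration.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


u = [3.0, 4.0, 0.0]  # toy vector, not a real embedding
v = [1.0, 2.0, 2.0]  # toy vector, not a real embedding

u_hat, v_hat = l2_normalize(u), l2_normalize(v)

# The two printed values agree: normalizing first turns the dot product into cosine similarity.
print(cosine(u, v))
print(dot(u_hat, v_hat))
```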

	## Usage

	### Direct Usage (Sentence Transformers)

	First install the Sentence Transformers library:

	```bash
	pip install -U sentence-transformers
	```

	Then you can load this model and run inference.

	```python
	from sentence_transformers import SentenceTransformer

	# Download from the 🤗 Hub
	model = SentenceTransformer("Anvilogic/Embedder-typosquat-detect-Canine")

	# Run inference on a list of domains
	sentences = [
	    "google.com",
	    "anvilogic.com",
	    "youtube.com",
	]
	embeddings = model.encode(sentences)
	print(embeddings.shape)
	# (3, 768)

	# Get the pairwise similarity scores for the embeddings
	similarities = model.similarity(embeddings, embeddings)
	print(similarities.shape)
	# torch.Size([3, 3])
	```
	### Downstream Usage

	This embedding model serves as a preliminary filter for identifying potentially typosquatted domains:

	- Generate Embeddings: Represent legitimate domains as dense vectors.
	- Similarity Scoring: Compute cosine similarity to find the legitimate domains closest to a target domain.
	- Cross-Encoder Confirmation: Pass similar pairs to a cross-encoder for final validation.

	This layered approach efficiently narrows down and confirms typosquatting candidates for cybersecurity tasks.
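
The first two steps can be sketched as a small retrieval routine. This is a minimal illustration, not the authors' pipeline: `top_k_candidates` is a hypothetical helper, the 2-D vectors stand in for `model.encode(...)` output, and the `min_score` cut-off is an arbitrary value chosen for the example. The cross-encoder confirmation step is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_candidates(query_vec, domain_vecs, domains, k=3, min_score=0.0):
    """Rank known domains by similarity to the query; keep the top k at or above min_score."""
    scored = [(d, cosine(query_vec, v)) for d, v in zip(domains, domain_vecs)]
    scored = [(d, s) for d, s in scored if s >= min_score]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for embeddings of the organization's legitimate domains (made-up values).
domains = ["google.com", "anvilogic.com", "youtube.com"]
domain_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

# Toy stand-in for the embedding of a suspicious domain.
query_vec = [0.9, 0.1]

candidates = top_k_candidates(query_vec, domain_vecs, domains, k=2, min_score=0.5)
for domain, score in candidates:
    print(domain, round(score, 3))
```

In practice the vectors would come from `model.encode`, and each surviving (suspicious domain, candidate target) pair would be passed on to the cross-encoder.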

	## Training Details

	### Framework Versions
	- Python: 3.10.14
	- Sentence Transformers: 3.2.1
	- Transformers: 4.46.2
	- PyTorch: 2.2.2
	- Tokenizers: 0.20.3

	### Training Data

	The model was fine-tuned using [Anvilogic/Embedder-Typosquat-Training-Dataset](https://huggingface.co/datasets/Anvilogic/Embedder-Typosquat-Training-Dataset), which contains pairs of domain names and their similarity labels.
	The dataset was filtered and converted to the parquet format for efficient processing.

	### Training Procedure
	The model was optimized with a multiple-negatives ranking loss ([MNRLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativessymmetricrankingloss)).
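
As a rough illustration of how this family of losses works, the sketch below computes an in-batch-negatives ranking loss in plain Python. For each anchor, its paired positive is treated as the correct class and every other positive in the batch as a negative. It is a simplified stand-in, not the library implementation: the function names, the toy 2-D vectors, and the `scale=20.0` value are assumptions made for the example. The symmetric variant that the link points to is sketched as the average of the loss in both directions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


def symmetric_mnr_loss(anchors, positives, scale=20.0):
    """Average of the anchor-to-positive and positive-to-anchor losses."""
    return 0.5 * (mnr_loss(anchors, positives, scale) + mnr_loss(positives, anchors, scale))


# Toy batch: two anchors with either correctly paired or swapped positives (made-up values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

# Correctly paired batches yield a much lower loss than mismatched ones.
print(mnr_loss(anchors, aligned))
print(mnr_loss(anchors, swapped))
print(symmetric_mnr_loss(anchors, aligned))
```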



	#### Training Hyperparameters
	- Model Architecture: Embedder fine-tuned from [CANINE-c](https://huggingface.co/google/canine-c)
	- Batch Size: 64
	- Epochs: 3
	- Learning Rate: 5e-5
	- Warmup Steps: 100
	- Weight Decay: 0.01

	## Evaluation

	In the final evaluation after training, the model achieved the following metrics on the test set:

	[Information Retrieval Evaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#informationretrievalevaluator)
	```json
	{
	  "info_retr_eval_dot_accuracy@1": 0.9938,
	  "info_retr_eval_dot_accuracy@3": 0.9991,
	  "info_retr_eval_dot_accuracy@5": 0.9999,
	  "info_retr_eval_dot_accuracy@10": 1.0,
	  "info_retr_eval_dot_precision@1": 0.9938,
	  "info_retr_eval_dot_precision@3": 0.333,
	  "info_retr_eval_dot_precision@5": 0.2,
	  "info_retr_eval_dot_precision@10": 0.1,
	  "info_retr_eval_dot_recall@1": 0.9938,
	  "info_retr_eval_dot_recall@3": 0.9991,
	  "info_retr_eval_dot_recall@5": 0.9999,
	  "info_retr_eval_dot_recall@10": 1.0,
	  "info_retr_eval_dot_ndcg@10": 0.9974,
	  "info_retr_eval_dot_mrr@10": 0.9965,
	  "info_retr_eval_dot_map@100": 0.9965
	}
	```
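
The `precision@k` values fall as k grows even though recall stays close to 1. This pattern is consistent with each query having a single relevant domain, in which case precision@k can be at most 1/k. That reading is an inference from the numbers, not something stated above. The sketch below shows the arithmetic for one query; `metrics_at_k` and the ranked list are illustrative, not the evaluator's code or data.

```python
def metrics_at_k(ranked, relevant, k):
    """Accuracy, precision and recall at cut-off k for a single query."""
    top = ranked[:k]
    hits = sum(1 for doc in top if doc in relevant)
    return {
        "accuracy": 1.0 if hits > 0 else 0.0,
        "precision": hits / k,
        "recall": hits / len(relevant),
    }


# Hypothetical ranking for one suspicious domain whose only true target is google.com.
ranked = ["google.com", "youtube.com", "anvilogic.com", "example.com", "example.org"]
relevant = {"google.com"}

for k in (1, 3, 5):
    print(k, metrics_at_k(ranked, relevant, k))
```

With one relevant item ranked first, precision is 1, 1/3 and 1/5 at k = 1, 3 and 5, while accuracy and recall remain 1.
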
	[Paraphrase Mining Evaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#paraphraseminingevaluator)
	```json
	{
	  "para_mine_eval_average_precision": 0.9307,
	  "para_mine_eval_f1": 0.9113,
	  "para_mine_eval_precision": 0.9197,
	  "para_mine_eval_recall": 0.9031,
	  "para_mine_eval_threshold": 0.6409
	}
	```
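
The reported F1 is the harmonic mean of the reported precision and recall, and the threshold is the similarity cut-off at which that F1 was obtained. The sketch below checks the arithmetic and shows one way such a threshold could be applied to a similarity score. `is_candidate_pair` is a hypothetical helper, and reusing the evaluation threshold in production is an assumption that should be validated on your own data.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def is_candidate_pair(similarity, threshold=0.6409):
    """Flag a domain pair as a typosquat candidate when similarity reaches the threshold."""
    return similarity >= threshold


# Reproduce the reported F1 from the reported precision and recall.
print(round(f1_score(0.9197, 0.9031), 4))

# Apply the threshold to two made-up similarity scores.
print(is_candidate_pair(0.82))
print(is_candidate_pair(0.41))
```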