Important: this model is not production-ready.


The classification task for v1 is split into two stages:

  1. URL features model
    • 96.5%+ accurate on training and validation data
    • 2,436,727 rows of labelled URLs
    • evaluation from v2: slightly overfitted, by roughly 0.8%
  2. Website features model
    • 98.4% accurate on training data, and 98.9% accurate on validation data
    • 911,180 rows of 42 features
    • evaluation from v2: slightly biased towards the URL feature (bert_confidence) over the other columns
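
One way to picture how the two stages could fit together is sketched below. It assumes that the stage-1 score is what fills the bert_confidence column used by stage 2; the list above suggests this but does not state it. The function name `classify`, the callables `url_model` and `site_model`, the 0.5 threshold, and the `num_scripts` feature are all hypothetical placeholders, not part of the released code.

```python
from typing import Callable, Dict


def classify(
    url: str,
    site_features: Dict[str, float],
    url_model: Callable[[str], float],
    site_model: Callable[[Dict[str, float]], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the model card
) -> bool:
    """Return True when the combined score reaches the threshold."""
    # Stage 1: score the raw URL string.
    bert_confidence = url_model(url)
    # Stage 2: add the URL score to the page-level features and score the row.
    row = dict(site_features, bert_confidence=bert_confidence)
    return site_model(row) >= threshold


# Stand-in models so the sketch runs without any downloads.
def fake_url_model(url: str) -> float:
    return 0.9 if "login" in url else 0.1


def fake_site_model(row: Dict[str, float]) -> float:
    return row["bert_confidence"]


flagged = classify("http://example.com/login", {"num_scripts": 3.0},
                   fake_url_model, fake_site_model)
clean = classify("http://example.com/", {"num_scripts": 3.0},
                 fake_url_model, fake_site_model)
```

In a real setup the two stand-ins would be replaced by the transformer and LightGBM models loaded in the sections further down.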

Training

I applied 5-fold cross-validation (cv=5) to the training dataset to search for the best hyperparameters. Here's the dict passed to sklearn's GridSearchCV class:

# GridSearchCV expects every value to be a list of candidates,
# so the fixed settings are wrapped in single-item lists.
params = {
    'objective': ['binary'],
    'metric': ['binary_logloss'],
    'boosting_type': ['gbdt', 'dart'],
    'num_leaves': [15, 23, 31, 63],
    'learning_rate': [0.001, 0.002, 0.01, 0.02],
    'feature_fraction': [0.5, 0.6, 0.7, 0.9],
    'early_stopping_rounds': [10, 20],
    'num_boost_round': [500, 750, 800, 900, 1000, 1250, 2000]
}
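
To get a feel for the cost of this search, the short sketch below counts the hyperparameter combinations in the grid and the resulting number of cross-validation fits. It uses only the standard library; the variable names are illustrative, and the final refit that GridSearchCV performs by default is not counted.

```python
from math import prod

# Only the parameters with more than one candidate value affect the count.
search_space = {
    'boosting_type': ['gbdt', 'dart'],
    'num_leaves': [15, 23, 31, 63],
    'learning_rate': [0.001, 0.002, 0.01, 0.02],
    'feature_fraction': [0.5, 0.6, 0.7, 0.9],
    'early_stopping_rounds': [10, 20],
    'num_boost_round': [500, 750, 800, 900, 1000, 1250, 2000],
}

cv_folds = 5
n_candidates = prod(len(values) for values in search_space.values())
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 1792 8960
```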

To reproduce the 98.4% accurate model, follow the data analysis on the dataset page to filter out the unimportant features, then train a LightGBM model with the hyperparameters best suited to this task:

params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.01,
    'feature_fraction': 0.6,
    'early_stopping_rounds': 10,
    'num_boost_round': 800
}

URL Features

from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("FredZhang7/malware-phisher")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/malware-phisher")

Website Features

# Install the dependency first (shell command): pip install lightgbm
import lightgbm as lgb
booster = lgb.Booster(model_file="phishing_model_combined_0.984_train.txt")
