Phish360: A Multimodal Anti-Phishing Dataset

Why Phish360?

Scarcity of Multimodal Datasets

Despite the abundance of anti-phishing research, publicly available multimodal datasets are limited. This scarcity restricts the development and evaluation of models that can leverage different modalities (e.g., URLs, HTML, screenshots) for phishing detection.

Legitimate Data Bias

Many datasets include mostly the homepages of legitimate websites, which introduces a bias: legitimate URLs end up significantly shorter than phishing URLs. Phish360 addresses this by incorporating legitimate login pages for a realistic and balanced distribution.
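The length bias described above is easy to check on any candidate dataset. The following is a minimal sketch, not part of the Phish360 tooling: the helper name `median_url_length` and the example URLs are hypothetical and only illustrate why homepages skew short compared with login pages.

```python
from statistics import median

def median_url_length(urls):
    """Return the median character length of a collection of URLs."""
    return median(len(u) for u in urls)

# Hypothetical examples, not taken from Phish360.
homepages = ["https://example.com/", "https://example.org/"]
login_pages = [
    "https://example.com/account/login?next=%2Fdashboard",
    "https://example.org/auth/signin?lang=en",
]

print("homepages:", median_url_length(homepages))
print("login pages:", median_url_length(login_pages))
```

A large gap between the two medians suggests a model could separate the classes by URL length alone, which is the shortcut the login-page samples are meant to remove.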

Low Diversity in Samples

Existing datasets are often collected from a narrow range of sources over a short timeframe. This leads to a lack of diversity, with many datasets containing similar or redundant URLs, reducing their value for real-world applications.

Data Integrity Issues

Existing datasets often suffer from missing content, duplicates, or offline pages. Phish360 guarantees data integrity by pre-validating every sample to ensure the URL, HTML, and screenshot are accessible and correctly rendered.
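As a rough illustration of a completeness check on a locally stored sample, the sketch below tests that all three modalities are present. It is an assumption-laden example, not the authors' validation pipeline: the function name `is_complete` is hypothetical, and it only checks presence, not whether a page rendered correctly.

```python
import os

def is_complete(url, html, image_path):
    """Return True if the URL, HTML, and screenshot are all present.

    Presence check only: a non-empty URL, non-empty HTML, and a
    non-empty screenshot file on disk. Rendering quality is not verified.
    """
    has_url = bool(url and url.strip())
    has_html = bool(html and html.strip())
    has_image = os.path.isfile(image_path) and os.path.getsize(image_path) > 0
    return has_url and has_html and has_image
```

Samples failing such a check would be dropped before any deduplication or splitting step.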

Contributions

Multimodal Triplet

The first dataset to enforce unique (URL, HTML, Image) triplets, eliminating data leakage. Models are trained and evaluated on distinct samples, so reported scores do not reflect memorized duplicates.
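One possible reading of triplet uniqueness is exact-match deduplication over all three modalities. The sketch below follows that reading; the function names and the use of SHA-256 digests are assumptions for illustration, and the authors' actual procedure may differ (for example, by using near-duplicate detection).

```python
import hashlib

def triplet_key(url, html, image_bytes):
    """Build a hashable key from a (URL, HTML, image) triplet."""
    html_digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    image_digest = hashlib.sha256(image_bytes).hexdigest()
    return (url, html_digest, image_digest)

def drop_duplicate_triplets(samples):
    """Keep the first occurrence of each triplet, preserving input order."""
    seen = set()
    unique = []
    for url, html, image_bytes in samples:
        key = triplet_key(url, html, image_bytes)
        if key not in seen:
            seen.add(key)
            unique.append((url, html, image_bytes))
    return unique
```

Running this before the train/test split ensures that an identical sample cannot land on both sides of the split.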

Linguistic Diversity

Includes samples in 30+ languages, moving beyond English-only biases. This global coverage ensures that detection models remain robust across different linguistic contexts.

Optimized Processing

Provided in Parquet format for efficient, column-oriented data retrieval. Because only the requested columns are read from disk, researchers can load large datasets much faster than with row-oriented CSV or JSON files.

Reproducible Benchmarks

Sets a standard baseline for comparing text-based, image-based, and hybrid models. The pre-defined train/test splits allow for fair and consistent performance evaluation across studies.

Our Features

Screenshots

1280x960 resolution available for 100% of samples.

Raw HTML

Full source code captured for 100% of samples.

Full URLs

Complete paths including query parameters.

Rich Metadata

Includes Brand, TLD, SSL status, and more.

Parquet Format

High-performance columnar storage for big data.

Multi-Language

30 languages in phishing samples and 27 in legitimate samples.

Time-Spaced

Collected over 2.5 years to capture trends.

Easily Extendable

Modular architecture allows easy addition of new features.

Dataset Statistics

[Figure: Class Distribution]

[Figure: Linguistic Diversity]

Comparison

Phishing URL Domain Statistics

Dataset Name	URLs (%)	Domains (%)	TLD (%)	FLD (%)	Subdomains (%)
PWD2016	38.25	17.71	1.40	17.85	2.74
PhishIntention	87.21	42.62	1.56	43.04	24.46
PILWD-134K	86.68	45.23	1.08	46.41	21.35
VanNL126K	100.0	25.91	0.67	26.85	13.65
Phish360 (Ours)	98.26	73.63	6.69	73.86	28.69

Legitimate URL Domain Statistics

Dataset Name	URLs (%)	Domains (%)	TLD (%)	FLD (%)	Subdomains (%)
PWD2016	100.0	92.97	2.46	-	0.65
PhishIntention	87.94	82.98	2.17	86.68	3.21
PILWD-134K	99.23	91.37	0.76	92.57	3.27
VanNL126K	100.0	84.90	2.16	85.68	9.94
Phish360 (Ours)	99.41	88.73	3.07	88.92	7.29
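The page does not define how the percentages in these tables are computed. If they are read as the share of distinct values among all samples (an assumption), a comparable figure can be sketched with the standard library. The function names below are hypothetical. The TLD is approximated by the last hostname label, and FLD extraction is omitted because it needs the Public Suffix List.

```python
from urllib.parse import urlsplit

def unique_ratio(values):
    """Percentage of distinct values among all values."""
    values = list(values)
    if not values:
        return 0.0
    return 100.0 * len(set(values)) / len(values)

def url_uniqueness(urls):
    """Uniqueness percentages for full URLs, hostnames, and naive TLDs."""
    urls = list(urls)
    hosts = [urlsplit(u).hostname or "" for u in urls]
    # Naive TLD: last dot-separated label (treats "co.uk" as "uk").
    tlds = [h.rsplit(".", 1)[-1] for h in hosts]
    return {
        "urls": unique_ratio(urls),
        "hosts": unique_ratio(hosts),
        "tlds": unique_ratio(tlds),
    }
```

Under this reading, higher values indicate less redundancy: many distinct hostnames and TLDs relative to the sample count.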

How to Use

Load Data

Easily load specific columns using Pandas.

import pandas as pd

# Load specific columns
cols = ['URL', 'full_html', 'BeautifulSoup_text', 'image_path', 'class']
df = pd.read_parquet('phish360.parquet', columns=cols)

Experiment

Use the pre-defined split for reproducible results.

from sklearn.model_selection import train_test_split
import pandas as pd

# 1. Read datasets
phish = pd.read_parquet('Phish360_phish.parquet')
legit = pd.read_parquet('Phish360_legit.parquet')

# 2. Combine phishing and legitimate samples
df = pd.concat([phish, legit], ignore_index=True)

# 3. Split (80/20, seed=42), stratified on the label
# Assumes both files carry the 'class' column used in the Load Data example
train, test = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df['class']
)

Experimental Results

CrossPhire Performance on All Benchmark Datasets

Detailed performance metrics of CrossPhire (using ResNet50 and DenseNet121 vision encoders) across five standard phishing datasets.

Dataset	Vision Model	Accuracy	Precision	Recall	F1-Score
PILWD-134K	ResNet50	98.04%	97.88%	98.10%	97.83%
PILWD-134K	DenseNet121	98.07%	97.84%	98.21%	98.83%
VanNL126K	ResNet50	99.42%	99.57%	99.72%	99.63%
VanNL126K	DenseNet121	99.26%	99.49%	99.62%	99.52%
PhishIntention	ResNet50	99.57%	99.50%	99.76%	99.61%
PhishIntention	DenseNet121	99.63%	99.54%	99.83%	99.66%
PWD2016	ResNet50	100.00%	100.00%	100.00%	100.00%
PWD2016	DenseNet121	100.00%	100.00%	100.00%	100.00%
Phish360	ResNet50	97.71%	97.99%	96.29%	96.02%
Phish360	DenseNet121	97.96%	97.90%	96.99%	96.53%
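For readers reproducing the table, the four metrics can be computed from predictions as sketched below. This is a plain binary formulation with phishing as the positive class. The page does not state which label encoding or averaging convention CrossPhire uses, so both are assumptions here, and the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

With macro or weighted averaging the F1-Score can differ from the harmonic mean of the reported precision and recall, so the convention should be fixed before comparing numbers across studies.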

How to Cite

CrossPhire: Benefiting Multimodality for Robust Phishing Web Page Identification

Ahmad Hani Abdalla Almakhamreh, Ahmet Selman Bozkir

Applied Sciences, 2026, 16(2), 751

https://doi.org/10.3390/app16020751

Authors

Ahmet Selman Bozkir (Ph.D.)

Data curation, collection, initial filtering

Ahmad H. A. Almakhamreh

Data cleaning, visualization, exploratory data analysis, post-filtering