Phish360 is a novel multimodal anti-phishing dataset featuring 10,748 real-world phishing and legitimate samples collected between 2020 and 2023. This dataset is meticulously designed to drive innovation and research in multimodal phishing detection by integrating visual and semantic features.
Despite the abundance of anti-phishing research, publicly available multimodal datasets are limited. This scarcity restricts the development and evaluation of models that can leverage different modalities (e.g., URLs, HTML, screenshots) for phishing detection.
Many datasets primarily include the homepages of legitimate websites, leading to a bias where legitimate URLs are significantly shorter. Phish360 addresses this by incorporating legitimate login pages for a realistic and balanced distribution.
Existing datasets are often collected from a narrow range of sources over a short timeframe. This leads to a lack of diversity, with many datasets containing similar or redundant URLs, reducing their value for real-world applications.
Existing datasets often suffer from missing content, duplicates, or offline pages. Phish360 guarantees data integrity by pre-validating every sample to ensure the URL, HTML, and screenshot are accessible and correctly rendered.
First dataset to enforce unique (URL, HTML, Image) triplets to eliminate data leakage. This ensures that models are trained on distinct samples, preventing overfitting to duplicates.
Includes samples in 30+ languages, moving beyond English-only biases. This global coverage ensures that detection models remain robust across different linguistic contexts.
Provided in Parquet format for efficient, column-oriented data retrieval. Because Parquet is columnar, researchers can read only the columns they need, which typically loads much faster than parsing equivalent CSV or JSON files.
Sets a standard baseline for comparing text-based, image-based, and hybrid models. The pre-defined train/test splits allow for fair and consistent performance evaluation across studies.
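The unique-triplet guarantee mentioned above can be illustrated with a minimal sketch. It assumes "unique" means an exact match on the full (URL, HTML, image) triplet, which this page does not spell out. The function names and the sample records are hypothetical and are not part of the Phish360 tooling.

```python
import hashlib


def triplet_key(url, html, image_bytes):
    """Fingerprint a sample by hashing each modality separately."""
    return (
        hashlib.sha256(url.encode("utf-8")).hexdigest(),
        hashlib.sha256(html.encode("utf-8")).hexdigest(),
        hashlib.sha256(image_bytes).hexdigest(),
    )


def find_duplicate_triplets(samples):
    """Return indices of samples whose full triplet already appeared earlier."""
    seen = set()
    duplicates = []
    for index, (url, html, image_bytes) in enumerate(samples):
        key = triplet_key(url, html, image_bytes)
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates


# Hypothetical records: the third repeats the first exactly,
# the second differs only in its HTML and therefore counts as distinct.
samples = [
    ("https://example.com/login", "<html>a</html>", b"\x89PNG-1"),
    ("https://example.com/login", "<html>b</html>", b"\x89PNG-1"),
    ("https://example.com/login", "<html>a</html>", b"\x89PNG-1"),
]
duplicate_indices = find_duplicate_triplets(samples)
```

Hashing each modality separately keeps the key small even when the HTML and screenshots are large. A stricter policy, such as rejecting any repeated URL, would need a different key.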
Screenshots at 1280x960 resolution are available for 100% of samples.
Full HTML source code is captured for 100% of samples.
URLs are stored as complete paths, including query parameters.
Metadata includes brand, TLD, SSL status, and more.
Parquet provides high-performance columnar storage for large datasets.
Samples cover 30 languages in the phishing class and 27 in the legitimate class.
Samples were collected over 2.5 years to capture evolving trends.
Modular architecture allows easy addition of new features.
| Dataset Name | URLs (%) | Domains (%) | TLD (%) | FLD (%) | Subdomains (%) |
|---|---|---|---|---|---|
| PWD2016 | 38.25 | 17.71 | 1.40 | 17.85 | 2.74 |
| PhishIntention | 87.21 | 42.62 | 1.56 | 43.04 | 24.46 |
| PILWD-134K | 86.68 | 45.23 | 1.08 | 46.41 | 21.35 |
| VanNL126K | 100.0 | 25.91 | 0.67 | 26.85 | 13.65 |
| Phish360 (Ours) | 98.26 | 73.63 | 6.69 | 73.86 | 28.69 |
| Dataset Name | URLs (%) | Domains (%) | TLD (%) | FLD (%) | Subdomains (%) |
|---|---|---|---|---|---|
| PWD2016 | 100.0 | 92.97 | 2.46 | - | 0.65 |
| PhishIntention | 87.94 | 82.98 | 2.17 | 86.68 | 3.21 |
| PILWD-134K | 99.23 | 91.37 | 0.76 | 92.57 | 3.27 |
| VanNL126K | 100.0 | 84.90 | 2.16 | 85.68 | 9.94 |
| Phish360 (Ours) | 99.41 | 88.73 | 3.07 | 88.92 | 7.29 |
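Diversity figures like those in the tables above could be computed roughly as follows. This sketch assumes each column reports distinct values as a percentage of all samples, which the page does not state explicitly. It uses only the standard library, so the TLD is approximated by the last hostname label. A proper first-level-domain (FLD) split needs a public-suffix list and is left out. All function names are hypothetical.

```python
from urllib.parse import urlsplit


def unique_ratio(values):
    """Percentage of distinct values among all values (0.0 for empty input)."""
    values = list(values)
    if not values:
        return 0.0
    return 100.0 * len(set(values)) / len(values)


def diversity_report(urls):
    """Uniqueness percentages for full URLs, hostnames, and naive TLDs."""
    hosts = [(urlsplit(u).hostname or "") for u in urls]
    # Naive TLD: the last dot-separated label of the hostname.
    tlds = [h.rsplit(".", 1)[-1] for h in hosts]
    return {
        "urls": unique_ratio(urls),
        "domains": unique_ratio(hosts),
        "tld": unique_ratio(tlds),
    }


# Hypothetical URLs, used only to exercise the functions.
sample = [
    "https://login.example.com/a",
    "https://login.example.com/b",
    "https://shop.example.org/a",
    "https://login.example.com/a",
]
report = diversity_report(sample)
```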
Easily load specific columns using Pandas.
import pandas as pd
# Load specific columns
cols = ['URL', 'full_html', 'BeautifulSoup_text', 'image_path', 'class']
df = pd.read_parquet('phish360.parquet', columns=cols)
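Once the columns are loaded, a quick check of URL length per class shows whether the length bias discussed earlier is present. The sketch below assumes the `URL` and `class` columns from the snippet above. The label values `legit` and `phish` and the demo rows are hypothetical stand-ins, since the actual label encoding is not given on this page.

```python
import pandas as pd


def url_length_by_class(df, url_col="URL", label_col="class"):
    """Summarize URL length (in characters) for each class label."""
    lengths = df[url_col].str.len()
    return lengths.groupby(df[label_col]).agg(["count", "mean", "median"])


# Hypothetical rows standing in for the real Parquet contents.
demo = pd.DataFrame({
    "URL": [
        "https://a.com/login",
        "https://b.org/",
        "http://x.io/verify?id=1",
    ],
    "class": ["legit", "legit", "phish"],
})
summary = url_length_by_class(demo)
```

On the real data, `url_length_by_class(df)` can be called directly on the frame returned by `pd.read_parquet`.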
Use the pre-defined split for reproducible results.
from sklearn.model_selection import train_test_split
import pandas as pd
# 1. Read datasets
phish = pd.read_parquet('Phish360_phish.parquet')
legit = pd.read_parquet('Phish360_legit.parquet')
# 2. Combine phishing and legitimate samples
df = pd.concat([phish, legit], ignore_index=True)
# 3. Stratified split (80/20, seed=42); assumes both files carry the 'class' label column
train, test = train_test_split(df, test_size=0.2, random_state=42, stratify=df['class'])
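One way to sanity-check a stratified split like the one above is to compare the class proportions of the two partitions. The helpers below are a hypothetical sketch that takes plain label sequences, for example `train['class'].tolist()`. The label values are illustrative only.

```python
from collections import Counter


def class_shares(labels):
    """Map each label to its share of the total (empty input gives {})."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def max_share_gap(train_labels, test_labels):
    """Largest absolute difference in class share between two partitions."""
    a = class_shares(train_labels)
    b = class_shares(test_labels)
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    return max(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)


# Hypothetical labels: both partitions are evenly balanced.
train_labels = ["phish"] * 4 + ["legit"] * 4
test_labels = ["phish", "legit"]
gap = max_share_gap(train_labels, test_labels)
```

A gap close to zero suggests the stratification behaved as intended. A large gap usually means the split was not stratified on the label column.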
Detailed performance metrics of CrossPhire (using ResNet50 and DenseNet121 vision encoders) across five standard phishing datasets.
| Dataset | Vision Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| PILWD-134K | ResNet50 | 98.04% | 97.88% | 98.10% | 97.83% |
| PILWD-134K | DenseNet121 | 98.07% | 97.84% | 98.21% | 98.83% |
| VanNL126K | ResNet50 | 99.42% | 99.57% | 99.72% | 99.63% |
| VanNL126K | DenseNet121 | 99.26% | 99.49% | 99.62% | 99.52% |
| PhishIntention | ResNet50 | 99.57% | 99.50% | 99.76% | 99.61% |
| PhishIntention | DenseNet121 | 99.63% | 99.54% | 99.83% | 99.66% |
| PWD2016 | ResNet50 | 100.00% | 100.00% | 100.00% | 100.00% |
| PWD2016 | DenseNet121 | 100.00% | 100.00% | 100.00% | 100.00% |
| Phish360 | ResNet50 | 97.71% | 97.99% | 96.29% | 96.02% |
| Phish360 | DenseNet121 | 97.96% | 97.90% | 96.99% | 96.53% |
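For readers reproducing the columns in the table above, the four metrics can be derived from confusion-matrix counts as sketched below. This uses the standard binary definitions with phishing as the positive class. The page does not say which averaging scheme CrossPhire reports, so this is an assumption, and the counts are made up for illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from binary confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, not taken from any dataset in the table.
metrics = binary_metrics(tp=90, fp=10, fn=5, tn=95)
```

Under these binary definitions F1 always lies between precision and recall. Rows where it does not may reflect a different averaging scheme, such as macro averaging across classes.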
Ahmad Hani Abdalla Almakhamreh, Ahmet Selman Bozkir
Applied Sciences, 2026, 16(2), 751