THE ULTIMATE
PHISHING URL DATASET
FOR NLP BASED DEEP LEARNING

Grambeddings is the largest (800K samples) balanced, real-world dataset of labeled phishing and legitimate web URLs. Its main purpose is to give researchers a phishing URL dataset sampled from the real world for future NLP-based research.

Motivation

Today, because deep learning models have a large number of parameters and are prone to overfitting, training them usually requires large-scale datasets. This also holds for detecting phishing attacks, which remain a persistent threat. Nonetheless, our study identified the following problems with existing datasets:

  • Scarcity: Despite the existence of numerous studies in the literature, the number of publicly available URL-based datasets is limited.
  • Class imbalance: Some of the datasets suffer from class imbalance. In other words, the fraction of legitimate URLs is far larger than that of phishing ones.
  • Home pages only for legitimate sites: Legitimate samples in several datasets mainly consist of the home pages of the corresponding domains. Thus, legitimate URLs are significantly shorter than phishing ones, which makes the two classes easy to tell apart by length alone.
  • Low diversity: In many studies, phishing URLs are gathered from well-known services such as Phishtank or OpenPhish within a short period of time. This yields low diversity, since these services report many similar URLs daily.

Contributions

400K+400K LARGE AND LONG TERM COLLECTED DATASET

Our links were crawled over a long period, between May 2019 and June 2021, which helped us avoid including duplicate URLs. The number of URLs was adjusted to create a balanced dataset of 400K phishing and 400K legitimate samples. Our dataset was collected in 2.5 years.
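The deduplication and balancing described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `dedupe_and_balance`, the random downsampling strategy, and the `seed` parameter are all assumptions made for the example.

```python
import random


def dedupe_and_balance(phishing, legitimate, seed=0):
    """Drop exact duplicate URLs, then downsample the larger class.

    Hypothetical helper: the source only states that duplicates were
    avoided and that the classes were balanced, not how this was done.
    """
    # dict.fromkeys keeps the first occurrence and preserves order.
    phishing = list(dict.fromkeys(phishing))
    legitimate = list(dict.fromkeys(legitimate))

    # Downsample both classes to the size of the smaller one.
    n = min(len(phishing), len(legitimate))
    rng = random.Random(seed)
    return rng.sample(phishing, n), rng.sample(legitimate, n)
```

With 400K samples per class as the target, `n` would simply be capped at 400,000 instead of the size of the smaller class.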

PERIODICAL SAMPLING VIA CUSTOM CRAWLER

During data collection, we kept an interval of one week between URL collection sessions to avoid repeated records. Moreover, we designed and implemented a custom crawler that selects and filters legitimate URLs to create a realistic sample.

SIMILARITY REDUCTION

As a post-processing step, we manually filtered out URLs with almost identical domain information. In this way, the quality of the dataset was double-checked.
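The filtering above was done manually. For readers who want an automated approximation, the sketch below drops a URL when its hostname is nearly identical to one already kept. The similarity measure (`difflib.SequenceMatcher`), the `threshold` value, and the function names are assumptions for illustration and are not taken from the dataset's own tooling. It also assumes every URL includes a scheme such as `http://`, since `urlsplit` cannot find the hostname otherwise.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit


def hostname(url):
    """Return the lowercased hostname, or '' if it cannot be parsed."""
    return (urlsplit(url).hostname or "").lower()


def filter_similar_domains(urls, threshold=0.9):
    """Greedily keep URLs whose hostname differs enough from kept ones.

    Hypothetical helper. The pairwise comparison is quadratic in the
    number of kept hostnames, so it suits small batches only.
    """
    kept, kept_hosts = [], []
    for url in urls:
        host = hostname(url)
        too_similar = any(
            SequenceMatcher(None, host, seen).ratio() >= threshold
            for seen in kept_hosts
        )
        if not too_similar:
            kept.append(url)
            kept_hosts.append(host)
    return kept
```

For example, `secure-login1.example.com` and `secure-login2.example.com` differ in a single character, so only the first of the two would be kept at the default threshold.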

Our Features

Total Size

The Grambeddings dataset is a two-class dataset with 800,000 labeled samples in total. The samples are split equally so that both classes are equally represented.

Easy to Use

The dataset is designed to be easy to use with various machine learning frameworks such as scikit-learn, PyTorch, TensorFlow and Keras.
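As a rough illustration of how labeled URLs could be prepared for any of these frameworks, the sketch below reads `label,url` rows and turns each URL into a fixed-length list of character ids. The two-column CSV layout, the function names, and the encoding scheme are assumptions for the example; check the actual files shipped with the dataset for their real format.

```python
import csv


def load_labeled_urls(fileobj):
    """Read rows of the assumed form `label,url` into two parallel lists."""
    urls, labels = [], []
    for row in csv.reader(fileobj):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        label, url = row
        try:
            labels.append(int(label))
        except ValueError:
            continue  # skip a header row or a non-numeric label
        urls.append(url.strip())
    return urls, labels


def encode_chars(url, max_len=128):
    """Map characters to ids in 1..256 and right-pad with 0 to max_len."""
    ids = [min(ord(ch), 255) + 1 for ch in url[:max_len]]
    return ids + [0] * (max_len - len(ids))
```

The resulting lists of integers can be wrapped in an array or tensor type of the chosen framework and fed to a character-level embedding layer.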

Diversity Matters

Our phishing samples have an average string length of 86.21 characters, whereas legitimate samples average 46.43. The number of unique phishing domains is the largest, at 128,119.
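Statistics of this kind can be recomputed for any list of URLs with a few lines of code. The sketch below is an assumption-laden example: `url_stats` is a hypothetical helper, and it counts unique hostnames as a stand-in for "unique domains", which may differ from the definition used to obtain the figures above.

```python
from statistics import mean
from urllib.parse import urlsplit


def url_stats(urls):
    """Return the mean URL length and the number of unique hostnames.

    Expects a non-empty list of URLs that include a scheme.
    """
    hosts = {(urlsplit(u).hostname or "").lower() for u in urls}
    hosts.discard("")  # ignore URLs whose hostname could not be parsed
    return {
        "mean_length": mean(len(u) for u in urls),
        "unique_domains": len(hosts),
    }
```

Running it separately on the phishing and legitimate subsets gives the per-class averages.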

COMPARISON

Authors

FIRAT C. DALGIÇ

Crawling

AHMET SELMAN BOZKIR (Ph.D)

Curation - Collection

MURAT AYDOS (Ph.D)

Reviewer