Because deep learning models have a large number of parameters and are prone to overfitting, training them usually requires large-scale datasets. This requirement holds equally for detecting phishing attacks, which remain a persistent threat. Nonetheless, our study identified the following problems with the available datasets:
- Scarcity: Despite the numerous studies in the literature, few URL-based datasets are publicly available.
- Class imbalance: Some of the datasets suffer from class imbalance. In other words, legitimate URLs far outnumber phishing ones.
- Home-page-only legitimate samples: Legitimate samples in several datasets consist mainly of the home pages of the corresponding domains. Thus, legitimate URLs are significantly shorter than phishing ones, and URL length alone becomes an easy-to-detect characteristic.
- Low diversity: In many studies, phishing URLs are gathered from well-known services such as PhishTank or OpenPhish within a short period of time. Since these services report many similar URLs daily, the resulting datasets have low diversity.
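
The last three problems can be checked on a candidate dataset before training. The following is a minimal sketch rather than the procedure used in this study. It assumes the dataset is a list of `(url, label)` pairs with the hypothetical labels `"legit"` and `"phish"`, that both classes are non-empty, and that the share of distinct hostnames is an acceptable rough proxy for diversity. The function name `audit_url_dataset` and the report keys are illustrative.

```python
from statistics import mean
from urllib.parse import urlsplit


def audit_url_dataset(samples):
    """Summarize (url, label) pairs; label is assumed to be "legit" or "phish"."""
    by_label = {"legit": [], "phish": []}
    for url, label in samples:
        by_label[label].append(url)
    legit, phish = by_label["legit"], by_label["phish"]

    def is_home_page(url):
        # A home page has an empty or root path and no query string.
        parts = urlsplit(url)
        return parts.path in ("", "/") and not parts.query

    def host_diversity(urls):
        # Share of distinct hostnames: values near 0 mean many URLs
        # come from the same few hosts (a rough diversity proxy).
        hosts = {urlsplit(u).hostname for u in urls}
        return len(hosts) / len(urls)

    return {
        # Class imbalance: how many legitimate URLs per phishing URL.
        "legit_to_phish_ratio": len(legit) / len(phish),
        # Length gap between the two classes, in characters.
        "mean_len_legit": mean(len(u) for u in legit),
        "mean_len_phish": mean(len(u) for u in phish),
        # Fraction of legitimate samples that are bare home pages.
        "legit_home_page_fraction": sum(map(is_home_page, legit)) / len(legit),
        "phish_host_diversity": host_diversity(phish),
    }


# Toy data with made-up URLs on reserved example domains.
toy_samples = [
    ("https://example.com/", "legit"),
    ("https://example.org", "legit"),
    ("https://example.net/docs/page", "legit"),
    ("http://login.example.test/verify?id=1", "phish"),
    ("http://login.example.test/verify?id=2", "phish"),
]
report = audit_url_dataset(toy_samples)
```

A high `legit_to_phish_ratio`, a high `legit_home_page_fraction` combined with a large gap between the two mean lengths, or a low `phish_host_diversity` would each point to one of the problems listed above. What counts as "high" or "low" is left to the reader, since the study does not state thresholds.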