SWPS40 Dataset - A Benchmark/Ground Truth Dataset for Assessment of Structure- and Vision-Based Web Page Similarity

The SWPS40 (Similar Web PageS) dataset gives researchers a ground truth against which to verify ranking results based on web page visual similarity. For this purpose, we collected screenshots and HTML+CSS+JS files of 40 web pages from different contexts and sectors.

The main goal of this dataset is to provide ground truth for visual-similarity-based rankings, collected from many participants. The web page pairs in the dataset were scored by 312 participants. During the study, each participant scored 100 different page pairs, yielding 31200 individual scores in total. As a result, 40 votes were collected for each page pair (e.g. P₁ and P₄), with the aim of producing statistically significant ground truth rankings.
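The figures above can be cross-checked with a few lines of arithmetic. This is only a sketch: it assumes that page pairs are unordered and that a page is never paired with itself, which is not stated explicitly but is consistent with the reported 40 votes per pair.

```python
from math import comb

pages = 40
participants = 312
scores_per_participant = 100

# Assumption: pairs are unordered and exclude self-pairs (P1-P4 equals P4-P1).
unordered_pairs = comb(pages, 2)

# Total number of individual scores collected in the study.
total_scores = participants * scores_per_participant

# Average number of votes per pair if the scores are spread evenly.
votes_per_pair = total_scores / unordered_pairs

print(unordered_pairs, total_scores, votes_per_pair)
```

Under these assumptions there are 780 distinct pairs, so 31200 scores correspond to 40 votes per pair.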

For whom?

For those who are interested in web page visual similarity.
Researchers may use the SWPS40 dataset to measure the performance of their similarity-based rankings with respect to web page visual similarity. Note that the SWPS40 dataset is not intended for assessing vision-based anti-phishing studies. Rather, it was built for comparing visual similarity algorithms against average human similarity judgment (obtained by collecting similarity scores from 312 participants).
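One possible way to compare an algorithm against the averaged human judgment is a rank correlation over the same list of page pairs. The sketch below computes Spearman's rank correlation in plain Python; it is one common choice and not necessarily the metric used in [1]. The sample values and the names human and algo are made up for illustration.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical data: mean human scores and algorithm similarities
# for the same four page pairs, in the same order.
human = [70.0, 10.0, 45.0, 90.0]
algo = [0.80, 0.10, 0.50, 0.95]
rho = spearman(human, algo)
print(rho)
```

A value near 1 means the algorithm orders the pairs much as the participants did; a value near -1 means the order is reversed. The function is undefined when either input is constant, because the rank variance is then zero.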

[1] Bozkir, A.S., Akcapinar Sezer, E., International Journal of Human Computer Studies, vol. 110, 2018

Ahmet Selman Bozkır (Ph.D), Ebru Akcapinar Sezer (Prof.)

SWPS40 dataset is composed of three main assets listed below:

HTML + JS + CSS files of the 40 web pages
Full page screenshots + cropped 1024x1024 px versions that were used in [1]
Ground truth scores collected from 312 participants, stored as CSV files

This lets you extract and use the structural information embedded in the DOM tree. In addition, screenshots of the corresponding web pages were taken to enable vision-based comparison.
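As a rough illustration of using the structural information, the sketch below counts how often each HTML tag occurs in a page with Python's standard html.parser module. The tag histogram is only an example feature chosen here, not the method used in [1], and the sample markup is made up; in practice the text would be read from one of the saved HTML files.

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Collects the number of occurrences of each start tag."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased from HTMLParser.
        self.counts[tag] += 1

def tag_histogram(html_text):
    """Return a Counter mapping tag name to occurrence count."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts

# Hypothetical markup standing in for a saved page from the dataset.
sample_html = (
    "<html><body><div><p>Hi</p><p>There</p></div>"
    "<img src='a.png'></body></html>"
)
hist = tag_histogram(sample_html)
print(hist)
```

Two pages could then be compared through their histograms, for example with a set or vector similarity of your choice.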

Important notes for the files

To view the list of home pages, open the "index.html" file, which lists all the web page titles (being offline is highly recommended).
The ground truth consists of 2 files: "PageURLs.csv" and "Scores.csv". The original page URLs are recorded in the former, and the scores are stored in the latter.
The structure of the "Scores.csv" file is as follows:
"UserID"-"PageIndexA"-"PageIndexB"-"Score"
The "UserID" field corresponds to the number of the participant.
"PageIndexA" and "PageIndexB" contain the index numbers of the two pages, ranging from 1 to 40.
The "Score" field holds the respective participant's bias-free score, ranging from 5 to 100.
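The per-pair ground truth can be derived from "Scores.csv" by averaging all participant scores for each pair. The sketch below assumes a comma-separated file with a header row named exactly UserID, PageIndexA, PageIndexB, Score, and treats (A, B) and (B, A) as the same pair; check both assumptions against the actual file. The in-memory sample rows are made up so the example runs without the dataset.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

def mean_pair_scores(csv_file):
    """Return {(a, b): mean score} with a < b from a Scores.csv-like file object."""
    by_pair = defaultdict(list)
    for row in csv.DictReader(csv_file):
        a, b = int(row["PageIndexA"]), int(row["PageIndexB"])
        # Assumption: pair order does not matter, so normalize to (low, high).
        by_pair[(min(a, b), max(a, b))].append(float(row["Score"]))
    return {pair: mean(scores) for pair, scores in by_pair.items()}

# Hypothetical rows in the assumed layout; replace with
# open("Scores.csv", newline="") to read the real file.
sample = io.StringIO(
    "UserID,PageIndexA,PageIndexB,Score\n"
    "1,1,4,80\n"
    "2,4,1,60\n"
    "1,2,3,10\n"
)
ground_truth = mean_pair_scores(sample)
print(ground_truth)
```

Sorting the pairs that involve a given page by mean score then yields a human-judged similarity ranking for that page.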

Click to fill the form for downloading the dataset.
Happy researching!
Last Update - 18.10.2018