SOTA Accuracy (Closed set)
The Malevis dataset provides researchers with an RGB-based ground-truth dataset for evaluating vision-based multi-class malware recognition studies.
For this purpose, we supply a corpus of byte images covering 26 (25+1) classes. One class represents the "legitimate" samples, while the remaining 25 classes correspond to different malware types. To construct this corpus, we first extracted binary images from malware files (supplied by Comodo Inc) in 3-channel RGB form using the bin2png script developed by Sultanik. Because the resulting images were vertically long, we then resized them to two square resolutions (224x224 and 300x300 pixels).
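The byte-to-image step described above can be sketched as follows. This is a minimal standard-library illustration, not the bin2png script itself: the function names `bytes_to_rgb_rows` and `resize_nearest`, the zero-padding of the last row, the fixed row width, and the nearest-neighbour resizing are all assumptions made for the example.

```python
def bytes_to_rgb_rows(data: bytes, width: int) -> list:
    """Pack raw bytes into rows of (R, G, B) pixels, three bytes per pixel."""
    if width <= 0:
        raise ValueError("width must be positive")
    bytes_per_row = 3 * width
    # Zero-pad so the last row is complete (padding policy is an assumption).
    padded = data + bytes(-len(data) % bytes_per_row)
    rows = []
    for start in range(0, len(padded), bytes_per_row):
        chunk = padded[start:start + bytes_per_row]
        rows.append([tuple(chunk[i:i + 3]) for i in range(0, bytes_per_row, 3)])
    return rows


def resize_nearest(rows: list, size: int) -> list:
    """Nearest-neighbour resize of a non-empty pixel grid to size x size."""
    height, width = len(rows), len(rows[0])
    return [
        [rows[y * height // size][x * width // size] for x in range(size)]
        for y in range(size)
    ]


# A 30-byte input with 2 pixels per row gives a tall 5x2 grid,
# which is then squashed into a 4x4 square.
pixels = bytes_to_rgb_rows(bytes(range(30)), width=2)
square = resize_nearest(pixels, 4)
```

In practice the square size would be 224 or 300, and an image library would normally handle the resizing with a better interpolation filter.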
The Malevis dataset contains a total of 9100 training and 5126 validation RGB images. Every training class contains 350 image samples, while the number of images per class in the validation set varies. Since malware detection/recognition is by nature based on discriminating legitimate samples from malware, we provided a considerably larger set of "legitimate" samples for validation (350 in training vs 1482 in validation).
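The counts above can be cross-checked with a few lines of arithmetic. The sketch below assumes that 1482 is the number of "legitimate" validation images; the derived figures (malware-only validation images, closed-set training size) follow from that reading and are not stated in the text.

```python
NUM_CLASSES = 26        # 25 malware classes + 1 legitimate class
TRAIN_PER_CLASS = 350
VAL_TOTAL = 5126
VAL_LEGITIMATE = 1482   # assumption: the "1482" refers to legitimate validation images

train_total = NUM_CLASSES * TRAIN_PER_CLASS             # all training images
val_malware = VAL_TOTAL - VAL_LEGITIMATE                # validation images in the 25 malware classes
closed_set_train = (NUM_CLASSES - 1) * TRAIN_PER_CLASS  # training images without the legitimate class

print(train_total, val_malware, closed_set_train)  # 9100 3644 8750
```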
The directory structure of the dataset is organized so that it can be used without extra effort. You can therefore use it directly in several deep learning frameworks such as Caffe©, PyTorch©, TensorFlow© and Keras©.
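As an illustration of how such a directory structure can be consumed, the sketch below indexes one split using only the standard library. The layout it assumes (one sub-folder per class, each holding `.png` files) is a common convention for image classification datasets and is an assumption here, as is the helper name `index_split`; check the actual archive before relying on it.

```python
from pathlib import Path


def index_split(split_dir):
    """Return (class_names, samples) for a folder laid out as <split>/<class>/<image>.png.

    The one-folder-per-class layout and the .png extension are assumptions.
    """
    split_dir = Path(split_dir)
    # Sort class names so that label indices are reproducible across runs.
    classes = sorted(p.name for p in split_dir.iterdir() if p.is_dir())
    class_to_idx = {name: i for i, name in enumerate(classes)}
    samples = [
        (img, class_to_idx[name])
        for name in classes
        for img in sorted((split_dir / name).glob("*.png"))
    ]
    return classes, samples
```

Calling `index_split` on a training folder would return the sorted class names and a list of `(path, label_index)` pairs that can be fed to any framework's data loader.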
Treating the problem in closed-set form (i.e. excluding the legitimate samples), the DenseNet-based convolutional neural networks in our paper achieved 97.48% accuracy on the Malevis validation set. Validation results for the open-set version (i.e. including the legitimate samples in both training and validation) have not yet been published. On this web page, we will regularly publish the state-of-the-art results obtained on the Malevis dataset. We are also planning to list the scientific papers that cite this dataset.
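The difference between the two evaluation settings can be made concrete with a small accuracy helper. This is a hedged sketch, not the evaluation code from the paper: the function name `accuracy`, the `exclude_label` parameter, and the toy labels (`"legitimate"`, `"malware_a"`, `"malware_b"`) are hypothetical.

```python
def accuracy(y_true, y_pred, exclude_label=None):
    """Fraction of correct predictions, optionally ignoring one ground-truth class."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != exclude_label]
    if not pairs:
        raise ValueError("no samples left to score")
    return sum(t == p for t, p in pairs) / len(pairs)


# Toy labels, purely illustrative.
y_true = ["legitimate", "malware_a", "malware_a", "malware_b"]
y_pred = ["malware_a", "malware_a", "malware_b", "malware_b"]

closed_set = accuracy(y_true, y_pred, exclude_label="legitimate")  # scores the 3 malware samples
open_set = accuracy(y_true, y_pred)                                # scores all 4 samples
```

Note that in a true closed-set experiment the model is also trained without the legitimate class; filtering at scoring time, as done here, only approximates that setting.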
This dataset was collected by the Multimedia Information Lab of Hacettepe University Computer Engineering in cooperation with COMODO Inc. Note that Malevis is intended for academic use only. Please read the paper provided in the menu before proceeding. If you need the binaries themselves, you must contact us via e-mail.