Classification of Packed Executables for Accurate Computer Virus Detection
Project Overview:
Modern malware writers pack malicious executables to evade signature-based detection. To detect packed malware, antivirus software must first unpack it. Unpacking has to be done dynamically and is computationally costly. Because a majority of benign executable files are not packed, unpacking every input file is inefficient. Efficient and accurate detection of packed executables is therefore important for commercial antivirus software, since it lets the scanner unpack only the files that need it. In this project, we aim to develop an accurate and efficient packed executable detector based on static features.
Data set:
To evaluate the accuracy of our packer classifier, we will collect executables from a fresh Windows installation; an overwhelming majority of these executables are non-packed. We will create the packed samples by running openly available executable packers on these same executables.
Feature Preprocessing:
This part will be handled by Zubair. We will design a set of features that can be leveraged for accurate and efficient packed executable detection. The intuition behind the features is twofold. First, the number of sections in an executable is reduced after packing. Second, the “randomness” of the byte distribution increases significantly after packing. To this end, we plan to use the entropy of different portions of the executable file as features.
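As a rough illustration of these two feature families, the sketch below computes the Shannon entropy of a byte string, the entropy of several equal-sized portions of a file, and the section count read from the PE header. The function names, the choice of four portions, and the equal-sized split are assumptions made for this example only; the proposal does not fix how the file is divided.

```python
import math
import struct
from collections import Counter
from typing import List, Optional


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def chunk_entropies(data: bytes, n_chunks: int = 4) -> List[float]:
    """Entropy of n_chunks equal-sized portions (the split is an assumption)."""
    size = max(1, math.ceil(len(data) / n_chunks))
    return [byte_entropy(data[i * size:(i + 1) * size]) for i in range(n_chunks)]


def pe_section_count(data: bytes) -> Optional[int]:
    """Read NumberOfSections from the COFF header, or None if not a PE file."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    # The 4-byte little-endian offset of the PE signature is stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if pe_offset + 8 > len(data) or data[pe_offset:pe_offset + 4] != b"PE\0\0":
        return None
    # COFF header: Machine (2 bytes), then NumberOfSections (2 bytes).
    (count,) = struct.unpack_from("<H", data, pe_offset + 6)
    return count
```

A buffer containing each of the 256 byte values exactly once has entropy 8.0 bits per byte, while a run of a single repeated byte has entropy 0.0. Compressed or encrypted content tends toward the high end of that range.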
Data Mining:
This part will be headed by Osama Al-Jawad. We will try different data mining techniques to find the most accurate and efficient model for detecting packed executables. We will also identify the important attributes in order to minimize redundancy and computational overhead. The final goal is to classify each executable file as packed or non-packed (a two-class classification problem). In terms of accuracy, our goal will be to minimize the probability of false negatives, that is, packed files classified as non-packed.
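The false-negative criterion can be made concrete with a small sketch. It assumes that label 1 means packed and label 0 means non-packed, which is a convention chosen for this example; the helper names are hypothetical and independent of whichever data mining tool is eventually used.

```python
def confusion_counts(labels, predictions):
    """Return (tp, fp, tn, fn), treating 1 as packed and 0 as non-packed."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(labels, predictions):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        else:
            fn += 1  # packed file that the model labeled non-packed
    return tp, fp, tn, fn


def false_negative_rate(labels, predictions):
    """Fraction of packed files that the model missed."""
    tp, _, _, fn = confusion_counts(labels, predictions)
    return fn / (tp + fn) if (tp + fn) else 0.0
```

Under this convention, a false negative is a packed file that skips unpacking and may therefore evade the signature scan, whereas a false positive only costs an unnecessary unpacking step.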
Visualization:
Pravesh will head the web-enabled visualization part. Since we are working on packer detection, we want to comprehensively analyze and visualize the accuracy of each data mining algorithm on the features we identify from our data. To this end, we will use the Google Chart API to build a web interface that presents the accuracy of each model through ROC curves and the areas under them, hosted on the MSU CSE web server.
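The data behind such a chart could be produced along the following lines. This is a minimal sketch that assumes each model outputs a numeric score per file (higher meaning more likely packed) and that label 1 means packed; the function names are hypothetical, and the charting step itself is not shown.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points for a ROC curve."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one packed and one non-packed sample")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

Each model would contribute one list of points to plot and one area value to compare. An area of 1.0 corresponds to scores that rank every packed file above every non-packed one.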