In this project, we investigate a wide range of structural features that can be extracted from Windows executable files and used to distinguish packed files from non-packed files. The benefits we aim to obtain from the proposed approach are:
1. A more robust, non-signature-based packer detector,
2. Fewer false negatives, even at the cost of more false positives, and
3. Greater efficiency, with low computational complexity.
Windows executable files follow a standard format called the Microsoft Portable Executable and Common Object File Format (PECOFF). The figure below shows an overview of the structure of a Windows executable file.
We note that a Windows executable file contains several headers that act as a prelude to the core content held in the sections that follow. The fields of these headers provide a summary of the file's structure. One interesting feature that we highlight here is the number of entries in the import address table. In the histogram below, the red portions correspond to packed files and the blue portions to non-packed files. We note that non-packed files typically import more entries from external DLLs than packed files do.
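As an illustration of how header fields summarize a file's structure, the following minimal sketch reads a few fields of the COFF file header directly from the raw bytes of an executable. The function name `read_coff_header` and the choice of returned fields are assumptions made for this example; this is not the project's actual feature extractor, and it does not parse the import address table.

```python
import struct


def read_coff_header(data: bytes) -> dict:
    """Return a few COFF file header fields from the raw bytes of a PE file.

    Hypothetical helper for illustration only.
    """
    # Every PE file starts with a DOS header whose first two bytes are "MZ".
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not an MZ executable")

    # e_lfanew: 4-byte little-endian offset of the PE signature, stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("PE signature not found")

    # The 20-byte COFF file header immediately follows the 4-byte signature.
    if len(data) < pe_offset + 24:
        raise ValueError("truncated COFF header")
    (machine, number_of_sections, timestamp, _symbol_table, _symbol_count,
     optional_header_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4)

    return {
        "machine": machine,
        "number_of_sections": number_of_sections,
        "timestamp": timestamp,
        "optional_header_size": optional_header_size,
        "characteristics": characteristics,
    }
```

A caller would pass the full contents of a file, for example `read_coff_header(open(path, "rb").read())`, and could use fields such as `number_of_sections` as candidate structural features.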
We also compute the entropy of the different sections of each file. In general, we found that entropy values for packed files were significantly higher than those for non-packed files.
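The entropy of a section can be sketched as the Shannon entropy of its byte-value distribution, measured in bits per byte. The snippet below is a minimal illustration under that assumption; the text does not state the exact entropy definition used in the project, and the function name `shannon_entropy` is chosen for this example.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (range 0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # how often each byte value occurs
    # Sum of p * log2(1/p) over the byte values that actually appear.
    return sum((c / total) * math.log2(total / c) for c in counts.values())
```

Compressed or encrypted data has a nearly uniform byte distribution and therefore scores close to 8 bits per byte, whereas ordinary code and data sections tend to score lower, which is consistent with the higher entropy observed for packed files.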
We extracted more than one hundred features from Windows executable files. To select features quantitatively, we rank them by their information gain. For example, the two features discussed above had the highest information gain values in our data set (0.88 and 0.809, respectively).
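For reference, the information gain of a feature is the reduction in class-label entropy obtained by knowing the feature's value. The sketch below computes it for a feature that has already been discretized. The discretization and the function names are assumptions for illustration, since the text does not describe how continuous features were binned.

```python
import math
from collections import Counter, defaultdict


def entropy(labels) -> float:
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c)
               for c in Counter(labels).values())


def information_gain(feature_values, labels) -> float:
    """H(labels) - H(labels | feature) for a discrete feature."""
    if len(feature_values) != len(labels):
        raise ValueError("feature_values and labels must have equal length")
    # Group the class labels by the feature value they co-occur with.
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    # Conditional entropy: average entropy within each feature-value group,
    # weighted by group size.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

With two balanced classes (packed and non-packed), the gain ranges from 0 bits for an uninformative feature to 1 bit for a feature that separates the classes perfectly.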
We then feed the selected features into several data mining algorithms for classification. The details are provided in the following text.