In 2008 alone, 5,491 new software vulnerabilities were documented, 1.6 million new malware signatures were developed, 245 million new attacks were reported, and more than 1 trillion dollars in revenue was lost to malware outbreaks. Most commercial anti-virus software is unable to detect new or previously unknown malware. Furthermore, the computational overhead of signature matching keeps growing as signature databases grow. It is noteworthy that most new malware is not written from scratch. In fact, 50% of new malware is simply a re-packed version of known malware. Moreover, 92% of malware uses some kind of packing technique to hide malicious code.
The only way to deal with the plethora of re-packed malware versions is to unpack a file before scanning it. Some commercial anti-virus software deploys dynamic unpacking techniques to extract the malicious code hidden inside malware. However, dynamic unpacking is computationally costly and often generates several possible variations of the unpacked code. Moreover, most benign executable files are not packed in the first place, so on a typical clean system anti-virus software will mostly encounter non-packed files. Trying to unpack a non-packed file simply yields the same file. In short, it is infeasible for commercial anti-virus software to try to unpack every file, given that most files are non-packed.
It is therefore logical for commercial anti-virus software to first distinguish between packed and non-packed files. This avoids the significant computational overhead of trying to unpack files that are not packed. Once files have been classified, the software can scan non-packed files directly and scan the unpacked versions of packed files.
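The triage flow described above can be sketched roughly as follows. This is only an illustration: `scan_with_triage` and the three callables it takes (`is_packed`, `unpack`, `scan`) are hypothetical names standing in for a packer detector, a dynamic unpacker, and a signature scanner; none of them comes from a real anti-virus API.

```python
from typing import Callable, Iterable, List


def scan_with_triage(
    files: Iterable[bytes],
    is_packed: Callable[[bytes], bool],      # hypothetical packer detector
    unpack: Callable[[bytes], List[bytes]],  # hypothetical unpacker; may return several variants
    scan: Callable[[bytes], bool],           # hypothetical scanner; True means "malicious"
) -> List[bool]:
    """Return one verdict per file, unpacking only files flagged as packed."""
    verdicts = []
    for data in files:
        if is_packed(data):
            # Costly path: scan every unpacked variation of the file.
            candidates = unpack(data)
        else:
            # Cheap path: non-packed files are scanned as they are.
            candidates = [data]
        verdicts.append(any(scan(candidate) for candidate in candidates))
    return verdicts
```

The point of the design is that the expensive `unpack` step runs only for the minority of files the detector flags, so the quality of `is_packed` determines both the cost and the coverage of the scan.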
Typical signature-based packer detectors search for pre-defined strings. Their main weakness is false negatives (i.e., packed files falsely classified as non-packed). To overcome this problem, in this project we investigate a heuristics-based scheme to detect packed files. The underlying intuition is that packers tend to modify the structure of the files they pack. We start by identifying a diverse set of features that can be statically extracted from a Windows executable file. To reduce complexity, we then filter these features by their discrimination power and keep only the most discriminative ones. The selected features are then used as input to data mining algorithms for classification.
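As a minimal sketch of the feature-filtering step, the code below ranks candidate features by information gain and keeps the top k. Both choices are assumptions made for illustration: the text does not say which features are extracted or how discrimination power is measured. Byte entropy is used here only as one plausible static feature, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return sum((n / total) * math.log2(total / n) for n in Counter(data).values())


def label_entropy(labels: Sequence[int]) -> float:
    """Shannon entropy of the class labels (1 = packed, 0 = non-packed)."""
    if not labels:
        return 0.0
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())


def information_gain(values: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Reduction in label entropy after splitting the samples at `threshold`."""
    low = [y for x, y in zip(values, labels) if x <= threshold]
    high = [y for x, y in zip(values, labels) if x > threshold]
    total = len(labels)
    remainder = (len(low) / total) * label_entropy(low) + (len(high) / total) * label_entropy(high)
    return label_entropy(labels) - remainder


def best_gain(values: Sequence[float], labels: Sequence[int]) -> float:
    """Best information gain over midpoints between consecutive distinct values."""
    distinct = sorted(set(values))
    thresholds = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max((information_gain(values, labels, t) for t in thresholds), default=0.0)


def select_features(table: Dict[str, List[float]], labels: Sequence[int], k: int) -> List[str]:
    """Assumed filter: keep the k feature names with the highest information gain."""
    ranked = sorted(table, key=lambda name: best_gain(table[name], labels), reverse=True)
    return ranked[:k]
```

With a toy table such as `{"entropy": [7.9, 7.5, 4.1, 5.0], "size_kb": [10, 200, 12, 210]}` and labels `[1, 1, 0, 0]`, `select_features(table, labels, 1)` keeps `"entropy"`, because a single threshold on it separates the two classes while no threshold on `"size_kb"` does. The surviving feature columns would then be passed to whichever classifier is chosen, which this sketch does not cover.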