Nikhil Goparapu

Malware Detection using Machine Learning

Github Repo

Part 1 Background

In recent years, the malware industry has evolved into a well-organized market involving vast sums of money. Many organizations around the world are investing heavily in technologies and methods to detect and deactivate malware files. The following are some popular methods used to detect malware:

Signature-Based Detection: Each known piece of malware carries a unique code, its signature, that can be used to identify it. When a system reads a file, it scans the file and compares the codes found in it against a database that holds a vast collection of such signatures. If a code is present in the database, the file is classified as malware.

Drawbacks: If there are new and previously unknown attacks whose signature (unique code) is not present in the database, then the attack will go undetected.
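The lookup described above, and its drawback, can be illustrated with a minimal sketch. It assumes the signature is simply a SHA-256 hash of the whole file; real products use richer signatures, and the database contents and function names here are hypothetical.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known malicious files.
# Assumption: a whole-file hash stands in for the "unique code" of a malware.
KNOWN_MALWARE_HASHES = {
    hashlib.sha256(b"known malicious payload").hexdigest(),
}


def file_signature(content: bytes) -> str:
    """Return the SHA-256 digest used here as the file's signature."""
    return hashlib.sha256(content).hexdigest()


def is_known_malware(content: bytes) -> bool:
    """Flag the file only if its signature is already in the database."""
    return file_signature(content) in KNOWN_MALWARE_HASHES


# An exact copy of a known sample is detected.
print(is_known_malware(b"known malicious payload"))
# A slightly modified (previously unseen) sample goes undetected.
print(is_known_malware(b"known malicious payload!"))
```

The second call shows the drawback: any file whose signature is not yet in the database passes the check.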

Heuristic Analysis: This is a rule-based detection method in which experts define a set of rules that a file must not violate. For example, some rules are:

Manipulation of the camera is banned.
Direct access to the hardware is prohibited.
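A rule check of this kind could look roughly like the sketch below. The behavior labels and function names are hypothetical; it assumes that some earlier analysis step has already produced a list of behaviors observed for the file.

```python
# Hypothetical behavior labels matching the two example rules above.
# Assumption: a separate analysis step reports which behaviors a file shows.
BANNED_BEHAVIORS = {"camera_manipulation", "direct_hardware_access"}


def violated_rules(observed_behaviors):
    """Return the banned behaviors that the file exhibits, sorted by name."""
    return sorted(BANNED_BEHAVIORS & set(observed_behaviors))


def is_suspicious(observed_behaviors):
    """A file is flagged as soon as it violates at least one rule."""
    return len(violated_rules(observed_behaviors)) > 0


print(violated_rules(["network_access", "camera_manipulation"]))
print(is_suspicious(["network_access"]))
```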

In this project, I use machine learning to detect malware. A machine learning model, when trained on a large amount of data, comes up with its own set of rules to classify whether a file is malicious or not.

Project Goal

The goal of this project is to extract useful features from the raw byte files and asm files and to build a supervised machine learning model on top of those features to classify malware files.

Dataset:

I collected the data from Kaggle. Microsoft open-sourced the dataset as part of a competition on Kaggle. It can be used for educational and research purposes.

The dataset consists of 10,868 byte files in hexadecimal representation and 10,868 asm files.
The files in the dataset represent a mix of 9 different families of malware.
A file may belong to any of the following 9 classes of malware: Ramnit, Lollipop, Kelihos_ver3, Vundo, Simda, Tracur, Kelihos_ver1, Obfuscator.ACY, Gatak.

Methodology:

Phase 1 Video

Phase 2

Exploratory Data Analysis

1. Distribution of Classes

The distribution of the data is not uniform: the dataset is imbalanced. We can clearly see from the histogram below that classes 1, 2, and 3 occur a lot, while classes 4, 5, and 7 occur the least. Among all the classes, the frequency of class 3 is the highest and the frequency of class 5 is the lowest. To maintain the same class distribution in both the training data and the testing data, I used stratified sampling to split the data into training and testing sets.
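A stratified split of this kind can be sketched with scikit-learn's `train_test_split`. The labels below are synthetic stand-ins rather than the Kaggle data, and the helper `class_fractions` and the `random_state` value are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in labels with an imbalanced 3-class distribution
# (60% / 30% / 10%); the real dataset has 9 classes.
y = np.array([1] * 60 + [2] * 30 + [3] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)


def class_fractions(labels):
    """Return {class: fraction of samples} for a label array."""
    classes, counts = np.unique(labels, return_counts=True)
    return {int(c): float(n) / len(labels) for c, n in zip(classes, counts)}


print(class_fractions(y_train))
print(class_fractions(y_test))
```

Both printed dictionaries should show roughly the same proportions as the full label array, which is the point of stratifying on an imbalanced dataset.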

2. Distribution of the sizes of the Byte Files per class.

I calculated the size of each byte file and drew a box plot of the byte file sizes for each class to compare their distributions across classes.

Observation:

The distribution of the byte file sizes for each class is not the same.
The byte file sizes of class 2 range from around 0.3 MB to approximately 13 MB.
The size of all the byte files in class 3 is just above 8 MB.
The byte file sizes of class 7 are between 4 MB and 6 MB.
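The per-class size comparison can be sketched as follows. This is a minimal illustration on small temporary files, not the author's code: the file names, the labels, and the choice to summarize each class by minimum, median, and maximum (instead of drawing the box plot) are assumptions.

```python
import os
import statistics
import tempfile
from collections import defaultdict


def size_in_mb(path):
    """Return the size of a file on disk in megabytes."""
    return os.path.getsize(path) / (1024 * 1024)


def size_summary_per_class(paths_and_labels):
    """Group file sizes by class and return (min, median, max) per class."""
    sizes = defaultdict(list)
    for path, label in paths_and_labels:
        sizes[label].append(size_in_mb(path))
    return {
        label: (min(values), statistics.median(values), max(values))
        for label, values in sizes.items()
    }


# Demo on small temporary files standing in for the real byte files.
with tempfile.TemporaryDirectory() as tmp:
    demo = []
    for name, n_bytes, label in [
        ("a.bytes", 1024, 1),
        ("b.bytes", 3072, 1),
        ("c.bytes", 2048, 2),
    ]:
        path = os.path.join(tmp, name)
        with open(path, "wb") as handle:
            handle.write(b"0" * n_bytes)
        demo.append((path, label))
    summary = size_summary_per_class(demo)

print(summary)
```

The same function applies unchanged to the asm files, since it only looks at file sizes.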

3. Distribution of the sizes of the asm Files per class.

I calculated the size of each asm file and drew a box plot of the asm file sizes for each class to compare their distributions across classes.

Observation :

The distribution of the size of the asm files is different for each class.
The asm files are large compared to the byte files.

Feature Engineering

1. Extracting Unigrams from byte files.

The first image below shows what a raw byte file looks like. It is a hexadecimal representation of the file's binary content. In total there are 256 unique values: 00, 01, 02, ..., FE, FF.

Just as we extract unigrams from documents, I extracted unigrams from the raw byte files and created a dataframe (second image below). Each row in the dataframe represents a byte file, and each column represents a unique hexadecimal value. Each cell holds the number of occurrences of that hexadecimal value in that byte file. The last column (??) holds the count of missing values in each byte file.

1. Image of a byte file

2. Unigrams Dataframe
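The unigram counting described above can be sketched in plain Python. The sketch assumes that each line of a byte file starts with an address followed by space-separated two-character byte tokens; the function name and the sample lines are made up for illustration.

```python
from collections import Counter

# The 256 hexadecimal byte values plus "??" for missing bytes: 257 features.
TOKENS = [format(i, "02X") for i in range(256)] + ["??"]


def byte_unigram_counts(lines):
    """Count each byte token in a byte file, skipping the leading address."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        # Assumption: parts[0] is the address; the rest are byte tokens.
        counts.update(token.upper() for token in parts[1:])
    # One value per token, in a fixed column order (a row of the dataframe).
    return [counts[token] for token in TOKENS]


sample = [
    "00401000 56 8D 44 24 08 50 8B F1",
    "00401008 00 00 ?? ?? 56 FF FF FF",
]
row = byte_unigram_counts(sample)
print(len(row), row[TOKENS.index("56")], row[-1])
```

Stacking one such row per file gives the 257-column unigram table.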

2. Extracting Images from asm files

I extracted images from the asm files. Below are sample images for 3 classes of malware.

An Image from Class 1 (Ramnit)

An Image from Class 2 (Lollipop)

An Image from Class 3 (Kelihos_ver3)
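One common way to turn a file into an image, in the spirit of the "Malware Images" reference, is to read its raw bytes as 8-bit grayscale pixels and wrap them into rows of a fixed width. The sketch below illustrates that idea only; the width of 256 and the decision to drop an incomplete last row are assumptions, not details taken from this project.

```python
import numpy as np


def bytes_to_image(data: bytes, width: int = 256) -> np.ndarray:
    """Interpret raw bytes as 8-bit grayscale pixels in rows of `width`."""
    pixels = np.frombuffer(data, dtype=np.uint8)
    height = len(pixels) // width  # assumption: drop the incomplete last row
    return pixels[: height * width].reshape(height, width)


# Stand-in for the bytes read from a file with open(path, "rb").read().
data = bytes(range(256)) * 4 + b"\x01\x02\x03"
image = bytes_to_image(data, width=256)
print(image.shape, image.dtype)
```

The resulting array can be saved or displayed with any imaging library as a grayscale picture.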

3. Visualizing the 257-D Unigrams data on a 2-D plane using t-SNE

I reduced the 257-dimensional data to 2 dimensions using t-SNE. Below is the scatter plot of the 2-dimensional data. Each color represents a different class. We can see that points of the same color are clustered together, with some overlap in the center of the image.

t-SNE mainly preserves the local structure of the data when converting high-dimensional data to low-dimensional data. It also suffers from the crowding problem, which might be the reason we see some overlap in the center of the image. There is a high chance that these data points are well separated in the 257-dimensional space.
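The projection step can be sketched with scikit-learn's `TSNE`. The input here is random count data standing in for the unigram table, and the perplexity, initialization, and seed are illustrative assumptions rather than the settings used in this project.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic stand-in for the unigram matrix: 60 files x 257 count features.
X = rng.poisson(lam=5.0, size=(60, 257)).astype(float)

# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
X_2d = tsne.fit_transform(X)

print(X_2d.shape)  # (60, 2)
```

Each row of `X_2d` is the 2-D position of one file and can be drawn as a scatter plot colored by class label.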

Modeling

Random Forest

The first model that I used is a Random Forest, which is a good model to start with. It has 100 decision trees (n_estimators = 100), each with a maximum depth of 10.

I have sampled 80% of the data as my training data and 20% of the data as my test data. I have used stratified sampling since the distribution of the classes is not uniform.

Performance:

Accuracy on Training Data: 98.98%
Accuracy on Test Data: 97.79%
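The training setup above can be sketched with scikit-learn as follows. The data is synthetic, so the accuracies it produces are unrelated to the figures reported above; only `n_estimators = 100`, the depth of 10, and the stratified 80/20 split come from the text, and the seed values are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in: 3 imbalanced classes, 300 samples, 20 features
# whose means depend on the class.
y = np.repeat([0, 1, 2], [150, 100, 50])
X = rng.normal(loc=y[:, None] * 3.0, scale=1.0, size=(300, 20))

# Stratified 80/20 split, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"train accuracy: {train_acc:.4f}, test accuracy: {test_acc:.4f}")
```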

Phase 2 Video

Phase 3

Image Features and File sizes

I extracted the first 800 pixels from each image that I created and appended them to the unigrams dataframe. I also added the byte file size and asm file size of each file to the unigrams dataframe.

After appending the image features and file sizes to the unigrams dataframe, the dataframe contains:

10868 rows.
1060 columns.
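Taking the first 800 pixels of an image can be sketched as below. Flattening the image row by row and zero-padding images with fewer than 800 pixels are assumptions made for this illustration; the text does not say how short images are handled.

```python
import numpy as np


def first_n_pixels(image: np.ndarray, n: int = 800) -> np.ndarray:
    """Flatten the image row by row and keep its first n pixel values."""
    flat = image.reshape(-1)[:n]
    # Assumption: images with fewer than n pixels are padded with zeros.
    padded = np.zeros(n, dtype=image.dtype)
    padded[: flat.size] = flat
    return padded


image = (np.arange(1000) % 256).astype(np.uint8).reshape(10, 100)
features = first_n_pixels(image)
print(features.shape)  # (800,)
```

Each file then contributes one such 800-value vector as extra columns next to its unigram counts and file sizes.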

Visualizing the 1060-D data on a 2-D plane using t-SNE

Modeling

I decided to use XGBoost for modeling. I tuned the hyperparameters using random search and found that the best hyperparameters are:

n_estimators = 800
max_depth = 3
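A random search over these two hyperparameters can be sketched with scikit-learn's `RandomizedSearchCV`. To keep the example dependent on scikit-learn only, it uses `GradientBoostingClassifier` as a stand-in estimator; XGBoost's `XGBClassifier` follows the scikit-learn estimator interface and could be passed in its place. The data, the candidate values, and the number of iterations and folds are all illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
# Synthetic stand-in data: 3 classes, 120 samples, 10 features.
y = np.repeat([0, 1, 2], 40)
X = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(120, 10))

# Candidate values are kept small so the sketch runs quickly.
param_distributions = {
    "n_estimators": [10, 20, 40],
    "max_depth": [2, 3, 5],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),  # stand-in for XGBClassifier
    param_distributions=param_distributions,
    n_iter=4,          # number of random combinations to try
    cv=3,              # 3-fold cross-validation per combination
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Random search samples a fixed number of combinations instead of trying every one, which keeps tuning affordable when each fit is expensive.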

Performance:

Accuracy on the training set: 100%
Accuracy on the test set: 99.218%

Classification report on 9 classes of Malware

Confusion matrix on the test set, with the number of correct predictions on the diagonal.

Confusion matrix on the test set, with the percentage of correct predictions on the diagonal.
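The two views of the confusion matrix can be sketched with NumPy. The labels below are toy values indexed from 0, and the percentages are computed per true class (each row sums to 100); whether the original figure was normalized by row or by column is an assumption.

```python
import numpy as np


def confusion_counts(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


def row_percentages(cm):
    """Express each row as percentages of that true class's samples."""
    row_sums = cm.sum(axis=1, keepdims=True)
    return 100.0 * cm / np.maximum(row_sums, 1)  # avoid dividing by zero


# Toy labels for 3 classes (the project has 9).
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 0]

cm = confusion_counts(y_true, y_pred, n_classes=3)
pct = row_percentages(cm)
print(cm)
print(np.round(pct, 1))
```

The diagonal of `cm` holds the correctly classified counts, and the diagonal of `pct` holds the share of each true class that was classified correctly.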

Phase-3 Video

Limitations and Future Scope

Data is the main limitation.
My model can only detect a malware file belonging to any one of the 9 classes.
If we have access to the labeled data of the other families of Malware files, we can extend our model to detect malware files belonging to other families of malware.

References:

Raff, Edward, Richard Zak, Russell Cox, Jared Sylvester, Paul Yacci, Rebecca Ward, Anna Tracy, Mark McLean, and Charles Nicholas. "An Investigation of Byte N-gram Features for Malware Classification." Journal of Computer Virology and Hacking Techniques 14.1 (2016).
Nataraj, L., S. Karthikeyan, G. Jacob, and B. S. Manjunath. "Malware Images." Proceedings of the 8th International Symposium on Visualization for Cyber Security - VizSec '11 (2011).
Masud, Mohammad M., Latifur Khan, and Bhavani Thuraisingham. "A Scalable Multi-level Feature Extraction Technique to Detect Malicious Executables." Information Systems Frontiers.
https://github.com/skshashankkumar41/Microsoft-Malware-Detection/blob/master/Microsoft_Malware_Detection.ipynb
