In this lab practice, we will develop a machine learning model for ransomware classification.
Here is a link to the Google Colab notebook containing the source code:
https://colab.research.google.com/drive/1jZ91DJrzhieBEEBogFFqy84C6Ta1wqNM?usp=sharing
The dataset that will be used for this lab practice can be found here:
https://data.mendeley.com/datasets/yhg5wk39kf/2
For this lab, the file "N10S10.csv" was downloaded from the N10S10 folder.
Dataset description
We will implement a KMeans clustering algorithm in a new Google Colab notebook. Each sample in the dataset contains 30 feature values that will be used to try to identify ransomware. In general, a sample contains T*3 features, where T is the number of one-second time intervals in the observation window. For each one-second interval, the traffic probe records three data points:
- Total number of short commands where the response is contained within the window.
- Total number of data (TCP bytes) in the packets sent from the server to the client that are not part of short commands.
- Total number of data (TCP bytes) in the packets sent from the client to the server that are not part of short commands.
These features represent control commands, read actions, and write actions, respectively. A single complete sample contains these three values for every second in a designated time window. Together, these vectors describe file activity over the network so that abnormal patterns can be detected.
The last column in the dataset is the label, which is either 1 or 0. A 1 identifies the sample as ransomware, while a 0 means that the sample is benign. "N10S10.csv" contains 24,733 benign samples and 14,261 ransomware samples.
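To make this layout concrete, the short sketch below reshapes one feature row into a (seconds, measurements-per-second) view. It assumes the 30 feature columns in "N10S10.csv" correspond to 10 one-second intervals with 3 measurements each; the row values here are placeholders for illustration only.
import numpy as np
# Hypothetical row: 30 feature values followed by the label (assumption: 10 s window x 3 measurements)
row = np.arange(31)
features, label = row[:-1], row[-1]
# View the 30 features as 10 one-second intervals x 3 measurements
# (short commands, server->client bytes, client->server bytes)
per_second = features.reshape(10, 3)
print(per_second.shape)  # (10, 3)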
In our first code cell, we will import all of the libraries that we wish to use. Copy and paste the following code into the first cell and click the run button.
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, precision_score, recall_score, f1_score
For the next cell, you will need to download the dataset, because we will load it into a pandas DataFrame. Copy and paste the following code into your second cell, update the path to where your dataset is located, and then run it.
data = pd.read_csv('Your/path/to/dataset.csv')
data.head()
This code should provide a result that looks like this:
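Optionally, you can also confirm the dataset dimensions and the class balance described earlier. This quick check assumes the label is the last column, as noted in the dataset description.
# Optional sanity check: dataset shape and label distribution
print(data.shape)
print(data.iloc[:, -1].value_counts())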
Next, the data is split into an X DataFrame containing the features and a y Series containing the label:
# Create X, which contains every column except the last (the features)
X = data.iloc[: , :-1]
# Create y, which is just the last column (the label)
y = data.iloc[: , -1]
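If you want to verify the split of columns, a one-line check is enough: X should have one fewer column than the original DataFrame, and y should be a single column of labels.
# Optional: confirm the feature/label split
print(X.shape, y.shape)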
Now the data should be split into training and testing sets with the train_test_split function. Copy and paste the following code into a new cell and run it.
#divide data into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
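If you want the split to be reproducible and to preserve the benign/ransomware ratio in both subsets, you can optionally pass random_state and stratify to train_test_split; this is not required for the lab, and the exact seed value here is just an example.
# Optional: reproducible, class-balanced split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)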
Now that the dataframe is properly split, it is time to apply the KMeans algorithm to the training set. Copy and paste the following code into a new cell to fit the KMeans clustering algorithm to the training data.
# Number of clusters the algorithm should create
n_clusters = 2
classifier = KMeans(n_clusters=n_clusters)
classifier = classifier.fit(X_train)
The n_clusters variable is used to tell the algorithm how many clusters it should make.
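After fitting, you can optionally inspect what KMeans learned. The fitted model exposes cluster_centers_ (one centroid per cluster) and inertia_ (the within-cluster sum of squared distances); this is just a quick look, not a required step.
# Optional: inspect the fitted clusters
print(classifier.cluster_centers_.shape)  # (n_clusters, number of features)
print(classifier.inertia_)                # within-cluster sum of squared distances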
Lastly, the performance of the algorithm should be evaluated to determine how effective it is on the dataset. Copy and paste the following code into a new cell to compute the confusion matrix, accuracy, precision, recall, and F1 score.
# Predict cluster assignments for the test set
y_pred = classifier.predict(X_test)
# Compare the predictions against the true labels
matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Confusion Matrix")
print(matrix)
print("The accuracy is: {:.2%}".format(accuracy))
print("The precision is: {:.2%}".format(precision))
print("The recall is: {:.2%}".format(recall))
print("The F1 score is: {:.2%}".format(f1))
This code should give a result similar to this:
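One caveat when reading these numbers: KMeans assigns arbitrary cluster IDs (0 or 1) that do not necessarily line up with the benign/ransomware labels, so the scores can appear inverted. A common fix is to map each cluster to the majority true label of its training points before scoring. The sketch below illustrates that idea; the mapping helper is an assumption for this tutorial, not part of the original lab code.
# Optional: align arbitrary KMeans cluster IDs with the true labels
# by assigning each cluster the majority class of its training samples.
train_clusters = classifier.predict(X_train)
cluster_to_label = {}
for c in range(n_clusters):
    members = y_train[train_clusters == c]
    # Majority vote; fall back to 0 if a cluster received no training points
    cluster_to_label[c] = members.mode().iloc[0] if len(members) > 0 else 0
y_pred_aligned = np.array([cluster_to_label[c] for c in classifier.predict(X_test)])
print("Aligned accuracy: {:.2%}".format(accuracy_score(y_test, y_pred_aligned)))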