In this lab practice, we will use the k-nearest neighbors (KNN) algorithm to detect user anomalies.
Here is a link to the Google Colab notebook containing the source code:
https://colab.research.google.com/drive/19xUhdOJtnhZcd9jG5IeK8dUl6F7h5gok?usp=sharing
The dataset that will be used for this lab practice can be found here:
Dataset description
We will be implementing our KNN classifier in a new Google Colab notebook. The dataset contains 30 feature columns that will be used to try to identify ransomware. Each sample has T*3 features, where T is the number of one-second time intervals in the window (so T = 10 for the 30 features here). For each one-second interval, the traffic probe records three data points:
- Total number of short commands where the response is contained within the window.
- Total amount of data (TCP bytes) in the packets sent from server to client that are not part of short commands.
- Total amount of data (TCP bytes) in the packets sent from client to server that are not part of short commands.
These features represent the control commands, read actions, and write actions, respectively. A single complete sample contains these three values for every second in a designated time window. Together, these vectors summarize file activity over the network so that abnormal behavior can be detected.
The last column in the dataset is the label, which is either 1 or 0. A 1 identifies the sample as ransomware or an anomaly, while a 0 means that the sample is benign, that is, normal activity in the system. The file "N10S10.csv" contains 24,733 benign samples and 14,261 ransomware samples.
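To make the layout concrete, here is a minimal sketch that builds a tiny synthetic table with the same shape as one row of the dataset: 3 features per second for T seconds, plus a label. The column names (cmd_0, read_0, write_0, ...), the per-second ordering of the columns, and the random values are assumptions for illustration only; the real CSV may use different headers and ordering.

```python
import numpy as np
import pandas as pd

T = 10  # number of one-second intervals per sample, so 3 * T = 30 features

# Hypothetical column names; the real CSV may label its columns differently.
columns = []
for t in range(T):
    columns += [f"cmd_{t}", f"read_{t}", f"write_{t}"]

# Random placeholder values, NOT real traffic measurements.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 1000, size=(6, 3 * T)), columns=columns)
toy["label"] = [0, 0, 0, 0, 1, 1]  # 1 = ransomware, 0 = benign

print(toy.shape)                    # (6, 31): 30 features plus the label
print(toy["label"].value_counts())  # how many samples fall in each class
```

Running value_counts() on the label column of the real dataset is a quick way to confirm the benign and ransomware counts quoted above.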
In our first code cell, we will import all of the libraries that we wish to use. Copy and paste the following code into the first cell and click the run button.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import neighbors
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
Before running the next cell, download the dataset, because the second cell loads it into a pandas DataFrame. Copy and paste the following code into your second cell, update the path so that it points to where your dataset is located, then run it.
df = pd.read_csv('Your/path/to/dataset.csv')
df.head()
This code should display the first five rows of the dataset.
Next, the dataset is split into features (X) and labels (y). X contains every column except the last, and y is the last column. After the data is divided into X and y, it is split into training and testing sets, with 20% of the samples held out for testing.
Copy and paste this code into a new cell and run it.
# Create X, which is all columns except for the last column
X = df.iloc[:, :-1]
# Create y, which is just the last column
y = df.iloc[:, -1]
# Divide the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
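Because train_test_split shuffles randomly, each run of the cell above gives a different split. The sketch below shows two optional arguments that are not used in the lab code: random_state, which makes the split repeatable, and stratify, which keeps the class ratio the same in both parts (the lab dataset is roughly 63% benign and 37% ransomware). The data here is a synthetic stand-in, and the names X_demo and y_demo and the seed 42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data: 100 samples, 30 features, 60/40 classes.
rng = np.random.default_rng(0)
X_demo = rng.random((100, 30))
y_demo = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.20,
    random_state=42,   # arbitrary seed so that reruns give the same split
    stratify=y_demo,   # keep the 60/40 class ratio in both train and test
)

print(X_tr.shape, X_te.shape)  # (80, 30) (20, 30)
print(y_te.mean())             # fraction of class 1 in the test set
```

With stratification, the 20 test samples contain 8 samples of class 1, matching the 40% share in the full synthetic set.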
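Next, we create a function to pick a value of K for our algorithm. Setting K to the square root of the number of samples is a common rule of thumb rather than a guaranteed optimum. This function is called isqrt, and it returns the integer square root of its argument, that is, the square root rounded down to a whole number.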
def isqrt(n):
    # Integer square root using Newton's method with integer division
    x = n
    y = (x + 1) // 2
    while y < x:
        x = y
        y = (x + n // x) // 2
    return x
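As a quick sanity check, the sketch below repeats the isqrt function and compares it against math.isqrt from the Python standard library (available in Python 3.8 and later), which computes the same rounded-down square root. The test values are arbitrary, except 38994, which is the total number of samples stated above (24,733 + 14,261).

```python
import math

def isqrt(n):
    # Same integer Newton iteration as the lab function above.
    x = n
    y = (x + 1) // 2
    while y < x:
        x = y
        y = (x + n // x) // 2
    return x

# The hand-written version should agree with the standard library.
for n in [1, 2, 15, 16, 17, 38994]:
    assert isqrt(n) == math.isqrt(n)

print(isqrt(38994))  # 197, since 197**2 = 38809 and 198**2 = 39204
```

In a notebook running Python 3.8 or later, math.isqrt could be used directly instead of defining the function by hand.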
The function is then called on the number of rows in the dataset to choose k. This k is then used to create and fit our KNN model.
k = isqrt(len(X))
knn = neighbors.KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)
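The square-root rule gives a reasonable starting value for k, but it can be worth checking a few alternatives. The sketch below is one possible way to do that with 5-fold cross-validation; it is not part of the original lab. It uses synthetic data from make_classification rather than the ransomware dataset, and the candidate values 5, 11, and 19 are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data with 30 features (a stand-in, not the lab dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=30, random_state=0)

scores = {}
for k in [5, 11, 19]:  # arbitrary candidate values, odd to avoid tied votes
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds
    scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_k = max(scores, key=scores.get)
print(scores)
print("Best k among the candidates:", best_k)
```

On the real data, the same loop could be run on X_train and y_train with the square-root value included among the candidates.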
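Lastly, to test the accuracy, we generate predictions for the test set and store them in y_pred, which is a NumPy array. These predictions are then compared against y_test to build the confusion matrix and compute the accuracy of the model. Copy and paste the following code into a new cell and run the cell.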
y_pred = knn.predict(X_test)
matrix = confusion_matrix(y_test, y_pred)
accuracy = np.trace(matrix) / float(np.sum(matrix))
acc = accuracy_score(y_test, y_pred)
print(acc)
print("Confusion Matrix")
print(matrix)
print("The accuracy is: {:.2%}".format(accuracy))
The output of this cell will be the accuracy as a decimal, followed by the confusion matrix, with the accuracy as a percentage below it.
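To help read the confusion matrix, here is a small sketch on made-up labels (these are not results from the lab dataset). For labels 0 and 1, scikit-learn puts the true classes in rows and the predicted classes in columns, so the four cells are the true negatives, false positives, false negatives, and true positives. The names y_true and y_hat are illustrative only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of flagged samples that really are ransomware
recall = tp / (tp + fn)     # share of ransomware samples that were flagged
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)       # 3 1 2 2
print(precision, recall, accuracy)
```

Here the accuracy is 5/8 = 0.625, but the recall is only 0.5, which shows why the individual cells of the matrix are worth inspecting alongside the single accuracy number.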