In this lab practice, we will use decision trees to detect malicious websites.
Here is a link to the Google Colab notebook containing the source code:
https://colab.research.google.com/drive/1FFjxdOBcORH4o1Tp4PPbfUu2pXv81Fwj?usp=sharing
The dataset that will be used for this lab practice can be found here:
https://www.kaggle.com/xwolf12/malicious-and-benign-websites
Dataset description
In this module, we will apply a decision tree classifier to malicious website detection. The dataset's attributes include URL, URL_LENGTH, NUMBER_SPECIAL_CHARACTERS, SERVER, CONTENT_LENGTH, WHOIS_COUNTRY, WHOIS_STATEPRO, and many more. We will implement our decision tree in a new Google Colab notebook. The dataset contains many attributes that help us decide whether a link is malicious or benign; among others, we will look at the length of the URL, the number of special characters in the URL, and whether or not the country of registration is the US.
All the attributes in the dataset are listed below (a quick way to verify them in your own copy is sketched after the list):
URL: the anonymized identifier of the URL analyzed in the study
URL_LENGTH: the number of characters in the URL
NUMBER_SPECIAL_CHARACTERS: the number of special characters identified in the URL, such as "/", "%", "#", "&", ".", "="
CHARSET: a categorical value giving the character encoding standard (also called the character set)
SERVER: a categorical value giving the operating system of the server, obtained from the packet response
CONTENT_LENGTH: the content size reported in the HTTP header
WHOIS_COUNTRY: a categorical variable; the country obtained from the server response (specifically, via the WHOIS API)
WHOIS_STATEPRO: a categorical variable; the state obtained from the server response (specifically, via the WHOIS API)
WHOIS_REGDATE: the server registration date provided by WHOIS, with date values in the format DD/MM/YYYY HH:MM
WHOIS_UPDATED_DATE: the last update date of the analyzed server, obtained through WHOIS
TCP_CONVERSATION_EXCHANGE: the number of TCP packets exchanged between the server and our honeypot client
DIST_REMOTE_TCP_PORT: the number of ports detected that are different from TCP
REMOTE_IPS: the total number of IPs connected to the honeypot
APP_BYTES: the number of bytes transferred
SOURCE_APP_PACKETS: packets sent from the honeypot to the server
REMOTE_APP_PACKETS: packets received from the server
APP_PACKETS: the total number of IP packets generated during the communication between the honeypot and the server
DNS_QUERY_TIMES: the number of DNS packets generated during the communication between the honeypot and the server
TYPE: a categorical variable whose values represent the type of web page analyzed: 1 for malicious websites and 0 for benign websites
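If you would like to verify the column names and see how many sites fall into each class before starting the lab, the following optional, standalone sketch will do it. It is not one of the lab's numbered cells, and 'dataset.csv' is just a placeholder for wherever you saved the downloaded file.
import pandas as pd

# Optional sanity check: list the columns and count benign (0) vs. malicious (1) sites
preview = pd.read_csv('dataset.csv')   # placeholder path; point it at your copy of the CSV
print(preview.columns.tolist())
print(preview['Type'].value_counts())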
In our first code cell, we will import all of the libraries that we wish to use. Copy and paste the following code into the first cell and click the run button.
import pandas as pd
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np
import graphviz
For our second cell, you will need to download the dataset from the Kaggle link above, because we will be loading it into a pandas dataframe. Copy and paste the following code into your second cell, update the path to point at where your dataset is located, and then run it.
data = pd.read_csv('Your/path/to/dataset.csv')
data.head()
This code should provide a result that looks like this:
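If the table preview is hard to read, an optional check (not part of the original notebook) is data.info(), which lists every column along with its dtype and non-null count; it also makes it clear why 'CONTENT_LENGTH' gets dropped in the next step.
# Optional: text summary of the dataframe (columns, dtypes, non-null counts)
data.info()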
Next, we are going to remove the categorical columns from our dataframe (except 'WHOIS_COUNTRY', which we will encode in the next step), drop the 'CONTENT_LENGTH' column because it has too many null values, and then drop any rows that still contain NA values. Copy and paste this code into a new cell and run it.
# Drop the identifier column, the sparse CONTENT_LENGTH column, and the categorical columns we won't use
data.drop(['URL', 'CONTENT_LENGTH', 'CHARSET', 'SERVER', 'WHOIS_STATEPRO', 'WHOIS_REGDATE', 'WHOIS_UPDATED_DATE'], axis=1, inplace=True)
# Drop any remaining rows with NA values, then confirm nothing is left
data.dropna(inplace=True)
print(data.isnull().sum())
The output should be a count of the NA values remaining in each column of our dataframe (all zeros after the dropna), and should look like this:
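If you are curious how many rows survived the cleanup, another optional check (not in the original notebook) is:
# Optional: number of rows and columns remaining after the drops
print(data.shape)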
Next, we would like to turn the 'WHOIS_COUNTRY' column into usable numeric data. Since the majority of URLs in our dataset originate from the USA, we will create a new column, 'IS_USA', that is 1 when the country is the US and 0 otherwise. Copy and paste the following code into a new cell and run it.
# Boolean mask that is True where the WHOIS country is the US
US_Filter = (data['WHOIS_COUNTRY'] == 'US')
# Create the new column and fill it with 0 or 1 based on the mask
data['IS_USA'] = ''
data.loc[~US_Filter, ['IS_USA']] = 0
data.loc[US_Filter, ['IS_USA']] = 1
# The original country column is no longer needed
data.drop('WHOIS_COUNTRY', axis=1, inplace=True)
data.head()
The output should show the top 5 rows of our dataframe, with a new column 'IS_USA' added at the end. In addition, there should no longer be a 'WHOIS_COUNTRY' column. It should look like this:
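As an aside, the same encoding can be written as a single vectorized line. This is only an equivalent alternative to the cell above (use one or the other, not both), with the small advantage that the new column is created as an integer dtype:
# Alternative to the cell above: boolean comparison cast directly to 0/1
data['IS_USA'] = (data['WHOIS_COUNTRY'] == 'US').astype(int)
data.drop('WHOIS_COUNTRY', axis=1, inplace=True)
data.head()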
Next, we are going to make our features dataframe, X, and our labels, y. Our features dataframe will contain: 'URL_LENGTH', 'NUMBER_SPECIAL_CHARACTERS', 'TCP_CONVERSATION_EXCHANGE', 'DIST_REMOTE_TCP_PORT', 'REMOTE_IPS', 'APP_BYTES', 'SOURCE_APP_PACKETS', 'REMOTE_APP_PACKLETS', 'SOURCE_APP_BYTES', 'REMOTE_APP_BYTES', 'APP_PACKETS', 'DNS_QUERYTIMES', and 'IS_USA'. Our labels will just be the 'Type' column. Once we have these two, we will split the data into training and test sets and train our decision tree. Copy and paste the following code into a new cell and run it.
# Features are every column except the label; the label is the 'Type' column
X = data.drop('Type', axis=1)
y = data['Type']
# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
# Train the decision tree on the training split and draw it
classifier = tree.DecisionTreeClassifier()
classifier = classifier.fit(X_train, y_train)
tree.plot_tree(classifier)
The output should show the decision tree being built, and should have a small visualization of our decision tree at the bottom. It will look similar to this.
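By default, DecisionTreeClassifier keeps splitting until its leaves are pure, which can produce a very large tree and may overfit the training data. If you want a smaller, reproducible tree for the visualization steps below, one optional variation (not part of the original lab) is to cap the depth and fix the random seed; max_depth and random_state are standard scikit-learn parameters:
# Optional variation: a shallower, reproducible tree
classifier = tree.DecisionTreeClassifier(max_depth=4, random_state=42)
classifier = classifier.fit(X_train, y_train)
tree.plot_tree(classifier)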
Next, we would like to actually be able to view our decision tree, so we will use graphviz to create a PDF version of the decision tree that is easier to view. Copy and run the following code in a new cell to make a PDF of our decision tree.
# Export the tree to Graphviz DOT format, then render it to a PDF file
dot_data = tree.export_graphviz(classifier, out_file=None)
graph = graphviz.Source(dot_data)
graph.render("dtree")
This will output a PDF file ('dtree.pdf') in the Files panel on the left, which can be downloaded and viewed. Below is an example of what my decision tree looks like:
Here is a close up of some of the individual nodes:
As you can see, the splits near the top of our decision tree are the most informative ones: each split is chosen to give the largest reduction in Gini impurity. Here is a key to find out which attribute each decision node is testing:
X[0] = 'URL_LENGTH'
X[1] = 'NUMBER_SPECIAL_CHARACTERS'
X[2] = 'TCP_CONVERSATION_EXCHANGE'
X[3] = 'DIST_REMOTE_TCP_PORT'
X[4] = 'REMOTE_IPS'
X[5] = 'APP_BYTES'
X[6] = 'SOURCE_APP_PACKETS'
X[7] = 'REMOTE_APP_PACKLETS'
X[8] = 'SOURCE_APP_BYTES'
X[9] = 'REMOTE_APP_BYTES'
X[10] = 'APP_PACKETS'
X[11] = 'DNS_QUERYTIMES'
X[12] = 'IS_USA'
This key is derived by cross-referencing the number inside X[N] with the column order from our fourth cell's output, after removing the 'Type' column. As we can see in this model, 'NUMBER_SPECIAL_CHARACTERS' is the best predictor of whether or not a link will be malicious.
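Rather than maintaining this key by hand, you can also pass the column names straight to export_graphviz so each node is labeled with the attribute it tests. This is an optional refinement of the earlier graphviz cell; feature_names, class_names, and filled are standard export_graphviz parameters:
# Optional: label each node with its column name and color nodes by class
dot_data = tree.export_graphviz(classifier, out_file=None,
                                feature_names=list(X.columns),
                                class_names=['benign', 'malicious'],
                                filled=True)
graph = graphviz.Source(dot_data)
graph  # in Colab, a Source object on the last line of a cell should also render inline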
Lastly, we would like to check the accuracy of our decision tree. To do this, we will use our model to predict labels for the test set. After that, we will build a confusion matrix and use it to find the accuracy of our model. Copy and paste the following code into a new cell and then run it to see the accuracy and confusion matrix for the decision tree:
from sklearn.metrics import confusion_matrix

# Predict labels for the held-out test set
y_pred = classifier.predict(X_test)
# The confusion matrix counts correct and incorrect predictions per class;
# accuracy is the fraction of predictions on the diagonal
matrix = confusion_matrix(y_test, y_pred)
accuracy = np.trace(matrix) / float(np.sum(matrix))
print("Confusion Matrix")
print(matrix)
print("The accuracy is: {:.2%}".format(accuracy))
Below is an example of the output of the code:
As we can see, our accuracy for this model is 92.98%.
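Because malicious sites are a minority of the dataset, overall accuracy can look high even when a fair number of malicious sites are missed. An optional follow-up, not part of the original lab, is to print per-class precision and recall as well:
# Optional: per-class precision, recall, and F1, which are more informative than accuracy alone on imbalanced data
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=['benign', 'malicious']))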