After completing this learning module, students will be able to:
Describe what a malicious web application is.
Explain why the decision tree algorithm is beneficial for malicious website detection.
Apply the decision tree algorithm to identify and detect malicious web applications.
The concern of Malicious web applications
Malicious web applications are a major security threat. These applications typically claim to be something that they are not, and as soon as they are clicked, they compromise the computer. Machine learning can help identify these malicious links and filter them out in order to protect users.
What are Malicious web applications?
Web applications are accessed through a web browser by typing in a URL or clicking a link or shortcut to a webpage. Cybercriminals can create web applications that claim to perform a specific task but, once clicked, hijack the user's browser. Once the browser is hijacked, many different attacks become possible. The most common misdeed these applications perform is downloading other malware off the internet without notifying the user.
The most frequent type of malicious link is one that looks nearly identical to a legitimate link. Here are some examples of what that may look like:
www.ammazon.com
www.twittter.com
www.linstagram.com
www.youtobe.com
It may not be obvious that these are fake links because they are intended to be deceptive. If you look closely, however, each of these links differs slightly from the legitimate address most users expect at first glance, and as a result it will not take you to the intended webpage. These malicious web applications exist to capture your personal data or to trick you into accidentally downloading malware or a virus.
Therefore, it is important to be able to detect whether a web application is malicious or not. Given a URL, the task is to classify it as either a legitimate link or a malicious one.
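As a rough illustration of that task, here is a minimal Python sketch (not part of this module's official code) that turns a URL into a few numeric features a classifier such as a decision tree could later split on; the specific features chosen here, such as URL length and digit count, are illustrative assumptions rather than a standard feature set.

from urllib.parse import urlparse

def url_features(url):
    # Add a scheme separator if missing so urlparse can find the host.
    parsed = urlparse(url if "//" in url else "//" + url)
    host = parsed.netloc or parsed.path
    return [
        len(url),                          # overall URL length
        sum(ch.isdigit() for ch in url),   # number of digits in the URL
        url.count("-"),                    # hyphens are common in look-alike domains
        host.count("."),                   # number of dot-separated labels in the host
    ]

print(url_features("http://secure-login-update.example.com/verify?id=12345"))
print(url_features("https://www.amazon.com"))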
Decision Trees
Decision trees can be used to solve both regression and classification problems. Decision trees use a tree representation to solve problems, and each leaf of the tree is a classification, while each internal node of the tree represents an attribute. The algorithms used to create decision trees work from a top-down approach, by choosing a variable at each step that most effectively splits the set of items.
Below is an example of the structure of a decision tree:
The orange leaf nodes are the classifications that are made from the blue decision nodes, or the black root node that is also a decision node. With this model it is easy to tell that some decisions are more impactful than others.
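For readers who want to experiment, here is a minimal sketch using scikit-learn's DecisionTreeClassifier on made-up toy data; the printed tree has the same structure described above, with attribute tests at the internal nodes and classifications at the leaves.

from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up toy data: two numeric attributes per sample and a binary label.
X = [[1, 5], [2, 3], [7, 8], [6, 1], [8, 9], [3, 2]]
y = [0, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Internal nodes test an attribute against a threshold; leaves hold the class.
print(export_text(tree, feature_names=["attr_1", "attr_2"]))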
Decision tree with Gini Index
Gini Index is a metric that measures how often a randomly chosen element would be incorrectly classified when splitting on a given attribute. Therefore, an attribute with a lower Gini index is preferred. The Gini index of an attribute falls between 0 and 1. A Gini index of 0 means that all the elements at that decision can be perfectly predicted, while a Gini index of 1 signifies that the elements at that decision are randomly distributed across classes.
The Gini Index is calculated as: Gini = 1 - Σ (Pi)^2, where the sum runs over i = 1 to N.
In this equation,
The left-hand side is the Gini Index.
Pi is the probability of an element being classified into class i.
N is the number of classes.
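As a small sketch, the formula above can be written as a short Python function that takes the class counts at a node:

# counts holds the number of elements of each class at a decision node.
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([10, 0]))   # 0.0 -> the class is perfectly predictable
print(gini([5, 5]))    # 0.5 -> a two-class node that is evenly mixed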
Here is an example in which we will use the Gini Index to find the optimal decision tree, based on the following sample dataset:
This sample contains 5 attributes: A, B, C, D, and E. Attribute E is what we would like to predict; as such, it only has two possible values, TRUE or FALSE. The data set contains 20 entries, with 12 of the E values being TRUE and 8 of the E values being FALSE.
With this example, we would like to calculate the Gini index of each of the attributes A, B, C, and D, and build a decision tree based on each attribute's Gini score.
The Gini Index requires us to choose a threshold value to split each attribute, so the thresholds we will choose for this data set are:
A >= 1.0, B >= 3.0, C >= 5.0, D >= 11.0
First, we are going to calculate the Gini Index for Var A, such that Value >= 1.
There are a total of 13 values in A that are >=1.
Attribute A >= 1 & class = TRUE: 8/13
Attribute A >= 1 & class = FALSE: 5/13
Gini(8,5) = 1 - [(8/13)^2+(5/13)^2] = .4734
There are a total of 7 values in A that are < 1.
Attribute A < 1 & class = TRUE: 4/7
Attribute A < 1 & class = FALSE: 3/7
Gini(4,3) = 1 - [(4/7)^2+(3/7)^2] =.4898
Now we take the weighted sum of the two Gini indices for A.
(13/20)*.4734 + (7/20)*.4898 = .4791
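For anyone who wants to check the arithmetic, here is a short Python sketch that reproduces the attribute A numbers directly from the class counts above:

# Class counts for A >= 1 (8 TRUE, 5 FALSE) and A < 1 (4 TRUE, 3 FALSE),
# taken from the worked example above.
g_ge = 1 - ((8 / 13) ** 2 + (5 / 13) ** 2)
g_lt = 1 - ((4 / 7) ** 2 + (3 / 7) ** 2)
weighted = (13 / 20) * g_ge + (7 / 20) * g_lt
print(round(g_ge, 4), round(g_lt, 4), round(weighted, 4))   # 0.4734 0.4898 0.4791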
Next, we are going to calculate the Gini Index for Var B, such that Value >= 3.
There are a total of 12 values in B that are >=3.
Attribute B >= 3 & class = TRUE: 10/12
Attribute B >= 3 & class = FALSE: 2/12
Gini(10,2) = 1 - [(10/12)^2+(2/12)^2] = .2778
There are a total of 8 values in B that are < 3.
Attribute B < 3 & class = TRUE: 2/8
Attribute B < 3 & class = FALSE: 6/8
Gini(2,6) = 1 - [(2/8)^2+(6/8)^2] = .375
Now we take the weighted sum of the two Gini indices for B.
(12/20)* .2778 + (8/20)*.375 = .3167
Next, we are going to calculate the Gini Index for Var C, such that Value >= 5.
There are a total of 10 values in C that are >= 5.
Attribute C >= 5 & class = TRUE: 4/10
Attribute C >= 5 & class = FALSE: 6/10
Gini(4,6) = 1 - [(4/10)^2+(6/10)^2] = .48
There are a total of 10 values in C that are < 5.
Attribute C < 5 & class = TRUE: 8/10
Attribute C < 5 & class = FALSE: 2/10
Gini(8,2) = 1 - [(8/10)^2+(2/10)^2] =.32
Now we take the weighted sum of the two Gini indices for C.
(10/20)*.48 + (10/20)*.32 = .4
Lastly, we are going to calculate the Gini Index for Var D, such that Value >= 11.
There are a total of 9 values in D that are >=11.
Attribute D >= 11 & class = TRUE: 2/9
Attribute D >= 11 & class = FALSE: 7/9
Gini(2,7) = 1 - [(2/9)^2+(7/9)^2] = .3457
There are a total of 11 values in D that are < 11.
Attribute D < 11 & class = TRUE: 10/11
Attribute D < 11 & class = FALSE: 1/11
Gini(10,1) = 1 - [(10/11)^2+(1/11)^2] = .1653
Now we take the weighted sum of the two Gini indices for D.
(9/20)*.3457 + (11/20)*.1653 = .2465
We now have all the Gini index values for our attributes as follows:
Gini Index of A: .4791
Gini Index of B: .3167
Gini Index of C: .4
Gini Index of D: .2465
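The following Python sketch recomputes all four weighted Gini scores from the class counts used in the worked example and picks the attribute with the lowest score:

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# For each attribute: [TRUE, FALSE] counts above the threshold, then below it.
splits = {
    "A": ([8, 5], [4, 3]),
    "B": ([10, 2], [2, 6]),
    "C": ([4, 6], [8, 2]),
    "D": ([2, 7], [10, 1]),
}

scores = {}
for name, (ge, lt) in splits.items():
    scores[name] = (sum(ge) / 20) * gini(ge) + (sum(lt) / 20) * gini(lt)
    print(name, round(scores[name], 4))

print("best split:", min(scores, key=scores.get))   # D has the lowest weighted Gini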
From this, we are able to tell that attribute D does the best job predicting the value of E, and attribute A does the worst job of predicting the value of E. From these Gini index scores, we are able to create an optimized decision tree.
The reason that we removed attributes A and C from our decision tree is that their Gini indices were high enough that those splits are not much better than random. It would be bad to include near-random splits in a model that is supposed to predict a binary output.
Decision tree with Information Gain
Decision trees can also be formed using a metric called Information Gain, which measures how much information is gained from splitting on each attribute. Attributes with a higher Information Gain are placed higher in the decision tree. Information gain is measured by subtracting the entropy of a split from the initial entropy of the set, where entropy is a measure of the randomness of a set. In other words, decision trees that use information gain rely on entropy to choose how to split the data.
The entropy of a set is calculated as: E = -Σ Pi log2(Pi). In this equation, Pi is the probability of randomly selecting an element in class i. For example, if we have a dataset with 5 reds and 3 blues, the entropy of this set is E = -(pr log2(pr) + pb log2(pb)) = -((5/8) log2(5/8) + (3/8) log2(3/8)) ≈ 0.954.
The information gain of a split in a decision tree is quantified as: Gain = E(parent) - E(children), where E(parent) is the entropy of the parent node and E(children) is the weighted average entropy of the child nodes. This equation makes sense, because our goal is to remove entropy from the parent node.
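Here is a small Python sketch of both formulas; the parent counts (5 reds and 3 blues) match the example above, while the child split is a made-up illustration:

from math import log2

# counts holds the number of elements of each class; zero counts are skipped.
def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

parent = entropy([5, 3])                 # ~0.954 for 5 reds and 3 blues
left, right = [4, 0], [1, 3]             # a hypothetical split of the 8 elements
e_children = (4 / 8) * entropy(left) + (4 / 8) * entropy(right)
gain = parent - e_children
print(round(parent, 3), round(gain, 3))  # 0.954 0.549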
The differences between Gini Index and Information Gain
Gini Index is measured by subtracting the sum of the squared class probabilities at each decision from one. Information Gain relies on entropy, which is calculated by multiplying the probability of each class by the log base 2 of that class probability and summing the results; the gain is the drop in entropy after the split.
Gini index favors larger distributions and is easy to implement, whereas information gain works better with smaller distributions with multiple distinct values.
Gini index gives results in a binary manner, and is only capable of performing binary splits. Information gain, on the other hand, measures the entropy difference before and after splitting the data to show the impurity of the class variables.
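In libraries such as scikit-learn, switching between the two criteria is a single parameter; the brief sketch below, on made-up data, changes only the criterion argument (the "entropy" criterion corresponds to information gain):

from sklearn.tree import DecisionTreeClassifier

# Made-up toy data; only the splitting criterion changes between the two runs.
X = [[1, 5], [2, 3], [7, 8], [6, 1], [8, 9], [3, 2]]
y = [0, 0, 1, 0, 1, 0]

for criterion in ("gini", "entropy"):    # "entropy" corresponds to information gain
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0).fit(X, y)
    print(criterion, tree.get_depth(), tree.predict([[7, 7]]))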
Pros and Cons of Decision Trees
Pros
Simple to understand and interpret
Can handle both numerical and categorical data
Can handle large datasets efficiently
Cons
Small changes to the training data can result in a large change to the tree.
The tree's average depth is not guaranteed to be minimal.