After completing this learning module, students will be able to:
Describe website phishing and its real-world impact.
Explain the Decision Tree algorithm that can be used for detecting website phishing.
Apply the Decision Tree classification algorithm to detect phishing websites.
Decision tree classifiers are among the best-known approaches to data classification. A decision tree is a tree-structured model in which each path from the root to a leaf node represents a sequence of tests that split the data until a conclusion is reached (a Boolean conclusion in the two-class case). Because of its high precision, refined splitting criteria, and improved tree-pruning procedures, the decision tree methodology is efficient and gives a clear view of classification results. Well-established decision tree algorithms include ID3, C4.5, CART, CHAID, and QUEST. Figure 1 shows the basic structure of a decision tree, while Figure 2 illustrates a decision tree generated by the C4.5 algorithm.
Figure 1: Illustration of Basic Structure of Decision Tree
Figure 2: Decision tree generated by the C4.5 algorithm
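To make the idea of "splitting" concrete, the following minimal sketch computes entropy and information gain, the criterion ID3 uses to choose a split (C4.5 uses a normalized variant called gain ratio). The toy labels and the two candidate splits are hypothetical and are not taken from the module's dataset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(labels, left, right):
    """Entropy reduction obtained by splitting `labels` into `left` and `right`."""
    total = len(labels)
    weighted = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(labels) - weighted


# Hypothetical toy labels: 1 = phishing, 0 = legitimate
labels = [1, 1, 1, 0, 0, 0]

# Candidate split A separates the two classes perfectly
perfect_gain = information_gain(labels, [1, 1, 1], [0, 0, 0])

# Candidate split B leaves both groups mixed
mixed_gain = information_gain(labels, [1, 1, 0], [1, 0, 0])

print("gain of split A:", perfect_gain)
print("gain of split B:", mixed_gain)
```

A tree-building algorithm of this kind would prefer split A, because it removes more uncertainty about the class label.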
The tremendous growth in Internet and online service usage has led to a significant increase in the number of web attacks. Phishing is a web attack in which phishers attempt to obtain sensitive information from users for fraudulent purposes. Phishing websites are typically disguised as trustworthy websites in order to earn the trust of their victims, and malicious actors use them to collect sensitive information such as passwords or credit card numbers. A decision tree algorithm uses features of a site to assess whether a site that a user requests is a phishing site, so we can train a decision tree classifier to detect phishing websites.
Website phishing is a social engineering technique that imitates legitimate uniform resource locators (URLs) and web pages. A phishing URL, also known as a fraudulent domain, appears suspicious for a variety of reasons, including (i) misspellings, (ii) pointing to the wrong top-level domain, (iii) combining a valid and a fraudulent URL, (iv) being extremely long, and (v) simply being an IP address, among others. Through such phishing websites, the victim's sensitive information is transmitted to the attacker. A decision tree algorithm can learn the criteria for recognizing fraudulent URLs from a labeled dataset of fake and real URLs, and then classify URLs as phishing or benign.
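Some of the warning signs listed above can be turned into numeric features that a decision tree can split on. The sketch below uses only the Python standard library. The function names, the feature names, and the choice of features (including the "@" check, which is not mentioned above) are illustrative assumptions, not the feature set of the dataset used later in this module.

```python
import ipaddress
from urllib.parse import urlparse


def host_is_ip(host):
    """Return True if the host part of a URL is a bare IP address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_url_features(url):
    """Map a URL string to a small dictionary of numeric features (hypothetical set)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),               # very long URLs can be suspicious
        "host_is_ip": int(host_is_ip(host)),  # URL that is simply an IP address
        "num_dots_in_host": host.count("."),  # many subdomains may hide the real domain
        "has_at_symbol": int("@" in url),     # assumption: "@" can obscure the real host
    }


# Example URLs are made up for illustration
print(extract_url_features("http://192.168.10.5/login/update-account"))
print(extract_url_features("https://www.example.com/"))
```

Each dictionary can then serve as one row of a feature table, with a label column marking the URL as phishing or benign.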
Build your first program using a decision tree:
First, import the basic packages in Google Colab (https://colab.research.google.com):
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
Download the dataset from https://www.kaggle.com/siddharthkumar25/malicious-and-benign-urls and make the file urldata.csv available in your Colab working directory.
Run the following code:
# Loading the data
data0 = pd.read_csv('urldata.csv')
data0.head()
# Checking the shape of the dataset
data0.shape
# Listing the features of the dataset
data0.columns
# Information about the dataset
data0.info()
# Plotting the data distribution
data0.hist(bins=50, figsize=(15, 15))
plt.show()
After plotting the data distribution, the output is a set of histograms, one per numeric feature, showing how the data is distributed. Plots like these are a first step toward understanding how the features relate to each other. The hands-on session covers the complete program.
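As a preview of the training step, here is a minimal sketch that fits a decision tree to a handful of made-up feature rows. It assumes scikit-learn is available (it is not among the packages imported above), and the feature columns and values are hypothetical rather than taken from urldata.csv. Note that scikit-learn's DecisionTreeClassifier implements an optimized version of CART, not C4.5.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature rows: [url_length, host_is_ip, num_dots]
X = [
    [20, 0, 1], [25, 0, 2], [30, 0, 1], [22, 0, 2], [28, 0, 1], [35, 0, 2],    # benign
    [75, 1, 4], [90, 0, 5], [60, 1, 3], [110, 0, 6], [85, 1, 5], [70, 0, 4],   # phishing
]
# Labels: 0 = benign, 1 = phishing
y = [0] * 6 + [1] * 6

# criterion="entropy" selects splits by information gain; random_state fixes tie-breaking
model = DecisionTreeClassifier(criterion="entropy", random_state=0)
model.fit(X, y)

# Classify two unseen (also made-up) URLs described by the same three features
new_urls = [[24, 0, 1], [95, 1, 5]]
predictions = model.predict(new_urls)
print(predictions)
```

With real data, the feature table would normally be split into training and test sets so that accuracy is measured on URLs the tree has not seen.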
References:
L. Machado and J. Gadge, "Phishing Sites Detection Based on C4.5 Decision Tree Algorithm," 2017 International Conference on Computing, Communication, Control and Automation (ICCUBEA), 2017, pp. 1-5, doi: 10.1109/ICCUBEA.2017.8463818.
Jijo, Bahzad & Mohsin Abdulazeez, Adnan. (2021). Classification Based on Decision Tree Algorithm for Machine Learning. Journal of Applied Science and Technology Trends. 2. 20-28.
Nicolas Papernot (2016). Detecting phishing websites using a decision tree. Medium. https://medium.com/@NicolasPapernot/detecting-phishing-websites-using-a-decision-tree-ed069d073723
Phishing Website Detection by Machine Learning Techniques. https://colab.research.google.com/github/shreyagopal/Phishing-Website-Detection-by-Machine-Learning-Techniques/blob/master/Phishing%20Website%20Detection_Models%20%26%20Training.ipynb#scrollTo=WTVY5lz4vJQM