The dataset in this module is used for phishing website detection. It has the following attributes: SFH, popUpWindow, Request_URL, URL_of_Anchor, web_traffic, URL_Length, age_of_domain, and having_IP_Address.
In this module, we will implement a decision tree for phishing website detection in a new Google Colab notebook.
Copy and paste the following link into your browser to open Google Colab:
https://colab.research.google.com/notebooks/welcome.ipynb
Then click File --> New notebook
Click the red box area on the page and change the file name to Websitephishing.ipynb
Next, click Runtime, choose Change runtime type, and set Hardware accelerator to GPU. (This step is optional for this module: scikit-learn trains the decision tree on the CPU, so the GPU will not make this code run faster.)
In the first code cell, copy and paste the following code to upload the dataset to Google Colab:
from google.colab import files
uploaded = files.upload()
Then click the run button (it looks like a play button) to run this code cell.
After the cell executes successfully, you should see a result like the following picture. Next, click Choose Files and upload the website phishing dataset (Website Phishing.csv) to Google Colab.
This is the link for the website phishing dataset. You need to agree to the terms and conditions, and a Gmail account is required.
https://www.kaggle.com/ahmednour/website-phishing-data-set
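As a quick sanity check after uploading, the sketch below peeks at the header row of the uploaded file. In Colab, files.upload() returns a dictionary that maps each file name to its contents as bytes. Here a small stand-in dictionary is used so the snippet also runs outside Colab; the helper name header_of is hypothetical, and the sample data row is a made-up placeholder, not real data.

```python
import csv
import io

# Stand-in for the dictionary returned by files.upload() in Colab.
# The header reuses three attribute names from this module; the data row is a placeholder.
uploaded = {"Website Phishing.csv": b"SFH,popUpWindow,Request_URL\n1,-1,0\n"}

def header_of(uploaded_files, name):
    """Return the first row (the column names) of an uploaded CSV file."""
    text = uploaded_files[name].decode("utf-8")
    return next(csv.reader(io.StringIO(text)))

columns = header_of(uploaded, "Website Phishing.csv")
print(columns)
```

In the notebook, drop the stand-in assignment and pass the real uploaded variable from the previous cell instead.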
Next, create a new cell, copy and paste the following code, and run it.
This code reads the data from the CSV file and splits the dataset into a training set and a test set. The test size is 0.25, which means the training set is 75% of the whole dataset and the test set is 25%.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
# Load the uploaded CSV file into a DataFrame
dataset = pd.read_csv("Website Phishing.csv", sep=",")
# Column positions 1 to 8 (zero-based) go into X; column position 9 goes into y
X = dataset.iloc[:, 1:9]
y = dataset.iloc[:, 9]
# Hold out 25% of the rows as the test set; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
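To see what a test size of 0.25 does in practice, here is a minimal sketch on synthetic data. The arrays X_demo and y_demo are made up for illustration and have nothing to do with the phishing dataset; only the train_test_split call mirrors the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 2 feature columns, alternating binary labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
print(X_tr.shape, X_te.shape)

# Calling the function again with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
print(np.array_equal(X_te, X_te2))
```

With 100 rows, 75 end up in the training set and 25 in the test set. Fixing random_state is what makes the reported accuracy repeatable from run to run.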
We will use the decision tree algorithm to train a model on our dataset. Create a new cell, copy and paste the following code, and run it.
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion ='entropy', random_state = 0)
classifier.fit(X_train, y_train)
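The setting criterion='entropy' tells the tree to pick splits by information gain, that is, by how much a split reduces the Shannon entropy of the labels. The sketch below works through that arithmetic by hand. It is an illustration only, not scikit-learn's implementation; the function names and the label values are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(
        (count / total) * math.log2(total / count)
        for count in Counter(labels).values()
    )

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Made-up labels: a perfectly mixed node split into two pure children.
parent = [1, 1, -1, -1]
left, right = [1, 1], [-1, -1]
print(entropy(parent))
print(information_gain(parent, left, right))
```

A node with an even mix of two classes has an entropy of 1 bit, and a split that separates the classes completely recovers all of it. At each node, the tree prefers the split with the largest gain.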
Finally, we evaluate the trained model on the test set, calculate the accuracy, and build the confusion matrix.
Copy and paste the following code and then run it. The result should be similar to the following picture.
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
matrix = confusion_matrix(y_test, y_pred)
accuracy = np.trace(matrix) / float(np.sum(matrix))
print("Confusion Matrix")
print(matrix)
print("The accuracy is: {:.2%}".format(accuracy))
As we can see, the accuracy is 82.01%.
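The accuracy formula used above, the trace of the confusion matrix divided by the sum of all its entries, works because the diagonal counts the correct predictions. Here is a small sketch with made-up labels (not taken from the phishing dataset) that builds the matrix by hand and checks the formula against a direct count. Rows are true labels and columns are predicted labels, the same layout that confusion_matrix uses.

```python
import numpy as np

# Made-up labels for illustration only.
y_true = np.array([1, 1, -1, -1, 1, -1, 1, -1])
y_pred = np.array([1, -1, -1, -1, 1, 1, 1, -1])

classes = np.unique(y_true)  # sorted class labels: [-1, 1]
index = {label: i for i, label in enumerate(classes)}

# Rows are true labels, columns are predicted labels.
matrix = np.zeros((len(classes), len(classes)), dtype=int)
for t, p in zip(y_true, y_pred):
    matrix[index[t], index[p]] += 1

accuracy_from_matrix = np.trace(matrix) / float(np.sum(matrix))
accuracy_direct = np.mean(y_true == y_pred)

print(matrix)
print("The accuracy is: {:.2%}".format(accuracy_from_matrix))
```

Both calculations give the same value, because every correct prediction lands on the diagonal and every mistake lands off it.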
Now go to this link: https://colab.research.google.com/drive/1NeS9xLP9EJ4pxHpZDS2nJiEm-xCK-JoG