CP2421 Machine Learning for Cybersecurity

CP2421 has been officially renamed "Machine Learning for Cybersecurity".

Subject Link: https://secure.jcu.edu.au/app/studyfinder/index.cfm?subject=CP2421&year=2023&transform=subjectwebview.xslt

Prerequisites:

CP1401 Problem Solving and Programming I

MA1580 Foundations of Data Science

Subject Assessment

Examination (centrally administered) - (40%)
Case study analysis - (30%)
Project report - (30%)

Lesson Plan

Introduction to the Machine Learning Approach
Log Analysis for Web Security [Log Analysis; Python]
Botnet Detection [Botnet detection; Python]
User Activity Keystroke Analysis [Keystroke analysis; Weka]
User Activity Mouse Dynamics Analysis [Mouse dynamics analysis; Python]
Adversarial Machine Learning [Adversarial ML; Python]
Machine Learning for Web Security; Deep Packet Inspection [Deep packet inspection; Python]
Machine Learning for Intrusion Detection [Intrusion Detection; Python]
Malware Detection using ML [Malware Detection; Python]
Phishing Detection using ML & Automated Alert Correlation [Phishing detection; Python]

Terminology

MODEL: a mathematical representation of a real world process; a predictive model forecasts a future outcome based on past behaviors.

TRAINING: the process of creating a model from the training data. The data is fed into the training algorithm, which learns a representation for the problem, and produces a model. Also called “learning.”

CLASSIFICATION: a prediction method that assigns each data point to a predefined category, e.g., a type of operating system.

TRAINING SET: a dataset used to find potentially predictive relationships that will be used to create a model.

FEATURE: also known as an independent variable or a predictor variable, a feature is an observable quantity, recorded and used by a prediction model. You can also engineer features by combining them or adding new information to them.

ALGORITHM: a set of rules used to make a calculation or solve a problem.

REGRESSION: a prediction method whose output is a real number, that is, a value that represents a quantity along a line.

TARGET: in statistics, it is called the dependent variable; it is the output of the model or the variable you wish to predict.

TEST SET: a dataset, separate from the training set but with the same structure, used to measure and benchmark the performance of various models.

OVERFITTING: a situation in which a model that is too complex for the data has been trained to predict the target. This leads to an overly specialized model, which makes predictions that do not reflect the reality of the underlying relationship between the features and target.
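The relationship between a training set and a test set can be sketched in a few lines. This is a minimal illustration using only the standard library; the function name split_train_test, the 80/20 split, and the toy records are assumptions made for this example, not part of the subject material.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (training set, test set)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records: (feature, target)
records = [(x, x % 2) for x in range(10)]
train_set, test_set = split_train_test(records)
print(len(train_set), len(test_set))
```

The model is trained only on train_set. Scoring it on test_set, which it has never seen, is what reveals overfitting: an overfitted model scores well on the training set and noticeably worse on the test set.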

Lesson 1 Introduction

Confusion Matrix

Machine learning: where a system improves its performance through analysis of previous performance

Unsupervised learning: where the machine learning takes place entirely through the system analysing and categorising the available data

Supervised learning: where sample data is supplied to the system with associated data relating to the outcome of its use

Reinforcement learning: where an agent learns by receiving graded rewards for actions taken

imgur.com/a/MN6bmIN

https://link.springer.com/article/10.1186/s40537-020-00318-5

Student Requirement

Google Account
Google Drive

Readings for Lesson 1

https://colab.research.google.com/

https://www.sciencedirect.com/science/article/pii/S0022000014000178

https://ieeexplore.ieee.org/document/8567980

https://nlp.stanford.edu/IR-book/html/htmledition/k-nearest-neighbor-1.html

https://www.csie.ntu.edu.tw/~cjlin/libsvm/

Machine Learning with Python: Foundations Online Class | LinkedIn Learning, formerly Lynda.com. Learn the basics of machine learning and how you can create a machine learning model with Python.

Go through the preview videos to get the basics.

https://www.linkedin.com/learning/certificates/3f2379f4d4d1d0215517ac0f43f2c2c2ed89a83274d4c88dfaf66e9cd5ded04d?u=2223545

How to import data in Python | LinkedIn Learning, formerly Lynda.com. In this video, learn what pandas.Series and pandas.DataFrame objects are, as well as how to import data from both a CSV and Excel file into a DataFrame in Python.

Pandas documentation: https://pandas.pydata.org/docs/

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

https://pandas.pydata.org/docs/index.html

import pandas as pd

members = ["Brazil", "Russia", "India", "China", "South Africa"]

brics1 = pd.Series(members)

brics1

type(brics1)

members = {"country": ["Brazil", "Russia", "India", "China", "South Africa"],

"capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],

"gdp": [2750, 1658, 3202, 15270, 370],

"literacy": [0.944, 0.997, 0.721, 0.964, 0.943],

"expectancy": [76.8, 72.7, 68.8, 76.4, 63.6],

"population": [210.87, 143.96, 1367.09, 1415.05, 57.4]

}

brics2 = pd.DataFrame(members)

brics2

type(brics2)

members = [["Brazil", "Brasilia", 2750, 0.944, 76.8, 210.87],

["Russia", "Moscow", 1658, 0.997, 72.7, 143.96],

["India", "New Delhi", 3202, 0.721, 68.8, 1367.09],

["China", "Beijing", 15270, 0.964, 76.4, 1415.05],

["South Africa", "Pretoria", 370, 0.943, 63.6, 57.4]]

labels = ["country", "capital", "gdp", "literacy", "expectancy", "population"]

brics3 = pd.DataFrame(members, columns = labels)

brics3

brics4 = pd.read_csv("brics.csv")

brics4

brics5 = pd.read_excel("brics.xlsx")

brics5

brics6 = pd.read_excel("brics.xlsx", sheet_name = "Summits")

brics6
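The read_csv and read_excel calls above need brics.csv and brics.xlsx to exist in the working directory. If you do not have those files yet, the following sketch shows the same CSV import against an in-memory stand-in. The file contents and the name brics_demo are made up for illustration.

```python
import io

import pandas as pd

# In-memory stand-in for brics.csv (hypothetical file contents)
csv_text = """country,capital,gdp
Brazil,Brasilia,2750
Russia,Moscow,1658
India,New Delhi,3202
"""

# read_csv accepts a file path or any file-like object
brics_demo = pd.read_csv(io.StringIO(csv_text))

print(brics_demo.shape)   # (rows, columns)
print(brics_demo.dtypes)  # column types inferred by pandas
```

The first line of the CSV becomes the column labels, which is why no columns argument is needed here.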

Lesson 2 Log Analysis for Web Security

Reading for Lesson 2

https://dl.acm.org/doi/pdf/10.1145/3400286.3418261?casa_token=IJj6EkmtDnUAAAAA:ePJwtrZGfRtznkKaa7dPB3nMeozAgOQIjBsjaFOfFtO_XNfBfNPqiH0rztqBd7-NYATch08SvpAo

https://www.sciencedirect.com/science/article/pii/S0167404820300250

https://www.ptsecurity.com/ww-en/analytics/web-vulnerabilities-2020/

https://developer.nvidia.com/blog/cybert-rapids-ai/

http://opendl.ifip-tc6.org/db/conf/im/im2013/MakanjuZM13.pdf

https://dr.lib.iastate.edu/handle/20.500.12876/17031/

https://jis-eurasipjournals.springeropen.com/articles/10.1186/s13635-018-0081-z

https://www.sciencedirect.com/science/article/abs/pii/S1389128615002650?casa_token=HlqHRH7baNsAAAAA:9v81L-2j1_FBrO5155mgdIQzQZCLWUOs6vL-1W2J-4iFbQZaXAf5zJaZLM3afC5QLL1L3rkHTQ

https://www.researchgate.net/profile/Alf-Larsson/publication/295084228_Operational_Log_Analysis_for_Big_Data_Systems_Challenges_and_Solutions/links/5bd9f4b792851c6b279c99a9/Operational-Log-Analysis-for-Big-Data-Systems-Challenges-and-Solutions.pdf

https://cse.sc.edu/~huangct/CSCE813F16/07544930.pdf

https://dl.acm.org/doi/pdf/10.1145/2991079.2991122?casa_token=DvY8N2V3i9IAAAAA:SRbKXQAo44AMmVEvqmcblchCuxis3Ldk1BkLQtejnZ6PdsoOCXexfS4CJ0mMXqEVIGEhnyWqfSp4

https://dl.acm.org/doi/pdf/10.1145/2523649.2523670?casa_token=tqjl2hCPH3UAAAAA:Widt2VtDly2sPrjcJCQ-XlCPzlaWABrE36xfmGE6ocZPtGc7QOJ_eMC7wAg4a7E2Fk_zttemCxdK

The practicals are taken from Walid Daboubi's GitHub repository:

https://github.com/slrbl/Intrusion-and-anomaly-detection-with-machine-learning

Link to colab:

https://colab.research.google.com/drive/1x0LbDfN2yj9vYQmW_aRvxkeI41CIAkWA?usp=sharing

The dataset is obtained from the following link: http://www.secrepo.com/self.logs/

Get the dec_2016.csv and feb_2017.csv from https://drive.google.com/drive/folders/1AoB_mBMVKU2owkVBr4xxzKCgKmh-BlYb?usp=sharing

Allow Colab to access Google Drive

"Permit this notebook to access your Google Drive files?

This notebook is requesting access to your Google Drive files. Granting access to Google Drive will permit code executed in the notebook to modify files in your Google Drive. Make sure to review notebook code prior to allowing this access."

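Correction of typo required: the notebook misspells the labels variable as traning_labels in the fit call. Rename it to match the variable defined earlier in the notebook (presumably training_labels). Without the fix, the cell fails as follows: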

=-=-=-=-=-=-=- Decision Tree Classifier -=-=-=-=-=-=-=-

---------------------------------------------------------------------------

NameError Traceback (most recent call last)

<ipython-input-21-3de3a6570295> in <module>

# Train the classifier

---> attack_classifier = attack_classifier.fit(training_features, traning_labels)

# Get predections for the testing data

NameError: name 'traning_labels' is not defined

For the Project Report, you are allowed to use Logistic Regression.

Lesson 3 Botnet Detection

Tool

https://www.wireshark.org/

Botnet Dataset

https://www.uvic.ca/engineering/ece/isot/datasets/botnet-ransomware/index.php

https://www.unb.ca/cic/datasets/botnet.html

Reading List

https://www.academia.edu/download/36990790/A_Survey_of_Botnet_and_Botnet_Detection.pdf

https://www.spamhaus.com/custom-content/uploads/2020/04/2019-Botnet-Threat-Report-2019-LR.pdf

https://securelist.com/bots-and-botnets-in-2018/90091/

https://ieeexplore.ieee.org/document/8026031

https://link.springer.com/chapter/10.1007/978-3-642-01440-6_27

https://ieeexplore.ieee.org/document/4569852

https://ieeexplore.ieee.org/document/5455789

https://ieeexplore.ieee.org/document/7891834

https://hal.inria.fr/hal-01636480/file/botgm-csnet.pdf

Practical

Based on the following GitHub code:

https://github.com/ShehzadaAlam/Botnet-Detection/blob/master/Botnet%20Detection.ipynb

https://mcfp.weebly.com/the-ctu-13-dataset-a-labeled-dataset-with-botnet-normal-and-background-traffic.html

Link to colab: https://colab.research.google.com/drive/15nFlwvwueZjfp5MrVyb9EfuNk4cwCXZE?usp=sharing [I would suggest making at least 2 copies of this colab]

Data from here: https://drive.google.com/drive/folders/14SLU9--GnB8hbdVyZ0rUtvb15TWOExkn?usp=sharing [right-click and make a copy. Then move the data to your Colab folder. Rename to capture20110810.binetflow.2format ]

https://mcfp.weebly.com/ctu-malware-capture-botnet-42.html [download the capture20110810.pcap.netflow.labeled file. It is 370 MB]

Note: The training process took me about 3 minutes.

Lesson 3; Step 3 : Read botnet data:

Note: Make sure to use the variable names that you set yourself.

dataset = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Datasets/capture20110810.binetflow")

dataset = dataset[dataset['Label'].str.contains('Botnet')]

dataset.head() #print 5 rows

Lesson 3; Step 4 : Preprocess the datasets to get rid of missing values:

Optional:

Use dataset.describe() instead of dataset.description()

Graph may be different but you should get close to 0.30

Lesson 3; Step 5: Preprocessing step continues:

#Omit LastTime as this column is NOT in dataset

dataset = dataset.astype({"Proto":'category',"Sport":'category',"Dport":'category',"State":'category','StartTime':'datetime64[s]'})

dataset

# NOT required, as we can use the dur (Duration) column directly

# Getting duration from the columns 'LastTime' and 'StartTime'

# dataset['duration'] = abs(dataset['LastTime'].dt.second - dataset['StartTime'].dt.second)

# Drop the selected columns excluding LastTime

# LastTime does NOT exist in the dataset

dataset.drop(columns=['SrcAddr','DstAddr','StartTime'],inplace=True)

dataset
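The Step 5 preprocessing can be tried on a tiny frame before running it on the full capture. This is only a sketch: the rows are invented, flows is a hypothetical name, and pd.to_datetime is swapped in here for the astype('datetime64[s]') conversion used in the notes.

```python
import pandas as pd

# Toy flows with hypothetical values; column names follow the notes
flows = pd.DataFrame({
    "StartTime": ["2011/08/10 09:46:53", "2011/08/10 09:46:54"],
    "Dur": [1.5, 0.2],
    "Proto": ["tcp", "udp"],
    "SrcAddr": ["10.0.0.1", "10.0.0.2"],
    "Sport": ["1025", "53"],
    "DstAddr": ["10.0.0.9", "10.0.0.8"],
    "Dport": ["80", "53"],
    "State": ["S_RA", "CON"],
})

# Mark the discrete columns as categorical
flows = flows.astype({"Proto": "category", "Sport": "category",
                      "Dport": "category", "State": "category"})

# Parse the timestamp strings (stand-in for the astype call in the notes)
flows["StartTime"] = pd.to_datetime(flows["StartTime"])

# Drop the address and timestamp columns; there is no LastTime column to drop
flows = flows.drop(columns=["SrcAddr", "DstAddr", "StartTime"])

print(flows.dtypes)
```

Reassigning the result of drop, rather than using inplace=True, keeps each step easy to rerun in a notebook cell.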

Lesson 3; Step 6: Training and evaluating a classification model: Decision Tree Classifier (DTC)

# List the column names to be removed from X (including the Dir column, as it is NOT numeric)

columns = ["Proto", "Sport", "Dport", "State", "Dir"]

X = X.drop(columns, axis=1)

# Drop columns having NaN values. Exclude sHops and sTtl, as these 2 columns do NOT exist

columns = ["sTos"]

X = X.drop(columns, axis=1)

X

My Decision Tree accuracy score: 96%

*Note you may get a different number

Lesson 3; Step 7: Repeat training and testing for Random Forest Classifier. Get the performance results and compare their performance. Remember to complete this step on your own.

Lesson 4 Keystroke Analysis

Reading List

https://www.sciencedirect.com/science/article/pii/S0167404816300657?casa_token=unl_xln3KqgAAAAA:9S5fRrIVtN63Bf-w_j05SOU1-yS9OhdFjfMc7uuznHXEmVn1-RziPVEHdwDiQ-F2sC1Wg1oW_w

https://www.iii.org/fact-statistic/facts-statistics-identity-theft-and-cybercrime

https://www.researchgate.net/publication/336436859_Analysis_of_Authentication_System_Based_on_Keystroke_Dynamics

https://www.sciencedirect.com/science/article/abs/pii/S0925231216314321?casa_token=6BYmiXROrnYAAAAA:ZHk-dqT99NpmIBZ_pSsCbTQCN0tBGchQVsyocgLrbsGOUwXKtTUEVjxuEI1FJnapf7HAYnoLJQ

https://www.biometrie-online.net/images/stories/dossiers/generalites/International-Journal-of-u-and-e-Service-Science-and-Technology.pdf

https://www.theseus.fi/bitstream/handle/10024/44684/Babich_Aleksandra.pdf

https://link.springer.com/article/10.1007/s13173-013-0117-7

http://www.cs.cmu.edu/afs/cs/Web/People/maxion/pubs/KillourhyMaxion09.pdf

https://dl.acm.org/doi/pdf/10.1145/75577.75582?casa_token=U9iTSHfly0gAAAAA:MVZwyCtdzkab2k8UvuZZd9ZHQxMEDcFx7hws2kkPWIoH1-AgtcSUDMIgdiOHh2CsqXuiJjjInG1w

https://link.springer.com/content/pdf/10.1007/978-3-540-74549-5_125.pdf

For the practical, you need to download and install WEKA on your local computer:

https://waikato.github.io/weka-wiki/downloading_weka/

** This can take time, therefore it is recommended to install WEKA before you attend the practical.

Practical

Download the DSL-StrongPasswordData.csv file from here:

http://www.cs.cmu.edu/~keystroke/ OR

https://drive.google.com/file/d/1LLCPlzYXvZbeX-o7EalsdFMpfulV17_-/view?usp=sharing

Lesson 4; Step 3: Load the data into the specified machine learning models and build a classification model.

RandomForest -P 100 -print -I 300 -num-slots 1 -K 34 -M 1.0 -V 0.001 -S 1 -batch-size 200

If you run into an OutOfMemoryException:

Try again. Sometimes the first run is NOT successful.

If the error persists, you will have to increase the maximum heap size.

How much you can allocate, depends heavily on the operating system and the underlying hardware.

Locate and edit the RunWeka.ini file. Note: I have an 8 GB RAM machine.

# The JAVA_OPTS environment variable (if set). Can be used as an alternative way to set

# the heap size (or any other JVM option)

#javaOpts=%JAVA_OPTS%

javaOpts=%JAVA_OPTS% -Xmx8048m

Lesson 4; Step 4: Result:

Correctly Classified Instances 1830 89.7059 %

Incorrectly Classified Instances 210 10.2941 %

Kappa statistic 0.8949

Mean absolute error 0.0113

Root mean squared error 0.0634

Relative absolute error 29.2818 %

Root relative squared error 45.7395 %

Total Number of Instances 2040

Lesson 4; Step 5: Train the multilayer perceptron (MLP) algorithm

MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 1 -E 20 -H a -G -R -batch-size 200

Lesson 4; Step 6: Result

Correctly Classified Instances 1720 84.3137 %

Incorrectly Classified Instances 320 15.6863 %

Kappa statistic 0.8399

Mean absolute error 0.0081

Root mean squared error 0.0714

Relative absolute error 21.0242 %

Root relative squared error 51.4638 %

Total Number of Instances 2040
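The headline figures in the two Weka summaries above can be checked by hand. The sketch below recomputes the accuracy from the instance counts and approximates the Kappa statistic. The Kappa formula used here assumes that all 51 typists in the CMU dataset are equally represented in the 2040 test instances. That balance is an assumption, so the result only approximates the value Weka reports.

```python
def accuracy(correct, total):
    """Fraction of instances that were classified correctly."""
    return correct / total


def kappa_balanced(correct, total, n_classes):
    """Cohen's kappa, ASSUMING every class has the same share of the test set."""
    observed = correct / total
    expected = 1 / n_classes  # chance agreement under the balance assumption
    return (observed - expected) / (1 - expected)


TOTAL = 2040     # "Total Number of Instances" in both summaries
N_TYPISTS = 51   # assumption: 51 equally represented subjects

# Correctly classified instances taken from the two result summaries
results = {"RandomForest": 1830, "MultilayerPerceptron": 1720}

for name, correct in results.items():
    acc = 100 * accuracy(correct, TOTAL)
    kappa = kappa_balanced(correct, TOTAL, N_TYPISTS)
    print(f"{name}: accuracy {acc:.4f} %, approximate kappa {kappa:.4f}")
```

The approximate Kappa values land within about 0.0001 of the reported 0.8949 and 0.8399, which suggests the test split is close to, but not exactly, balanced.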

Lesson 5: User Activity Mouse Dynamics Analysis

Reading List

Additional Notes

https://strbase-b.nist.gov/fileDoc/strbasePDFS/pub_pres/VallloneAAFS2010.pdf

https://www.researchgate.net/profile/Hugo-Gamboa/publication/260925814_A_behavioral_biometric_system_based_on_human-computer_interaction/links/5c110480a6fdcc494feec038/A-behavioral-biometric-system-based-on-human-computer-interaction.pdf

https://dl.acm.org/doi/pdf/10.1145/3003733.3003764?casa_token=dWqdgnEJIooAAAAA:sSEdlR55XEgQg1eLFRtjET5K_-QRD94oBuP91DHuzvVepK65dLEjx-SDVt4aBCbB5f2uzJi_kt-G

https://arxiv.org/pdf/1810.04668.pdf

https://dl.acm.org/doi/pdf/10.1145/2046707.2046725?casa_token=6W0-lBB0nyoAAAAA:7d1vjGsxdKGJTOWnHr4BNO1R1yxde8F0Tacy05J5KBBO6SNrcSIjMHhJftb-dMtQJXbpH9FKtkQX

https://github.com/balabit/Mouse-Dynamics-Challenge

Practical

Download the following two CSV files (balabit_39feat_PC_MM_DD_250.csv and balabit_39feat_PC_MM_DD_50.csv) from the link:

https://drive.google.com/drive/folders/1Yqe2Hw2ECPSu3nTIk2-wQe0dVUAUIXd2

https://colab.research.google.com/drive/1Q3-AgQdNGkeozgqfOg9dMydulEuzpptV?usp=sharing

Lesson 5; Step 1: Import libraries.

numpy is a fundamental package for scientific computing with Python. More info at https://numpy.org/

import sys

import warnings

import copy

from sklearn import model_selection, metrics

from sklearn.metrics import roc_curve, auc, roc_auc_score, accuracy_score

import numpy as np

from sklearn import preprocessing

from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

%matplotlib inline

Lesson 5; Step 2: Load the training and test csv data files

MyDrive/Colab Notebooks/Datasets/Mouse_Dynamics_Data/features/

Lesson 5; Step 3: Explore the data to check the quality of the data.

*Note: The columns and shape steps can be run in either order.

Output of info()

Data columns (total 40 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 type_of_action 500 non-null int64

1 traveled_distance_pixel 500 non-null float64

2 elapsed_time 500 non-null float64

3 direction_of_movement 500 non-null int64

4 straightness 500 non-null float64

5 num_points 500 non-null int64

6 sum_of_angles 500 non-null float64

7 mean_curv 500 non-null float64

8 sd_curv 500 non-null float64

9 max_curv 500 non-null float64

10 min_curv 500 non-null float64

11 mean_omega 500 non-null float64

12 sd_omega 500 non-null float64

13 max_omega 500 non-null float64

14 min_omega 500 non-null float64

15 largest_deviation 500 non-null float64

16 dist_end_to_end_line 500 non-null float64

17 num_critical_points 500 non-null int64

18 mean_vx 500 non-null float64

19 sd_vx 500 non-null float64

20 max_vx 500 non-null float64

21 min_vx 500 non-null float64

22 mean_vy 500 non-null float64

23 sd_vy 500 non-null float64

24 max_vy 500 non-null float64

25 min_vy 500 non-null float64

26 mean_v 500 non-null float64

27 sd_v 500 non-null float64

28 max_v 500 non-null float64

29 min_v 500 non-null float64

30 mean_a 500 non-null float64

31 sd_a 500 non-null float64

32 max_a 500 non-null float64

33 min_a 500 non-null float64

34 mean_jerk 500 non-null float64

35 sd_jerk 500 non-null float64

36 max_jerk 500 non-null float64

37 min_jerk 500 non-null float64

38 a_beg_time 500 non-null float64

39 userid 500 non-null int64

Output of ['userid'].value_counts()

*Note: The order may NOT be the same but the content is the SAME

12 250

15 250

16 250

20 250

21 250

23 250

29 250

35 250

7 250

9 250

Name: userid, dtype: int64

Lesson 5; Step 5: Extract features for the testing datasets

Note: the name of the column is userid

X_test = test_data.drop('userid', axis=1)

Y_test = test_data['userid']

Lesson 5; Step 6: Train a classifier: Random Forest classifier

#First import the necessary library

from sklearn.ensemble import RandomForestClassifier #import the class directly so it can be called by name

# create a classifier with configuration parameters

model = RandomForestClassifier(n_estimators=200, max_leaf_nodes=5, random_state=None) #initialize the random forest classifier

model.fit(X_train, Y_train) #train the model using .fit() and training features and labels

Output shows the parameters defined for the classifier configuration, and also confirms that the training has been completed

RandomForestClassifier(max_leaf_nodes=5, n_estimators=200)

Lesson 5; Step 7: Evaluate the model using training dataset

from sklearn.model_selection import cross_validate #cross_validate is not imported in Step 1

scores = cross_validate(model, X_train, Y_train, cv=10, return_train_score=False) #calculate the cross-validation scores and store them in the variable scores

print(scores) #print the achieved score

My OUTPUT:

*Note: Result may be different

{'fit_time': array([0.9304142 , 0.96811843, 0.97566175, 0.94981956, 0.98914838,

1.08338332, 0.93859506, 0.95835304, 0.94819069, 0.93062997]), 'score_time': array([0.03033805, 0.02822661, 0.02799368, 0.02524114, 0.02984905,

0.02503633, 0.02721095, 0.02906919, 0.03139567, 0.02880883]), 'test_score': array([0.312, 0.328, 0.296, 0.356, 0.292, 0.336, 0.292, 0.332, 0.28 ,

0.34 ])}
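The ten test_score values are easier to interpret once they are summarised. The following sketch, using only the standard library, averages the scores printed above and compares the mean with the chance level for ten equally represented users. The variable names are chosen for this example.

```python
from statistics import mean, stdev

# test_score values copied from the cross_validate output above
test_scores = [0.312, 0.328, 0.296, 0.356, 0.292, 0.336, 0.292, 0.332, 0.28, 0.34]

mean_score = mean(test_scores)
spread = stdev(test_scores)
chance_level = 1 / 10  # ten users with 250 rows each

print(f"mean accuracy: {mean_score:.4f} (standard deviation {spread:.4f})")
print(f"chance level:  {chance_level:.4f}")
```

A mean of roughly 0.32 is well above the 0.10 chance level but still low, which is consistent with the very shallow trees forced by max_leaf_nodes=5.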

#import additional libraries for visualizing the random forest classifier

from graphviz import Source

from sklearn import tree

from sklearn.tree import export_graphviz

#Extract a single tree and store it in estimator

estimator = model.estimators_[5]

# Export as dot file

Source(tree.export_graphviz(estimator, out_file=None, feature_names=X_train.columns))

OUTPUT: A TREE Diagram

Lesson 5; Step 8: Evaluate the model using test dataset

#import additional libraries for plotting the confusion matrix and classification report

from sklearn.metrics import confusion_matrix, classification_report

y_pred = model.predict(X_test) #get the predictions using the testing features and save them in y_pred

print(y_pred) #print the predictions

print(confusion_matrix(Y_test, y_pred)) #print the confusion matrix

print(classification_report(Y_test, y_pred)) #print the classification report

My Output

*Note: The predictions might differ. Please don't assume that yours are incorrect if the predictions are different.

[12 7 15 12 15 16 20 15 35 29 35 20 15 29 12 15 12 23 15 29 12 12 20 12

16 15 15 12 12 15 12 20 20 29 29 29 29 16 29 9 12 35 7 12 16 12 15 7

15 15 15 15 15 15 12 21 35 29 35 29 29 12 12 12 12 15 12 12 12 12 15 15

12 12 15 29 15 15 12 9 23 12 16 7 12 7 15 15 29 7 23 29 15 29 16 29

15 20 15 15 29 29 29 12 29 29 16 16 16 29 16 12 15 29 23 23 23 15 23 23

15 15 15 29 16 15 12 15 15 15 23 23 12 15 29 29 35 12 9 15 29 20 12 12

16 20 12 15 29 29 7 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 7

15 20 20 20 20 20 20 20 20 20 20 20 20 20 20 7 20 20 20 20 20 20 16 20

20 15 20 20 20 20 20 15 23 29 15 23 15 16 12 12 21 23 35 35 7 29 20 12

7 12 29 9 29 7 29 15 15 15 12 20 29 15 15 15 12 20 7 15 29 15 29 29

29 7 29 7 9 29 9 15 9 15 29 23 15 12 29 29 23 15 23 12 23 7 16 15

12 29 23 23 23 7 23 15 29 15 15 29 15 29 29 15 20 12 29 29 20 15 23 12

29 20 7 12 12 15 12 29 35 29 29 15 29 29 29 15 15 12 12 16 29 29 29 35

7 29 23 29 29 16 29 15 15 12 12 29 15 15 29 29 29 29 35 29 7 29 29 29

23 23 15 15 29 29 29 29 29 20 20 29 29 12 29 23 29 29 29 29 7 35 29 23

35 29 29 29 29 29 29 15 7 16 29 7 29 9 7 29 9 29 29 29 29 35 20 29

29 29 7 29 29 35 29 29 29 29 23 29 12 29 35 12 16 20 20 20 20 20 20 20

7 20 7 16 16 7 20 16 16 7 7 7 16 7 15 15 20 7 7 7 7 15 20 7

7 20 20 20 7 29 20 7 29 7 20 20 7 7 7 7 9 7 9 7 9 9 15 9

9 9 9 9 9 9 7 9 9 9 9 12 20 9 9 9 9 9 9 29 9 9 7 9

9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 20 9 9 9 9]

Note: The confusion matrix and classification report might differ. Please don't assume that your results are incorrect if the values don't match.

[[21 1 0 3 6 17 0 0 2 0]

[ 3 42 1 1 0 2 0 0 1 0]

[ 3 1 13 12 4 5 0 1 8 3]

[ 3 1 14 16 2 1 1 2 8 2]

[ 0 1 8 12 6 2 0 7 13 1]

[ 3 0 0 3 1 43 0 0 0 0]

[ 6 4 6 12 1 3 1 3 12 2]

[ 3 0 8 11 1 3 0 9 14 1]

[ 2 0 5 8 2 2 0 3 26 2]

[ 5 2 2 1 1 1 0 3 30 5]]

precision recall f1-score support

7 0.43 0.42 0.42 50

9 0.81 0.84 0.82 50

12 0.23 0.26 0.24 50

15 0.20 0.32 0.25 50

16 0.25 0.12 0.16 50

20 0.54 0.86 0.67 50

21 0.50 0.02 0.04 50

23 0.32 0.18 0.23 50

29 0.23 0.52 0.32 50

35 0.31 0.10 0.15 50

accuracy 0.36 500

macro avg 0.38 0.36 0.33 500

weighted avg 0.38 0.36 0.33 500
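The classification report can be derived directly from the confusion matrix above. In scikit-learn's layout, each row is the true user and each column is the predicted user. The sketch below recomputes precision, recall and F1 per user, plus the overall accuracy, using only the standard library. The helper name per_class_metrics is made up for this example.

```python
# User ids in the order used by the classification report
labels = [7, 9, 12, 15, 16, 20, 21, 23, 29, 35]

# Confusion matrix copied from the output above (rows: true, columns: predicted)
matrix = [
    [21, 1, 0, 3, 6, 17, 0, 0, 2, 0],
    [3, 42, 1, 1, 0, 2, 0, 0, 1, 0],
    [3, 1, 13, 12, 4, 5, 0, 1, 8, 3],
    [3, 1, 14, 16, 2, 1, 1, 2, 8, 2],
    [0, 1, 8, 12, 6, 2, 0, 7, 13, 1],
    [3, 0, 0, 3, 1, 43, 0, 0, 0, 0],
    [6, 4, 6, 12, 1, 3, 1, 3, 12, 2],
    [3, 0, 8, 11, 1, 3, 0, 9, 14, 1],
    [2, 0, 5, 8, 2, 2, 0, 3, 26, 2],
    [5, 2, 2, 1, 1, 1, 0, 3, 30, 5],
]


def per_class_metrics(matrix, index):
    """Return (precision, recall, f1) for the class at the given index."""
    true_positive = matrix[index][index]
    actual_total = sum(matrix[index])                     # row sum
    predicted_total = sum(row[index] for row in matrix)   # column sum
    precision = true_positive / predicted_total if predicted_total else 0.0
    recall = true_positive / actual_total if actual_total else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


correct = sum(matrix[i][i] for i in range(len(matrix)))  # diagonal entries
overall_accuracy = correct / sum(map(sum, matrix))

for i, label in enumerate(labels):
    p, r, f = per_class_metrics(matrix, i)
    print(f"{label:>3}  precision {p:.2f}  recall {r:.2f}  f1 {f:.2f}")
print(f"accuracy {overall_accuracy:.2f}")
```

Reading the matrix this way also shows where the model struggles: user 35 (last row) is predicted as user 29 in 30 of 50 cases.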

Lesson 6: Understanding Adversarial Attacks

Reading List

Additional Notes

https://pdfs.semanticscholar.org/57c5/2c98730c26290b2044ad45924e58cb2fb5cf.pdf

https://dl.acm.org/doi/pdf/10.1145/2810103.2813677?casa_token=hwCz9mW82BYAAAAA:fNmMAGE0UNK-LkrbbU7M39YfeGryYGlYZM5LuvsSV3VQQS8J2mSKL9Id95kw3vBMgc9yQESLhmKS

https://arxiv.org/pdf/1412.6572.pdf

https://arxiv.org/pdf/1511.07528.pdf&xid=25657,15700023,15700124,15700149,15700186,15700191,15700201,15700237,15700242.pdf

https://dl.acm.org/doi/abs/10.1145/2046684.2046692?casa_token=a1Dyp32OQQ8AAAAA%3A83ZNo9eDc2RrejGB6pconASOkW8YIF_AVHO20OiuJ2PF4yjux3a8_1MPlJrrELdfON4p49OxgjQO

https://proceedings.mlr.press/v9/kloft10a/kloft10a.pdf

https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_tramer.pdf

https://arxiv.org/pdf/1704.03453.pdf?source=post_page---------------------------

https://arxiv.org/pdf/1605.07277.pdf

https://arxiv.org/pdf/1704.01155.pdf

https://arxiv.org/pdf/1704.02654.pdf

https://ieeexplore.ieee.org/document/7917369

Practical

Link to colab:

https://colab.research.google.com/drive/1dCZ_OoveqoRjZm6sbE0ErTLf0NmmSqoY?usp=sharing

Lesson 6; Installation [do this First]

!pip install adversarial-robustness-toolbox

Lesson 6; Step 1: Import libraries

from sklearn.ensemble import RandomForestClassifier

import numpy as np

from matplotlib import pyplot as plt

from art.estimators.classification import SklearnClassifier

from art.attacks.evasion import ZooAttack

from art.utils import load_mnist

import warnings

warnings.filterwarnings('ignore')

Lesson 6; Step 2: Load the MNIST dataset from art.utils.

To print x_test

x_test

Lesson 6; Step 3: Compute the number of features for training (x_train) and testing dataset (x_test) and save their values in n_features_train and n_features_test.

Comment out the n_samples_ lines:

#n_samples_train = x_train.shape[0] #calculate the number of samples for the training dataset by taking the 0th position of x_train.shape ([60000, 28, 28, 1]), which is 60000

n_features_train = x_train.shape[1] * x_train.shape[2] * x_train.shape[3] #compute the number of training features from x_train.shape ([60000, 28, 28, 1]), which multiplies 28*28*1

#n_samples_test = x_test.shape[0] #calculate the number of samples for the testing dataset by taking the 0th position of x_test.shape ([10000, 28, 28, 1]), which is 10000

n_features_test = x_test.shape[1] * x_test.shape[2] * x_test.shape[3]

Typo correction in the code (spelling of "samples"):

n_samples_test = x_test.shape[0]

n_samples_train = x_train.shape[0]
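The sample and feature counts matter because scikit-learn estimators expect a two-dimensional array of shape (samples, features), while the MNIST images arrive as four-dimensional arrays. A minimal sketch of the flattening, using a small zero-filled stand-in instead of the real data (x_demo and the other _demo names are hypothetical):

```python
import numpy as np

# Stand-in for x_test with only 5 images (hypothetical data, same layout as MNIST)
x_demo = np.zeros((5, 28, 28, 1), dtype=np.float32)

n_samples_demo = x_demo.shape[0]                                       # 5 images
n_features_demo = x_demo.shape[1] * x_demo.shape[2] * x_demo.shape[3]  # 28*28*1

# Flatten every image into one row of pixel values
x_flat = x_demo.reshape(n_samples_demo, n_features_demo)

print(x_demo.shape, "->", x_flat.shape)
```

The same reshape applies to x_train and x_test, using the n_samples_ and n_features_ values computed in this step.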

Lesson 6; Step 4: Preprocessing the data

x_test #print x_test

Lesson 6; Step 6: Train a classifier: Random Forest Classifier

Add the following code and run it separately

# Instantiate a classifier

from sklearn.ensemble import RandomForestClassifier

#defining the random forest classifier (min_impurity_split is omitted and max_features='sqrt' replaces 'auto', because newer scikit-learn releases no longer accept them)

model = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2,

min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt',

max_leaf_nodes=None, min_impurity_decrease=0.0,

bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0,

warm_start=False, class_weight=None)

Does NOT produce output

Execute model.fit code

model.fit(x_train, y_train) #fit the model on x_train and y_train

Lesson 6; Step 7: Generate adversarial samples

Note: Generating the adversarial samples takes time.

x_train_adv

Output

array([[0., 0., 0., ..., 0., 0., 0.],

[0., 0., 0., ..., 0., 0., 0.],

...,

[0., 0., 0., ..., 0., 0., 0.],

[0., 0., 0., ..., 0., 0., 0.]])

x_test_adv

Output

array([[0., 0., 0., ..., 0., 0., 0.],

[0., 0., 0., ..., 0., 0., 0.],

...,

[0., 0., 0., ..., 0., 0., 0.],

[0., 0., 0., ..., 0., 0., 0.]])

Lesson 6; Step 8: Train and evaluate the model using the normal training data and the normal test data

Execute first before starting this step

#defining the random forest classifier (min_impurity_split is omitted and max_features='sqrt' replaces 'auto', because newer scikit-learn releases no longer accept them)

model_testing = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2,

min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt',

max_leaf_nodes=None, min_impurity_decrease=0.0,

bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0,

warm_start=False, class_weight=None)

Execute next

#train the model with the normal x_train and y_train

model_testing.fit(x_train, y_train)

My outputs for y_pred, the confusion matrix and the classification report for the random forest classifier:

Results might differ. Please don't assume that yours are incorrect if the predictions are different.

[7 2 1 0 4 1 9 9 0 9 0 2 9 0 1 7 4 7 2 9 9 6 2 4 4 0 7 4 0 1 3 1 3 4 9 2 7

1 1 1 1 7 4 1 3 0 1 2 4 4 6 0 4 7 2 3 4 1 4 5 7 0 4 2 2 3 1 4 3 2 7 0 2 8

1 9 3 7 7 7 9 6 2 9 5 4 7 3 2 1 3 6 9 3 1 0 1 3 2 4 2 0 7 4 4 4 0 1 9 4 3

1 3 9 7 9 4 4 4 2 3 4 7 6 9 9 0 5 3 0 6 6 8 2 8 1 0 1 6 4 2 7 7 1 4 1 6 2

0 4 4 7 0 0 1 9 2 0 2 3 4 2 5 4 2 3 0 5 1 9 9 1 2 7 1 1 1 8 1 5 1 2 5 0 3

4 2 3 0 1 1 1 4 7 0 5 1 0 3 8]

[[14 0 1 1 1 0 0 0 0 0]

[ 0 28 0 0 0 0 0 0 0 0]

[ 1 3 9 0 1 0 0 1 1 0]

[ 1 0 3 9 0 1 0 2 0 0]

[ 2 0 0 3 18 0 0 0 0 5]

[ 5 0 0 3 2 5 0 3 1 1]

[ 1 1 10 0 0 0 8 0 0 0]

[ 0 3 2 1 1 0 0 12 0 5]

[ 1 0 1 3 0 2 1 0 2 0]

[ 0 0 0 0 9 0 0 3 1 8]]

precision recall f1-score support

0 0.56 0.82 0.67 17

1 0.80 1.00 0.89 28

2 0.35 0.56 0.43 16

3 0.45 0.56 0.50 16

4 0.56 0.64 0.60 28

5 0.62 0.25 0.36 20

6 0.89 0.40 0.55 20

7 0.57 0.50 0.53 24

8 0.40 0.20 0.27 10

9 0.42 0.38 0.40 21

accuracy 0.56 200

macro avg 0.56 0.53 0.52 200

weighted avg 0.59 0.56 0.55 200

Lesson 6; Step 9: Train and test the model using adversarial training sample data and adversarial test sample data

Add the code BUT REMOVE min_impurity_split=None

#defining the random forest classifier (min_impurity_split removed; max_features='sqrt' is the classifier equivalent of the older 'auto')

model_testing = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None)

Add the last set of code

#import the necessary libraries to print the confusion matrix and classification report

from sklearn.metrics import confusion_matrix, classification_report

y_pred = model_testing.predict(x_test_adv) #predict the values using the adversarial samples generated from x_test_adv, and store in y_pred

print(y_pred) #print the predictions

print(confusion_matrix(y_test, y_pred)) #print confusion matrix, use y_pred and y_test

print(classification_report(y_test, y_pred)) #print classification report, use y_pred and y_test

My result

Output might differ. Please don't assume that yours is incorrect if the predictions are different.

[7 2 1 0 4 1 3 3 2 9 0 0 9 4 1 3 4 7 3 4 9 6 4 4 4 0 7 4 0 1 3 1 3 4 9 3 7

1 3 1 7 7 9 1 3 8 1 6 4 4 6 2 0 3 0 0 4 1 4 7 1 4 4 2 9 9 6 4 3 0 7 0 3 8

1 4 3 1 4 7 7 6 2 7 8 4 7 3 4 1 0 6 1 3 1 6 1 3 6 4 4 0 0 4 4 4 0 2 9 4 8

1 9 4 4 4 4 4 4 2 3 4 7 6 9 4 2 9 8 9 6 6 3 2 8 1 9 1 6 4 6 7 4 1 4 1 8 3

0 6 4 2 0 6 1 9 6 0 2 1 4 4 8 4 4 3 4 5 1 4 9 8 2 3 2 7 1 4 1 1 1 7 8 0 4

4 1 3 0 1 1 1 2 3 0 3 1 6 4 8]

[[13 0 2 0 1 0 0 0 0 1]

[ 0 26 1 0 0 0 0 1 0 0]

[ 1 3 5 4 0 0 2 0 1 0]

[ 1 0 3 10 1 0 0 0 0 1]

[ 0 1 0 1 22 0 1 0 0 3]

[ 3 0 1 6 1 1 1 1 3 3]

[ 2 0 0 0 5 0 13 0 0 0]

[ 0 2 1 1 3 0 0 13 1 3]

[ 0 1 0 0 3 0 0 1 5 0]

[ 0 1 1 2 12 0 0 0 1 4]]

precision recall f1-score support

0 0.65 0.76 0.70 17

1 0.76 0.93 0.84 28

2 0.36 0.31 0.33 16

3 0.42 0.62 0.50 16

4 0.46 0.79 0.58 28

5 1.00 0.05 0.10 20

6 0.76 0.65 0.70 20

7 0.81 0.54 0.65 24

8 0.45 0.50 0.48 10

9 0.27 0.19 0.22 21

accuracy 0.56 200

macro avg 0.59 0.53 0.51 200

weighted avg 0.61 0.56 0.53 200
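The Step 8 and Step 9 results can be compared directly from their confusion matrices, because the correctly classified digits sit on the diagonal. The sketch below uses only the diagonal entries copied from the two matrices above. The variable names are chosen for this example, and your own numbers will differ from run to run.

```python
TOTAL_SAMPLES = 200  # support total in both classification reports

# Diagonal entries (correct predictions per digit 0-9) of the two matrices
step8_diagonal = [14, 28, 9, 9, 18, 5, 8, 12, 2, 8]     # normal test data
step9_diagonal = [13, 26, 5, 10, 22, 1, 13, 13, 5, 4]   # adversarial test data

step8_accuracy = sum(step8_diagonal) / TOTAL_SAMPLES
step9_accuracy = sum(step9_diagonal) / TOTAL_SAMPLES

# Positive numbers mean the digit was recognised more often in Step 9
per_digit_change = [after - before
                    for before, after in zip(step8_diagonal, step9_diagonal)]

print(f"Step 8 accuracy: {step8_accuracy:.3f}")
print(f"Step 9 accuracy: {step9_accuracy:.3f}")
print("change per digit:", per_digit_change)
```

In this particular run the overall accuracy barely moves, but the per-digit changes show that the errors shift between classes, which the single accuracy figure hides.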

Lesson 7: Machine Learning for Web Security: Deep Packet Inspection

Reading List

Additional Notes

https://www.ptsecurity.com/ww-en/analytics/web-vulnerabilities-2020/

https://www.cc.gatech.edu/fac/Alex.Orso/papers/halfond.viegas.orso.ISSSE06.pdf

https://www.sciencedirect.com/science/article/abs/pii/S1389128619311247?casa_token=0wG5VXinvzkAAAAA:fhLAC7VBaIR-W7EYugtfKiY_TnH6WJsW1XsgHRsvgvweM9C7w6vn_ePoRc69TS0VTzVMASulJA

https://ieeexplore.ieee.org/document/8567980

http://users.ics.aalto.fi/mpolla/publications/polla08multinomial.pdf

https://koreascience.kr/article/JAKO201726163355323.pdf

https://www.sciencedirect.com/science/article/pii/S2405959518300493

Practical

Note: A few of the links are DOWN

Link to colab:

https://colab.research.google.com/drive/1qamjzAO3wXkXi0_sWfL7ttNwvFFcdfcH?usp=sharing

The data file is here. Using the Copy feature is easier:

https://drive.google.com/drive/folders/1YPb0-n8ZkYitc1qw0aLQryyJHJzwaXn6?usp=sharing

Lesson 7; Step 1

Change the "From" to "from". Python is case-sensitive.

import pandas as pd

import numpy as np

from sklearn import preprocessing

from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

%matplotlib inline

Lesson 7; Step 2

Data to be placed here:

MyDrive/Colab Notebooks/Datasets/Intrusion_Dataset/GeneratedLabelledFlows/

Change the code and execute it to read the dataset

#read dataset

dataset = pd.read_csv(path_to_dataset, engine='python')

dataset = pd.read_csv(path_to_dataset,encoding='cp1252')

A warning appears, but it's OK

/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:3326: DtypeWarning: Columns (0,1,3,6,84) have mixed types.Specify dtype option on import or set low_memory=False.

exec(code_obj, self.user_global_ns, self.user_ns)

Lesson 7; Step 5

Ignore this instruction in your note:

Note: change the condition == “BENIGN” to !=“BENIGN”.

Just execute the next set of code

Lesson 7; Step 6

My Output:

The 2180 attack labels are the TOTAL of the non-BENIGN rows (1507 + 652 + 21)

BENIGN 5087

Web Attack – Brute Force 1507

Web Attack – XSS 652

Web Attack – Sql Injection 21

Name: Label, dtype: int64
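The 2180 figure can be checked with a short tally. This is only a sketch: the counts are typed in from my output above rather than read from the dataset, and the variable names are my own.

```python
from collections import Counter

# Label counts copied from the value_counts() output above.
label_counts = Counter({
    "BENIGN": 5087,
    "Web Attack – Brute Force": 1507,
    "Web Attack – XSS": 652,
    "Web Attack – Sql Injection": 21,
})

# Everything that is not BENIGN counts as an attack.
attack_total = sum(count for label, count in label_counts.items()
                   if label != "BENIGN")
all_rows = sum(label_counts.values())

print("attack rows:", attack_total)
print("all rows:", all_rows)
print(f"attack share: {attack_total / all_rows:.1%}")
```

Note how small the Sql Injection class is (21 rows); per-class scores for it will be unstable from run to run.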

Lesson 7; Step 7

Execute this first:

dataset_balanced.to_csv("web_attacks_balanced.csv", index=False)

The dataset_balanced_attack output may be different

Lesson 7; Step 10

Y_train & Y_test output may be different

My Output of cross_val_score

My output of Source(tree.export_graphviz(decision_tree, out_file=None, feature_names=X.columns))

gini = 0.0

samples = 5086

value = 5086.0
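A node that reports gini = 0.0 is a pure node: every sample in it belongs to one class. The sketch below uses the standard Gini impurity formula, 1 minus the sum of squared class proportions; the helper name gini is my own, and the second call uses the label counts from Step 6 purely as an illustration.

```python
def gini(class_counts):
    """Gini impurity of a node: 1 - sum(p_k ** 2) over the class proportions."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# A node holding 5086 samples of a single class, as in the output above.
print(gini([5086]))

# A mixed node (illustrative counts: BENIGN vs. attack rows from Step 6).
print(round(gini([5087, 2180]), 3))

# A perfectly mixed two-class node gives the two-class maximum.
print(gini([50, 50]))
```

The closer the value is to 0, the less mixed the node; a two-class node can never exceed 0.5.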

The confusion matrix and classification report might differ. Please don't assume your result is incorrect if the values don't match.

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       1.00      1.00      1.00      2181

   micro avg       1.00      1.00      1.00      2181
   macro avg       0.50      0.50      0.50      2181
weighted avg       1.00      1.00      1.00      2181

Lesson 8

Reading List

Additional Notes

https://resources.sei.cmu.edu/asset_files/presentation/2009_017_400_52098.pdf

https://openaccess.city.ac.uk/id/eprint/1737/1/

https://arxiv.org/ftp/arxiv/papers/2001/2001.00917.pdf

Practical

Get a copy of the following 2 files (KDD_Train.txt & KDD_Test.txt) from:

https://drive.google.com/drive/folders/1YPb0-n8ZkYitc1qw0aLQryyJHJzwaXn6

Link to colab:

https://colab.research.google.com/drive/1j-KHUV0ZdXrctQ6yBBEI5Ydz7J4lP3e5?usp=sharing

Lesson 8; Step 3

Display the statistical information about the training dataset

dfkdd_train.describe()

Lesson 8; Step 4

Add code to remove a column called num_outbound_cmds

dfkdd_train.drop(['num_outbound_cmds'], axis=1, inplace=True)

dfkdd_test.drop(['num_outbound_cmds'], axis=1, inplace=True)

Add this to print the training columns

dfkdd_train.columns

Add this to print the test columns

dfkdd_test.columns

Lesson 8; Step 5

Click the DOWN arrow to expand the code

Lesson 8; Step 7

Add these 2 lines of code

from imblearn.over_sampling import RandomOverSampler

from collections import Counter

Change the sample() call to resample(), i.e. fit_sample() becomes fit_resample()

ros = RandomOverSampler(random_state=42)

X_res, y_res = ros.fit_resample(X, y)

print('Original dataset shape {}'.format(Counter(y)))

print('Resampled dataset shape {}'.format(Counter(y_res)))
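To see what random oversampling does to the class counts, here is a plain-Python sketch of the idea. It is not imblearn's implementation: the function random_oversample and the toy data are my own, and it simply duplicates randomly chosen minority rows until every class matches the largest one.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Duplicate random minority-class rows until all classes are the same size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())          # size of the largest class
    X_res, y_res = list(X), list(y)        # originals are kept untouched
    for label, count in counts.items():
        # positions of the rows that carry this label
        positions = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - count):
            j = rng.choice(positions)      # sample with replacement
            X_res.append(X[j])
            y_res.append(label)
    return X_res, y_res

# Toy data: four "normal" rows and one "attack" row.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = ["normal", "normal", "normal", "normal", "attack"]

X_res, y_res = random_oversample(X, y)
print("Original dataset shape {}".format(Counter(y)))
print("Resampled dataset shape {}".format(Counter(y_res)))
```

Because the new rows are exact copies, oversampling should be applied to the training data only; copies leaking into the test data would inflate the scores.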

Lesson 8; Step 8

I am using (11, 4) as the figure size, but you can use other dimensions

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier();

# fit random forest classifier on the training set

rfc.fit(X_res, y_res);

# extract important features

score = np.round(rfc.feature_importances_,3)

importances = pd.DataFrame({'feature':refclasscol,'importance':score})

importances = importances.sort_values('importance',ascending=False).set_index('feature')

# plot importances

plt.rcParams['figure.figsize'] = (11, 4)

importances.plot.bar();

My importance OUTPUT.

Note: your answer may be different

importance

feature

src_bytes 0.111

dst_host_srv_count 0.086

dst_bytes 0.071

service 0.067

logged_in 0.055

dst_host_serror_rate 0.043

dst_host_same_src_port_rate 0.038

dst_host_diff_srv_rate 0.038

Lesson 8; Step 9

! DO NOT USE THE CODE IN YOUR ASSIGNMENT !

USE THE FOLLOWING CODE INSTEAD

(Note: This process takes about 16 minutes to run!)

from sklearn.feature_selection import RFE

import itertools

rfc = RandomForestClassifier()

# create the RFE model and select 10 attributes

rfe = RFE(rfc, n_features_to_select=10)

rfe = rfe.fit(X_res, y_res)

# summarize the selection of the attributes

feature_map = [(i, v) for i, v in itertools.zip_longest(rfe.get_support(), refclasscol)]

selected_features = [v for i, v in feature_map if i==True]

My OUTPUT

['src_bytes',

'dst_bytes',

'logged_in',

'count',

'srv_count',

'dst_host_srv_count',

'dst_host_diff_srv_rate',

'dst_host_same_src_port_rate',

'dst_host_serror_rate',

'service']
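The feature_map and selected_features lines pair each column name with a True/False flag and keep the names flagged True. A small stand-alone sketch of that step follows; the mask here is a made-up stand-in for rfe.get_support() and the short column list is only illustrative.

```python
from itertools import compress

# Illustrative column names and a hypothetical boolean mask
# (in the Colab the mask comes from rfe.get_support()).
columns = ["duration", "protocol_type", "service",
           "src_bytes", "dst_bytes", "logged_in"]
support_mask = [False, False, True, True, True, True]

# The mask and the column list must line up one-to-one.
assert len(columns) == len(support_mask)

# Same logic as the list comprehension in the notes.
selected = [name for keep, name in zip(support_mask, columns) if keep]

# itertools.compress does the same filtering in one call.
selected_alt = list(compress(columns, support_mask))

print(selected)
```

One caution: zip_longest pads the shorter input with None, so a length mismatch between the mask and the column list would go unnoticed; checking the lengths first, as above, avoids that.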

Lesson 8; Step 10

To create the sc_testdf, execute the creation code at the top of the Colab

My OUTPUT

(22544, 41)

Lesson 8; Step 12

USE THIS CODE

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore',sparse=False)

Xresdf = pretrain

newtest = pretest

Xresdfnew = Xresdf[selected_features]

Xresdfnum = Xresdfnew.drop(['service'], axis=1)

Xresdfcat = Xresdfnew[['service']].copy()

Xtest_features = newtest[selected_features]

Xtestdfnum = Xtest_features.drop(['service'], axis=1)

Xtestcat = Xtest_features[['service']].copy()

# Fit train data

enc_fit=enc.fit(Xresdfcat)

enc=enc_fit

# Transform train data

X_train_1hotenc = enc.transform(Xresdfcat)

# Transform test data

X_test_1hotenc = enc.transform(Xtestcat)

X_train = np.concatenate((Xresdfnum.values, X_train_1hotenc), axis=1)

X_test = np.concatenate((Xtestdfnum.values, X_test_1hotenc), axis=1)

y_train = Xresdf[['attack_class']].copy()

c, r = y_train.values.shape

Y_train = y_train.values.reshape(c,)

y_test = newtest[['attack_class']].copy()

c, r = y_test.values.shape

Y_test = y_test.values.reshape(c,)
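The encoder is fitted on the training 'service' column only and then applied to both sets. With handle_unknown='ignore', a service value that appears only in the test set becomes a row of zeros instead of raising an error. The plain-Python sketch below mimics that behaviour; fit_categories, one_hot and the toy values are my own and are not the scikit-learn implementation.

```python
def fit_categories(values):
    """Learn the category list from the training column (sorted, unique)."""
    return sorted(set(values))

def one_hot(values, categories):
    """Encode values against known categories; unknown values become all zeros."""
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0.0] * len(categories)
        if value in index:              # unseen categories are silently ignored
            row[index[value]] = 1.0
        rows.append(row)
    return rows

train_service = ["http", "smtp", "ftp", "http"]
test_service = ["smtp", "telnet"]       # "telnet" never appears in training

categories = fit_categories(train_service)
train_encoded = one_hot(train_service, categories)
test_encoded = one_hot(test_service, categories)

print(categories)
print(train_encoded)
print(test_encoded)
```

Fitting on the training column alone is what guarantees X_train and X_test end up with the same number of columns when they are concatenated with the numeric features.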

Lesson 8; Step 13 & Step 14 to be left for students. Only strong students should get full marks for this assignment!

from sklearn import metrics

from sklearn.model_selection import cross_val_score

models = []

#models.append(('SVM Classifier', SVC_Classifier))

models.append(('Naive Bayes Classifier', BNB_Classifier))

models.append(('Decision Tree Classifier', DTC_Classifier))

#models.append(('RandomForest Classifier', RF_Classifier))

models.append(('KNeighborsClassifier', KNN_Classifier))

models.append(('LogisticRegression', LGR_Classifier))

#models.append(('VotingClassifier', VotingClassifier))

for i, v in models:

    scores = cross_val_score(v, X_train, Y_train, cv=10)

    accuracy = metrics.accuracy_score(Y_train, v.predict(X_train))

    confusion_matrix = metrics.confusion_matrix(Y_train, v.predict(X_train))

    classification = metrics.classification_report(Y_train, v.predict(X_train))

    print()

    print('============================== {} {} Model Evaluation =============================='.format(grpclass, i))

    print()

    print("Cross Validation Mean Score:" "\n", scores.mean())

    print()

    print("Model Accuracy:" "\n", accuracy)

    print()

    print("Confusion matrix:" "\n", confusion_matrix)

    print()

    print("Classification report:" "\n", classification)

    print()

for i, v in models:

    accuracy = metrics.accuracy_score(Y_test, v.predict(X_test))

    confusion_matrix = metrics.confusion_matrix(Y_test, v.predict(X_test))

    classification = metrics.classification_report(Y_test, v.predict(X_test))

    print()

    print('============================== {} {} Model Test Results =============================='.format(grpclass, i))

    print()

    print("Model Accuracy:" "\n", accuracy)

    print()

    print("Confusion matrix:" "\n", confusion_matrix)

    print()

    print("Classification report:" "\n", classification)

    print()

Lesson 9

Reading List

Additional Notes

https://www.usenix.org/legacy/event/sec08/tech/full_papers/provos/provos.pdf

https://jis-eurasipjournals.springeropen.com/articles/10.1186/s13635-017-0055-6

https://www.researchgate.net/profile/Oemer-Aslan-5/publication/321759520_Investigation_of_Possibilities_to_Detect_Malware_Using_Existing_Tools/links/5a30d45c0f7e9b0d50f90509/Investigation-of-Possibilities-to-Detect-Malware-Using-Existing-Tools.pdf

https://www.av-test.org/en/statistics/malware/

https://books.google.com.sg/books?hl=en&lr=&id=dJQMEAAAQBAJ&oi=fnd&pg=PR3&ots=alB58G0DJR&sig=eyB5iugOfpocsEk0wniOAIzN9FY&redir_esc=y#v=onepage&q&f=false

https://arxiv.org/pdf/1703.10926.pdf

https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8949524

https://www.researchgate.net/profile/Rami-Sihwail/publication/328760930_A_Survey_on_Malware_Analysis_Techniques_Static_Dynamic_Hybrid_and_Memory_Analysis/links/5d73c61892851cacdb28d68f/A-Survey-on-Malware-Analysis-Techniques-Static-Dynamic-Hybrid-and-Memory-Analysis.pdf

https://www.researchgate.net/profile/Radu-State/publication/220673433_Malware_behaviour_analysis/links/0912f5141fed0de798000000/Malware-behaviour-analysis.pdf

https://orca.cardiff.ac.uk/id/eprint/29469/2/2012AloseferYPhD.pdf

http://www.malgenomeproject.org/

https://arxiv.org/ftp/arxiv/papers/1607/1607.08166.pdf

https://arxiv.org/ftp/arxiv/papers/1406/1406.7061.pdf

Students are to watch this video and raise any queries on the Lesson 10 lecture

https://au-lti.bbcollab.com/collab/ui/session/playback

Practical

https://github.com/tuff96/Malware-detection-using-Machine-Learning

Link to practical Google colab:

https://drive.google.com/drive/folders/1ZO1Cd0NspWjEJuOLW7_NcnUNhgj2TJWZ?usp=sharing

Lesson 9 Step 2

Check that the path in your code matches the location of your data. Note the 's' in Malware_Datasets

/Colab Notebooks/Datasets/Malware_Datasets/data.csv

Lesson 9 Step 3

dataset['legitimate'].value_counts()

Lesson 9 Step 5

X = dataset.drop('legitimate', axis = 1)

Y = dataset['legitimate']

Lesson 9 Step 6

Remember the name used. It is case-sensitive.

I am using the lower-case letter 'f' in feature_selection

feature_selection = ske.ExtraTreesClassifier().fit(X,Y)

My OUTPUT

ExtraTreesClassifier()

ADD

# import the libraries for feature selection

from sklearn.feature_selection import SelectFromModel

import sklearn.ensemble as ske

# Store SelectFromModel in a variable called model

model = SelectFromModel(feature_selection, prefit = True)

My OUTPUT

SelectFromModel(estimator=ExtraTreesClassifier(), prefit=True)

DON'T WORRY. This is just a warning message

/usr/local/lib/python3.7/dist-packages/sklearn/base.py:444: UserWarning: X has feature names, but SelectFromModel was fitted without feature names

f"X has feature names, but {self.__class__.__name__} was fitted without"

My X_new OUTPUT

array([[3.44040000e+04, 2.40000000e+02, 8.22600000e+03, ...,

1.67360000e+04, 3.43726791e+00, 1.60000000e+01],

[3.44040000e+04, 2.40000000e+02, 8.22600000e+03, ...,

1.67360000e+04, 3.46549876e+00, 1.60000000e+01],

[3.44040000e+04, 2.40000000e+02, 8.22600000e+03, ...,

1.67360000e+04, 3.46647369e+00, 1.60000000e+01],

...,

[3.32000000e+02, 2.24000000e+02, 8.45000000e+03, ...,

0.00000000e+00, 5.02069508e+00, 1.60000000e+01],

[3.32000000e+02, 2.24000000e+02, 2.71000000e+02, ...,

3.27680000e+04, 5.26917350e+00, 0.00000000e+00],

[3.32000000e+02, 2.24000000e+02, 7.70000000e+02, ...,

1.34400000e+03, 4.91161452e+00, 0.00000000e+00]])

My X_new shape OUTPUT

(10539, 9)

Lesson 9 Step 7

ADD and CHANGE random_state = 42

# import the library for data split

from sklearn.model_selection import train_test_split

# Splitting the dataset into the Training set and Test set

X_train, X_test, y_train, Y_test = train_test_split(X_new, Y, test_size = 0.20, random_state = 42)

My X_train OUTPUT

array([[3.32000000e+02, 2.24000000e+02, 7.83000000e+02, ...,

0.00000000e+00, 5.51383493e+00, 0.00000000e+00],

[3.44040000e+04, 2.40000000e+02, 8.22600000e+03, ...,

1.67360000e+04, 7.98061409e+00, 1.60000000e+01],

[3.44040000e+04, 2.40000000e+02, 8.22600000e+03, ...,

1.67360000e+04, 4.87356955e+00, 1.60000000e+01],

...,

[3.32000000e+02, 2.24000000e+02, 2.71000000e+02, ...,

0.00000000e+00, 6.13220605e+00, 1.80000000e+01],

[3.44040000e+04, 2.40000000e+02, 8.22600000e+03, ...,

1.67360000e+04, 3.44729660e+00, 1.60000000e+01],

[3.32000000e+02, 2.24000000e+02, 2.58000000e+02, ...,

3.30880000e+04, 4.45728744e+00, 0.00000000e+00]])

My y_train OUTPUT

9217 0

2935 1

871 1

1263 1

1618 1

..

5734 0

5191 0

5390 0

860 1

7270 0

Name: legitimate, Length: 8431, dtype: int64

My X_test OUTPUT

array([[3.44040000e+04, 2.40000000e+02, 8.22600000e+03, ...,

1.67360000e+04, 5.18299414e+00, 1.70000000e+01],

[3.32000000e+02, 2.24000000e+02, 3.31670000e+04, ...,

3.27680000e+04, 5.90727483e+00, 1.30000000e+01],

[3.32000000e+02, 2.24000000e+02, 3.31670000e+04, ...,

3.30880000e+04, 7.97187350e+00, 1.50000000e+01],

...,

[3.32000000e+02, 2.24000000e+02, 3.31660000e+04, ...,

0.00000000e+00, 7.45655950e+00, 0.00000000e+00],

[3.32000000e+02, 2.24000000e+02, 2.71000000e+02, ...,

0.00000000e+00, 5.14388114e+00, 1.80000000e+01],

[3.32000000e+02, 2.24000000e+02, 3.31670000e+04, ...,

3.30880000e+04, 7.97187350e+00, 1.50000000e+01]])

My Y_test OUTPUT

518 1

439 1

7061 0

8141 0

8361 0

..

6326 0

10210 0

7474 0

4692 0

3799 0

Name: legitimate, Length: 2108, dtype: int64
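The lengths in my output (8431 training rows and 2108 test rows) follow from test_size = 0.20 applied to the 10539 rows of X_new, with the test share rounded up. The sketch below reproduces only that arithmetic and a seeded shuffle; split_indices is my own helper, and it will not reproduce the exact row order that train_test_split picks.

```python
import math
import random

def split_indices(n_samples, test_size=0.20, seed=42):
    """Shuffle row indices with a fixed seed and cut off the test share."""
    n_test = math.ceil(n_samples * test_size)   # test share is rounded up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # same seed gives the same split
    return indices[n_test:], indices[:n_test]   # (train, test)

train_idx, test_idx = split_indices(10539)
print(len(train_idx), len(test_idx))

# Re-running with the same seed gives an identical split.
again_train, again_test = split_indices(10539)
print(train_idx == again_train and test_idx == again_test)
```

Fixing random_state is what makes the split repeatable between runs; with a different seed the row indices shown in the outputs above would change.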

Lesson 9 Step 8

My X_train OUTPUT

array([[-0.66719497, -0.65670766, -0.56942055, ..., -1.54623436,

0.01357061, -1.50298017],

[ 1.49881226, 1.4646342 , 0.16038005, ..., -0.31072367,

1.31563293, 0.68421049],

[ 1.49881226, 1.4646342 , 0.16038005, ..., -0.31072367,

-0.32438645, 0.68421049],

...,

[-0.66719497, -0.65670766, -0.61962315, ..., -1.54623436,

0.33997103, 0.95760933],

[ 1.49881226, 1.4646342 , 0.16038005, ..., -0.31072367,

-1.07722898, 0.68421049],

[-0.66719497, -0.65670766, -0.62089782, ..., 0.89643878,

-0.54411639, -1.50298017]])

My X_test OUTPUT

array([[ 1.49881226, 1.4646342 , 0.16038005, ..., -0.31072367,

-0.16106007, 0.82090991],

[-0.66719497, -0.65670766, 2.60589354, ..., 0.87281525,

0.22124355, 0.27411224],

[-0.66719497, -0.65670766, 2.60589354, ..., 0.89643878,

1.31101931, 0.54751108],

...,

[-0.66719497, -0.65670766, 2.60579549, ..., -1.54623436,

1.03901646, -1.50298017],

[-0.66719497, -0.65670766, -0.61962315, ..., -1.54623436,

-0.18170544, 0.95760933],

[-0.66719497, -0.65670766, 2.60589354, ..., 0.89643878,

1.31101931, 0.54751108]])
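The scaled arrays above look like z-scores (values centred on 0). Assuming this step applies standard scaling (the scaler code itself is not shown in these notes, so that is an assumption), the key point is that the mean and standard deviation are learned from the training data only and then reused on the test data. A plain-Python sketch for one column, with helper names of my own and toy values taken from the first column of the earlier output:

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn mean and population standard deviation from the training column."""
    mu = mean(column)
    sigma = pstdev(column)
    return mu, (sigma if sigma else 1.0)   # guard against a constant column

def transform(column, mu, sigma):
    """Apply z = (x - mean) / std using the statistics learned in training."""
    return [(x - mu) / sigma for x in column]

# Toy values for a single feature column.
train_col = [332.0, 34404.0, 34404.0, 332.0]
test_col = [34404.0, 332.0]

mu, sigma = fit_scaler(train_col)              # fit on training data only
train_scaled = transform(train_col, mu, sigma)
test_scaled = transform(test_col, mu, sigma)   # reuse the training statistics

print(mu, sigma)
print(train_scaled)
print(test_scaled)
```

Scaling matters most for distance-based models such as KNN, where an unscaled large-valued column would dominate the distance.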

Lesson 9 Step 10

This section is to differentiate the STRONG student from the rest of the class

Students to plot the tree. Just plot ONE tree from a random forest classifier.

from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(max_leaf_nodes=None, random_state=0)

classifier.fit(X_train, y_train)

from graphviz import Source

from sklearn import tree

Source(tree.export_graphviz(classifier, out_file=None, feature_names=['Machine','SizeOfOptionalHeader','Characteristics','MajorLinkerVersion','MinorLinkerVersion','SizeOfCode','SizeOfInitializedData','SizeOfUninitializedData','AddressOfEntryPoint','BaseOfCode','BaseOfData','ImageBase'], filled=True))

Lesson 9 Step 11

ADD Code

y_pred = classifier.predict(X_test)

y_pred

My OUTPUT. Your result may be different!

array([1, 1, 0, ..., 0, 0, 0])

ADD Code

from sklearn.metrics import confusion_matrix, classification_report

cm = confusion_matrix(Y_test, y_pred)

print(cm)

print(classification_report(Y_test, y_pred))

My OUTPUT. Your result may be different

[[1376 18]

[ 26 688]]

              precision    recall  f1-score   support

           0       0.98      0.99      0.98      1394
           1       0.97      0.96      0.97       714

    accuracy                           0.98      2108
   macro avg       0.98      0.98      0.98      2108
weighted avg       0.98      0.98      0.98      2108

ADD CODE

from sklearn.neighbors import KNeighborsClassifier

classifier_KNN = KNeighborsClassifier(n_neighbors = 3, metric = 'minkowski', p = 2)

classifier_KNN.fit(X_train, y_train)

print("KNN Model has been trained")
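With n_neighbors = 3, metric = 'minkowski' and p = 2, the classifier looks at the three closest training rows by Euclidean distance and takes a majority vote. The plain-Python sketch below shows that idea on made-up 2-feature data; knn_predict is my own helper and is not how scikit-learn implements the search internally.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Minkowski distance with p = 2."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(X_train, y_train, query, k=3):
    """Return the majority label among the k nearest training rows."""
    distances = sorted(
        (euclidean(row, query), label) for row, label in zip(X_train, y_train)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Made-up, already-scaled training data: class 0 near (-0.6, -0.6), class 1 near (1.5, 1.5).
X_toy = [[-0.7, -0.6], [-0.6, -0.7], [-0.5, -0.5],
         [1.5, 1.4], [1.4, 1.5], [1.6, 1.6]]
y_toy = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_toy, y_toy, [1.3, 1.3]))
print(knn_predict(X_toy, y_toy, [-0.4, -0.6]))
```

An odd k such as 3 avoids tied votes when there are only two classes.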

ADD CODE

y_pred_KNN = classifier_KNN.predict(X_test)

y_pred_KNN

My OUTPUT

array([1, 1, 0, ..., 0, 0, 0])

ADD CODE

from sklearn.metrics import confusion_matrix, classification_report

cm = confusion_matrix(Y_test, y_pred_KNN)

print(cm)

print(classification_report(Y_test, y_pred_KNN))

My OUTPUT.

Your output may be different

[[1377 17]

[ 27 687]]

              precision    recall  f1-score   support

           0       0.98      0.99      0.98      1394
           1       0.98      0.96      0.97       714

    accuracy                           0.98      2108
   macro avg       0.98      0.97      0.98      2108
weighted avg       0.98      0.98      0.98      2108

Lesson 10

Reading List

Additional Notes

https://www.usenix.org/system/files/conference/cset18/cset18-paper-marchal.pdf

https://www.tandfonline.com/doi/abs/10.1080/15536548.2016.1139423?casa_token=vuto3IQrrkQAAAAA%3AaM4nR9Ce3FDhZ5FpF3Gai11MBwzYYQhBJw15Qp4CW4JvRfl206rlias27v8qPj0WhxlFVVJ4R1jq&journalCode=uips20

https://ieeexplore.ieee.org/document/6289402

https://dl.acm.org/doi/pdf/10.1145/3025453.3025831

https://www.ripublication.com/ijaer19/ijaerv14n9_15.pdf

https://docs.apwg.org/reports/apwg_trends_report_q1_2020.pdf

https://koreascience.kr/article/JAKO201310554376257.pdf

https://www.usenix.org/system/files/sec19-cidon.pdf

https://dl.acm.org/doi/abs/10.1145/1572532.1572536?casa_token=uaWY0GSm1QEAAAAA%3AD4WjX9bHeLjW7uyUN3S2T9uAwVovWUvJkijgC3MeldZU3u_IUBjWlTNeXhhOpTi0f7IPe-4cceY5

https://www.sciencedirect.com/science/article/pii/S0167404806001581?casa_token=JySafe1ywwUAAAAA:WdR0bu9No33zAsocjiRfytAYnEmHKV8F4ZFQkVSbYJvQxhaOBMfAAOV0D9jnJ1dtJeR9DIBT6Q

https://arxiv.org/pdf/2009.11116.pdf

https://cs229.stanford.edu/proj2012/ZhangYuan-PhishingDetectionUsingNeuralNetwork.pdf

http://www.neuro.nigmatec.ru/materials/themeid_17/riedmiller93direct.pdf

https://core.ac.uk/reader/30732240

https://www.phishtank.com/

https://romisatriawahono.net/lecture/rm/survey/network%20security/Khonji%20-%20Phishing%20Detection%20-%202013.pdf

https://ietresearch.onlinelibrary.wiley.com/doi/pdfdirect/10.1049/iet-ifs.2013.0202

https://www.sciencedirect.com/science/article/pii/S187770581200940X

https://dl.acm.org/doi/abs/10.1145/2851613.2851801?casa_token=m5Lxk4CFuvMAAAAA%3AhDIWwfjXPwQ8KWiYtW1KGqXvjuknDf1IJ0-AKtbQx5HNY7McxcCZNmVTD8dhmu9LZmg1A_drb0ip

Practical

Link to colab:

https://colab.research.google.com/drive/1_10fLtfoJNiQCyIw4BGr9vAKE3K7QH-5?usp=sharing

The code and data are based on the following GitHub repository:

https://github.com/npapernot/phishing-detection

Lesson 10 Step 1

ADD Code

%matplotlib inline

Lesson 10 Step 3

Change the column name to 'Labels'

dataset['Labels'].value_counts()

Lesson 10 Step 4

Change the column name to 'Labels'

features = dataset.drop('Labels', axis =1) #dropping the labels and saving the features

Labels = dataset["Labels"]

Change 'shapem' to 'shape'

features.shape

Lesson 10 Step 7

from sklearn.model_selection import cross_validate #import additional library for cross validation

scores = cross_validate(Attack_Detection_Model, X_train, Y_train, cv=10, return_train_score=False)

print(scores) #print the scores

My OUTPUT.

Your output may be different

{'fit_time': array([0.00843048, 0.00527406, 0.00514007, 0.00520182, 0.00562763,

0.00496912, 0.00503826, 0.00524092, 0.00513697, 0.00551486]), 'score_time': array([0.00237703, 0.00195217, 0.00189662, 0.00204945, 0.00196123,

0.00186229, 0.00186372, 0.00190568, 0.00188684, 0.00202179]), 'test_score': array([0.96 , 0.94 , 0.965, 0.97 , 0.905, 0.935, 0.95 , 0.95 , 0.97 ,

0.96 ])}
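The ten values in 'test_score' are the per-fold scores (accuracy, for a classifier with the default scoring). They are easier to compare across models once summarised as a mean and a spread. A short sketch using the numbers from my output above; the variable names are my own.

```python
from statistics import mean, stdev

# test_score values copied from the cross_validate output above (cv=10).
test_scores = [0.96, 0.94, 0.965, 0.97, 0.905, 0.935, 0.95, 0.95, 0.97, 0.96]

cv_mean = mean(test_scores)
cv_std = stdev(test_scores)   # sample standard deviation across the folds

print(f"folds = {len(test_scores)}")
print(f"mean accuracy = {cv_mean:.4f}")
print(f"std across folds = {cv_std:.4f}")
print(f"worst fold = {min(test_scores)}, best fold = {max(test_scores)}")
```

A large gap between the worst and best fold suggests the score depends noticeably on which rows land in each fold.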

Import graphviz first, and change 'Graph' on the last line to 'graph' (lower-case 'g').

Remember that Python is case-sensitive.

import graphviz

# DOT data

dot_data = tree.export_graphviz(Attack_Detection_Model, out_file=None,

feature_names= X_train.columns,

filled=True)

# Draw graph

graph = graphviz.Source(dot_data, format="png")

graph

Lesson 10 Step 8

# Use the trained classifier to make predictions on the test data

predictions = Attack_Detection_Model.predict(X_test)

print(predictions)

print("Predictions on testing data computed.")

My OUTPUT

[-1 -1 1 ... 1 -1 -1]

Predictions on testing data computed.

from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(Y_test, predictions))

print(classification_report(Y_test, predictions))

My OUTPUT

[[3462 563]

[ 306 4724]]

              precision    recall  f1-score   support

          -1       0.92      0.86      0.89      4025
           1       0.89      0.94      0.92      5030

    accuracy                           0.90      9055
   macro avg       0.91      0.90      0.90      9055
weighted avg       0.90      0.90      0.90      9055

https://arxiv.org/ftp/arxiv/papers/1811/1811.00921.pdf

https://www.sciencedirect.com/science/article/abs/pii/S1389128612004124?casa_token=LMmvbMuZP0YAAAAA:ufJTFhHFaZD9m6a6TrpcuhVzzA3O5OyoSHMYQU8XqLzTS6ev_PAwe5WL4fvSt8OXFkTXLrqynw

https://www.sciencedirect.com/science/article/pii/S0167404819302032?casa_token=SwFQztmgAy0AAAAA:mNbWgsZ4Gtl-SwY25qKv8SRb1OVWNdZlrukGM7scTWDA3ed0cn91R4EKI7OJOh-f45oa7PXDOQ

https://dl.acm.org/doi/abs/10.1145/3339252.3340513?casa_token=s2OaafmANZcAAAAA%3AlBnvytElXCIqghyloLIri9NwEN96TIYFIlz8DyfY6-afFA_pAKnm5GZfVivHJg5FFuqKJCdTKEbC

https://www.kaggle.com/