Contact: +6017-761 9288
CP2421 has been officially renamed "Machine Learning for Cybersecurity"
Examination (centrally administered) - (40%)
Case study analysis - (30%)
Project report - (30%)
Introduction to the Machine Learning Approach
Log Analysis for Web Security [Log Analysis; Python]
Botnet Detection [Botnet detection; Python]
User Activity Keystroke Analysis [Keystroke analysis; Weka]
User Activity Mouse Dynamics Analysis [Mouse Dynamic Analysis; Python]
Adversarial Machine Learning [Adversarial ML; Python]
Machine Learning for Web Security; Deep Packet Inspection [Deep packet inspection; Python]
Machine Learning for Intrusion Detection [Intrusion Detection; Python]
Malware Detection using ML [Malware Detection; Python]
Phishing Detection using ML & Automated Alert Correlation [Phishing detection; Python]
MODEL: a mathematical representation of a real world process; a predictive model forecasts a future outcome based on past behaviors.
TRAINING: the process of creating a model from the training data. The data is fed into the training algorithm, which learns a representation for the problem, and produces a model. Also called “learning.”
CLASSIFICATION: a prediction method that assigns each data point to a predefined category, e.g., a type of operating system.
TRAINING SET: a dataset used to find potentially predictive relationships that will be used to create a model.
FEATURE: also known as an independent variable or a predictor variable, a feature is an observable quantity, recorded and used by a prediction model. You can also engineer features by combining them or adding new information to them.
ALGORITHM: a set of rules used to make a calculation or solve a problem.
REGRESSION: a prediction method whose output is a real number, that is, a value that represents a quantity along a line.
TARGET: in statistics, it is called the dependent variable; it is the output of the model or the variable you wish to predict.
TEST SET: a dataset, separate from the training set but with the same structure, used to measure and benchmark the performance of various models.
OVERFITTING: a situation in which a model that is too complex for the data has been trained to predict the target. This leads to an overly specialized model, which makes predictions that do not reflect the reality of the underlying relationship between the features and target.
Confusion Matrix
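A confusion matrix tabulates actual classes against predicted classes, so that correct and incorrect predictions can be counted per class. The following is a minimal sketch in plain Python; the helper name build_confusion_matrix and the six example labels are hypothetical and only serve as an illustration.

```python
from collections import Counter

def build_confusion_matrix(actual, predicted, labels):
    """Return a nested list: row = actual class, column = predicted class."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

# Hypothetical labels for six log entries: 1 = attack, 0 = benign
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1]

matrix = build_confusion_matrix(actual, predicted, labels=[0, 1])
(tn, fp), (fn, tp) = matrix  # unpack the 2x2 case

accuracy = (tp + tn) / len(actual)   # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted attacks that are real
recall = tp / (tp + fn)              # share of real attacks that were found
print(matrix, accuracy, precision, recall)
```

The same row/column convention (rows are actual, columns are predicted) is what scikit-learn's confusion_matrix uses in the later practicals.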
Machine learning: where a system improves its performance through analysis of previous performance
Unsupervised learning: where the machine learning takes place entirely through the system analysing and categorising the available data
Supervised learning: where sample data is supplied to the system with associated data relating to the outcome of its use
Reinforcement learning: where an agent learns by receiving graded rewards for actions taken
Student Requirement
Google Account
Google Drive
Readings for Lesson 1
Pandas documentation: https://pandas.pydata.org/docs/
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
import pandas as pd
members = ["Brazil", "Russia", "India", "China", "South Africa"]
brics1 = pd.Series(members)
brics1
type(brics1)
members = {"country": ["Brazil", "Russia", "India", "China", "South Africa"],
"capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
"gdp": [2750, 1658, 3202, 15270, 370],
"literacy": [0.944, 0.997, 0.721, 0.964, 0.943],
"expectancy": [76.8, 72.7, 68.8, 76.4, 63.6],
"population": [210.87, 143.96, 1367.09, 1415.05, 57.4]
}
brics2 = pd.DataFrame(members)
brics2
type(brics2)
members = [["Brazil", "Brasilia", 2750, 0.944, 76.8, 210.87],
["Russia", "Moscow", 1658, 0.997, 72.7, 143.96],
["India", "New Delhi", 3202, 0.721, 68.8, 1367.09],
["China", "Beijing", 15270, 0.964, 76.4, 1415.05],
["South Africa", "Pretoria", 370, 0.943, 63.6, 57.4]]
labels = ["country", "capital", "gdp", "literacy", "expectancy", "population"]
brics3 = pd.DataFrame(members, columns = labels)
brics3
brics4 = pd.read_csv("brics.csv")
brics4
brics5 = pd.read_excel("brics.xlsx")
brics5
brics6 = pd.read_excel("brics.xlsx", sheet_name = "Summits")
brics6
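If brics.csv is not at hand, the same read_csv call can be tried on an in-memory file. This is a small sketch, assuming a shortened, hypothetical version of the BRICS table; the variable names csv_text, capitals and large are illustrative only.

```python
import io
import pandas as pd

# Hypothetical stand-in for brics.csv so the example runs without the file
csv_text = """country,capital,gdp,population
Brazil,Brasilia,2750,210.87
Russia,Moscow,1658,143.96
India,New Delhi,3202,1367.09
"""
brics = pd.read_csv(io.StringIO(csv_text))

# Selecting one column returns a Series
capitals = brics["capital"]

# A boolean mask keeps only the rows where the condition is True
large = brics[brics["population"] > 200]

print(brics.shape)
print(large["country"].tolist())
```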
Reading for Lesson 2
https://www.sciencedirect.com/science/article/pii/S0167404820300250
https://www.ptsecurity.com/ww-en/analytics/web-vulnerabilities-2020/
https://developer.nvidia.com/blog/cybert-rapids-ai/
http://opendl.ifip-tc6.org/db/conf/im/im2013/MakanjuZM13.pdf
https://dr.lib.iastate.edu/handle/20.500.12876/17031/
https://jis-eurasipjournals.springeropen.com/articles/10.1186/s13635-018-0081-z
https://cse.sc.edu/~huangct/CSCE813F16/07544930.pdf
The practicals are adapted from Walid Daboubi’s GitHub repository:
https://github.com/slrbl/Intrusion-and-anomaly-detection-with-machine-learning
Link to colab:
https://colab.research.google.com/drive/1x0LbDfN2yj9vYQmW_aRvxkeI41CIAkWA?usp=sharing
The dataset is obtained from the following link: http://www.secrepo.com/self.logs/
Get the dec_2016.csv and feb_2017.csv from https://drive.google.com/drive/folders/1AoB_mBMVKU2owkVBr4xxzKCgKmh-BlYb?usp=sharing
Allow Colab to access Google Drive
"Permit this notebook to access your Google Drive files?
This notebook is requesting access to your Google Drive files. Granting access to Google Drive will permit code executed in the notebook to modify files in your Google Drive. Make sure to review notebook code prior to allowing this access."
=-=-=-=-=-=-=- Decision Tree Classifier -=-=-=-=-=-=-=-
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-21-3de3a6570295> in <module>
# Train the classifier
---> attack_classifier = attack_classifier.fit(training_features, traning_labels)
# Get predections for the testing data
NameError: name 'traning_labels' is not defined
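The NameError above is raised because the name traning_labels was never defined. Most likely the labels were stored under a different spelling, such as training_labels; this is an assumption, since the cell that defines the variable is not shown. The sketch below uses hypothetical toy data to show the call with matching names.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two features per request, label 1 = attack, 0 = normal
training_features = [[0, 0], [1, 1], [0, 1], [1, 0]]
training_labels = [0, 1, 0, 1]

attack_classifier = DecisionTreeClassifier(random_state=0)

# The name passed to fit() must match the name assigned above exactly
attack_classifier = attack_classifier.fit(training_features, training_labels)

# Get predictions for two of the training points
predictions = attack_classifier.predict([[1, 1], [0, 0]])
print(predictions)
```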
For the Project Report, you are allowed to use Logistic Regression
Reading List
https://www.academia.edu/download/36990790/A_Survey_of_Botnet_and_Botnet_Detection.pdf
https://www.spamhaus.com/custom-content/uploads/2020/04/2019-Botnet-Threat-Report-2019-LR.pdf
https://securelist.com/bots-and-botnets-in-2018/90091/
https://ieeexplore.ieee.org/document/8026031
https://link.springer.com/chapter/10.1007/978-3-642-01440-6_27
https://ieeexplore.ieee.org/document/4569852
https://ieeexplore.ieee.org/document/5455789
Practical
Based on the following GitHub code:
https://github.com/ShehzadaAlam/Botnet-Detection/blob/master/Botnet%20Detection.ipynb
Link to colab: https://colab.research.google.com/drive/15nFlwvwueZjfp5MrVyb9EfuNk4cwCXZE?usp=sharing [I would suggest making at least 2 copies of this colab]
Data from here: https://drive.google.com/drive/folders/14SLU9--GnB8hbdVyZ0rUtvb15TWOExkn?usp=sharing [right-click and make a copy. Then move the data to your Colab folder. Rename to capture20110810.binetflow.2format ]
https://mcfp.weebly.com/ctu-malware-capture-botnet-42.html [download the capture20110810.pcap.netflow.labeled file. It is 370 Mb]
Note: The training process took me about 3 minutes.
Note: Make sure to use the same variable names as the ones you defined.
dataset = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Datasets/capture20110810.binetflow")
dataset = dataset[dataset['Label'].str.contains('Botnet')]
dataset.head() #print 5 rows
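The filter above keeps only the rows whose Label contains the text "Botnet". A minimal sketch of how str.contains behaves, assuming a few hypothetical rows (the label strings are made up to resemble the file); passing na=False is an optional safeguard for rows with a missing label.

```python
import pandas as pd

# Hypothetical rows shaped like the Label column of the binetflow file
flows = pd.DataFrame({
    "Dur": [1.2, 0.4, 3.1, 0.9],
    "Label": ["flow=Background-TCP", "flow=From-Botnet-V42-TCP",
              None, "flow=From-Normal-V42"],
})

# na=False treats missing labels as non-matching instead of propagating NaN
is_botnet = flows["Label"].str.contains("Botnet", na=False)
botnet_flows = flows[is_botnet]

print(len(botnet_flows))
```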
Optional:
Use dataset.describe() instead of dataset.description()
Graph may be different but you should get close to 0.30
#Omit LastTime as this column is NOT in dataset
dataset = dataset.astype({"Proto":'category',"Sport":'category',"Dport":'category',"State":'category','StartTime':'datetime64[s]'})
dataset
# NOT Required as we can use the dur (Duration) column directly
# Getting duration from the columns 'LastTime' and 'StartTime'
dataset['duration'] = abs(dataset['LastTime'].dt.second - dataset['StartTime'].dt.second)
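As noted, this step is not needed here because the dataset already has a duration column. For datasets that do have both timestamps, comparing only the .dt.second components ignores minute and hour boundaries. The sketch below, with hypothetical timestamps, contrasts that with subtracting the full timestamps.

```python
import pandas as pd

# Hypothetical timestamps; the real file has StartTime but no LastTime column
flows = pd.DataFrame({
    "StartTime": pd.to_datetime(["2011-08-10 09:46:59", "2011-08-10 09:47:10"]),
    "LastTime": pd.to_datetime(["2011-08-10 09:47:02", "2011-08-10 09:49:10"]),
})

# Difference of the seconds components only (what the line above computes)
seconds_only = (flows["LastTime"].dt.second - flows["StartTime"].dt.second).abs()

# Difference of the full timestamps, converted to elapsed seconds
duration = (flows["LastTime"] - flows["StartTime"]).dt.total_seconds()

print(seconds_only.tolist(), duration.tolist())
```

In this toy case the first flow lasts 3 seconds but the seconds-only difference reports 57, which is why the full-timestamp subtraction is usually the safer choice.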
# Drop the selected columns excluding LastTime
# LastTime does NOT exist in the dataset
dataset.drop(columns=['SrcAddr','DstAddr','StartTime'],inplace=True)
dataset
# List the column names to be removed from X (including the Dir column as it is NOT numeric)
columns = ["Proto", "Sport", "Dport", "State", "Dir"]
# Drop columns having NaN values. Exclude sHops & sTtl as these 2 columns do NOT exist
columns = ["sTos"]
X = X.drop(columns, axis=1)
X
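If you are unsure whether a column exists in your copy of the dataset, DataFrame.drop accepts errors="ignore", which skips missing labels instead of raising a KeyError. A small sketch with a hypothetical three-column frame:

```python
import pandas as pd

# Hypothetical feature frame; only sTos is present among the columns to drop
X = pd.DataFrame({"Dur": [1.0, 2.0], "sTos": [0.0, None], "TotBytes": [120, 64]})

# Labels that are not present are skipped because of errors="ignore"
columns = ["sTos", "sHops", "sTtl"]
X = X.drop(columns=columns, errors="ignore")

print(list(X.columns))
```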
*Note you may get a different number
Reading List
https://www.iii.org/fact-statistic/facts-statistics-identity-theft-and-cybercrime
https://www.theseus.fi/bitstream/handle/10024/44684/Babich_Aleksandra.pdf
https://link.springer.com/article/10.1007/s13173-013-0117-7
http://www.cs.cmu.edu/afs/cs/Web/People/maxion/pubs/KillourhyMaxion09.pdf
https://link.springer.com/content/pdf/10.1007/978-3-540-74549-5_125.pdf
For the practical, you need to download and install WEKA on your local computer:
https://waikato.github.io/weka-wiki/downloading_weka/
** This can take time, therefore it is recommended to install WEKA before you attend the practical.
Practical
Download the DSL-StrongPasswordData.csv file from here:
http://www.cs.cmu.edu/~keystroke/ OR
https://drive.google.com/file/d/1LLCPlzYXvZbeX-o7EalsdFMpfulV17_-/view?usp=sharing
RandomForest -P 100 -print -I 300 -num-slots 1 -K 34 -M 1.0 -V 0.001 -S 1 -batch-size 200
If you run into an OutOfMemoryException, try again first; sometimes the 1st run is NOT successful.
If the error persists, you will have to increase the maximum heap size.
How much you can allocate depends heavily on the operating system and the underlying hardware.
Locate and change the RunWeka.ini file. Note: I have an 8 GB RAM machine.
# The JAVA_OPTS environment variable (if set). Can be used as an alternative way to set
# the heap size (or any other JVM option)
#javaOpts=%JAVA_OPTS%
javaOpts=%JAVA_OPTS% -Xmx8048m
Correctly Classified Instances 1830 89.7059 %
Incorrectly Classified Instances 210 10.2941 %
Kappa statistic 0.8949
Mean absolute error 0.0113
Root mean squared error 0.0634
Relative absolute error 29.2818 %
Root relative squared error 45.7395 %
Total Number of Instances 2040
MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 1 -E 20 -H a -G -R -batch-size 200
Correctly Classified Instances 1720 84.3137 %
Incorrectly Classified Instances 320 15.6863 %
Kappa statistic 0.8399
Mean absolute error 0.0081
Root mean squared error 0.0714
Relative absolute error 21.0242 %
Root relative squared error 51.4638 %
Total Number of Instances 2040
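The percentages in the two Weka summaries follow directly from the instance counts. A quick check in plain Python, using only the counts reported above:

```python
# Counts copied from the RandomForest and MultilayerPerceptron summaries
results = {
    "RandomForest": (1830, 210),
    "MultilayerPerceptron": (1720, 320),
}

accuracies = {}
for name, (correct, incorrect) in results.items():
    total = correct + incorrect
    # Accuracy in percent = correctly classified / all instances
    accuracies[name] = 100 * correct / total
    print(f"{name}: {accuracies[name]:.4f} % of {total} instances")
```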
Reading List
Additional Notes
Practical
Download the following two CSV (balabit_39feat_PC_MM_DD_250.csv & balabit_39feat_PC_MM_DD_50.csv) files from the link:
https://drive.google.com/drive/folders/1Yqe2Hw2ECPSu3nTIk2-wQe0dVUAUIXd2
https://colab.research.google.com/drive/1Q3-AgQdNGkeozgqfOg9dMydulEuzpptV?usp=sharing
numpy is a fundamental package for scientific computing with Python. More info at https://numpy.org/
import sys
import warnings
import copy
from sklearn import model_selection, metrics
from sklearn.metrics import roc_curve, auc, roc_auc_score, accuracy_score
import numpy as np
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline
MyDrive/Colab Notebooks/Datasets/Mouse_Dynamics_Data/features/
*Note: The Column and Shape steps can be swapped.
Output of info()
Data columns (total 40 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 type_of_action 500 non-null int64
1 traveled_distance_pixel 500 non-null float64
2 elapsed_time 500 non-null float64
3 direction_of_movement 500 non-null int64
4 straightness 500 non-null float64
5 num_points 500 non-null int64
6 sum_of_angles 500 non-null float64
7 mean_curv 500 non-null float64
8 sd_curv 500 non-null float64
9 max_curv 500 non-null float64
10 min_curv 500 non-null float64
11 mean_omega 500 non-null float64
12 sd_omega 500 non-null float64
13 max_omega 500 non-null float64
14 min_omega 500 non-null float64
15 largest_deviation 500 non-null float64
16 dist_end_to_end_line 500 non-null float64
17 num_critical_points 500 non-null int64
18 mean_vx 500 non-null float64
19 sd_vx 500 non-null float64
20 max_vx 500 non-null float64
21 min_vx 500 non-null float64
22 mean_vy 500 non-null float64
23 sd_vy 500 non-null float64
24 max_vy 500 non-null float64
25 min_vy 500 non-null float64
26 mean_v 500 non-null float64
27 sd_v 500 non-null float64
28 max_v 500 non-null float64
29 min_v 500 non-null float64
30 mean_a 500 non-null float64
31 sd_a 500 non-null float64
32 max_a 500 non-null float64
33 min_a 500 non-null float64
34 mean_jerk 500 non-null float64
35 sd_jerk 500 non-null float64
36 max_jerk 500 non-null float64
37 min_jerk 500 non-null float64
38 a_beg_time 500 non-null float64
39 userid 500 non-null int64
Output of ['userid'].value_counts()
*Note: The order may NOT be the same but the content is the SAME
2 250
15 250
16 250
20 250
21 250
23 250
29 250
35 250
7 250
9 250
Name: userid, dtype: int64
Note: the "name of the column " is userid
X_test = test_data.drop('userid', axis =1)
Y_test = test_data['userid']
#First import the necessary library
from sklearn.ensemble import RandomForestClassifier
# create a classifier with configuration parameters
model = RandomForestClassifier(n_estimators=200, max_leaf_nodes=5, random_state=None) #initialize the random forest classifier
model.fit(X_train, Y_train) #train the model using .fit() and training features and labels
Output shows the parameters defined for the classifier configuration, and also confirms that the training has been completed
RandomForestClassifier(max_leaf_nodes=5, n_estimators=200)
from sklearn.model_selection import cross_validate #import the cross-validation helper
scores = cross_validate(model, X_train, Y_train, cv=10, return_train_score=False) #calculate the cross-validation scores and store them in the variable scores
print(scores) #print the achieved score
My OUTPUT:
*Note: Result may be different
{'fit_time': array([0.9304142 , 0.96811843, 0.97566175, 0.94981956, 0.98914838,
1.08338332, 0.93859506, 0.95835304, 0.94819069, 0.93062997]), 'score_time': array([0.03033805, 0.02822661, 0.02799368, 0.02524114, 0.02984905,
0.02503633, 0.02721095, 0.02906919, 0.03139567, 0.02880883]), 'test_score': array([0.312, 0.328, 0.296, 0.356, 0.292, 0.336, 0.292, 0.332, 0.28 ,
0.34 ])}
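The ten test_score values are easier to compare once they are summarised. A minimal sketch, using the values from my output (yours will differ):

```python
import numpy as np

# test_score values copied from the cross_validate output
test_score = np.array([0.312, 0.328, 0.296, 0.356, 0.292,
                       0.336, 0.292, 0.332, 0.28, 0.34])

mean_score = test_score.mean()   # average accuracy over the 10 folds
std_score = test_score.std()     # spread between folds

print(f"accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```

With 10 equally represented users, random guessing would give about 0.10, so a mean near 0.32 is better than chance but still weak.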
#import additional libraries for visualizing the random forest classifier
from graphviz import Source
from sklearn import tree
from sklearn.tree import export_graphviz
#Extract a single tree and store it in estimator
estimator = model.estimators_[5]
# Export the tree as DOT source and render it
Source(tree.export_graphviz(estimator, out_file=None, feature_names=X_train.columns))
OUTPUT: A TREE Diagram
#import additional libraries for plotting the confusion matrix and classification report
from sklearn.metrics import confusion_matrix, classification_report
y_pred = model.predict(X_test) #get the predictions using the testing features and save them in y_pred
print(y_pred) #print the predictions
print(confusion_matrix(Y_test, y_pred)) #print the confusion matrix
print(classification_report(Y_test, y_pred)) #print the classification report
My Output
*Note: The predictions might differ. Please don’t assume that your result is incorrect if the predictions are different.
[12 7 15 12 15 16 20 15 35 29 35 20 15 29 12 15 12 23 15 29 12 12 20 12
16 15 15 12 12 15 12 20 20 29 29 29 29 16 29 9 12 35 7 12 16 12 15 7
15 15 15 15 15 15 12 21 35 29 35 29 29 12 12 12 12 15 12 12 12 12 15 15
12 12 15 29 15 15 12 9 23 12 16 7 12 7 15 15 29 7 23 29 15 29 16 29
15 20 15 15 29 29 29 12 29 29 16 16 16 29 16 12 15 29 23 23 23 15 23 23
15 15 15 29 16 15 12 15 15 15 23 23 12 15 29 29 35 12 9 15 29 20 12 12
16 20 12 15 29 29 7 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 7
15 20 20 20 20 20 20 20 20 20 20 20 20 20 20 7 20 20 20 20 20 20 16 20
20 15 20 20 20 20 20 15 23 29 15 23 15 16 12 12 21 23 35 35 7 29 20 12
7 12 29 9 29 7 29 15 15 15 12 20 29 15 15 15 12 20 7 15 29 15 29 29
29 7 29 7 9 29 9 15 9 15 29 23 15 12 29 29 23 15 23 12 23 7 16 15
12 29 23 23 23 7 23 15 29 15 15 29 15 29 29 15 20 12 29 29 20 15 23 12
29 20 7 12 12 15 12 29 35 29 29 15 29 29 29 15 15 12 12 16 29 29 29 35
7 29 23 29 29 16 29 15 15 12 12 29 15 15 29 29 29 29 35 29 7 29 29 29
23 23 15 15 29 29 29 29 29 20 20 29 29 12 29 23 29 29 29 29 7 35 29 23
35 29 29 29 29 29 29 15 7 16 29 7 29 9 7 29 9 29 29 29 29 35 20 29
29 29 7 29 29 35 29 29 29 29 23 29 12 29 35 12 16 20 20 20 20 20 20 20
7 20 7 16 16 7 20 16 16 7 7 7 16 7 15 15 20 7 7 7 7 15 20 7
7 20 20 20 7 29 20 7 29 7 20 20 7 7 7 7 9 7 9 7 9 9 15 9
9 9 9 9 9 9 7 9 9 9 9 12 20 9 9 9 9 9 9 29 9 9 7 9
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 20 9 9 9 9]
Note: The confusion matrix and classification report might differ. Please don’t assume that your result is incorrect if the values don't match.
[[21 1 0 3 6 17 0 0 2 0]
[ 3 42 1 1 0 2 0 0 1 0]
[ 3 1 13 12 4 5 0 1 8 3]
[ 3 1 14 16 2 1 1 2 8 2]
[ 0 1 8 12 6 2 0 7 13 1]
[ 3 0 0 3 1 43 0 0 0 0]
[ 6 4 6 12 1 3 1 3 12 2]
[ 3 0 8 11 1 3 0 9 14 1]
[ 2 0 5 8 2 2 0 3 26 2]
[ 5 2 2 1 1 1 0 3 30 5]]
precision recall f1-score support
7 0.43 0.42 0.42 50
9 0.81 0.84 0.82 50
12 0.23 0.26 0.24 50
15 0.20 0.32 0.25 50
16 0.25 0.12 0.16 50
20 0.54 0.86 0.67 50
21 0.50 0.02 0.04 50
23 0.32 0.18 0.23 50
29 0.23 0.52 0.32 50
35 0.31 0.10 0.15 50
accuracy 0.36 500
macro avg 0.38 0.36 0.33 500
weighted avg 0.38 0.36 0.33 500
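The precision, recall and accuracy figures in the report can be recomputed from the confusion matrix. The sketch below uses the matrix from my output, with rows as actual users and columns as predicted users (the scikit-learn convention).

```python
import numpy as np

# Confusion matrix copied from my output (rows = actual, columns = predicted)
cm = np.array([
    [21,  1,  0,  3,  6, 17,  0,  0,  2,  0],
    [ 3, 42,  1,  1,  0,  2,  0,  0,  1,  0],
    [ 3,  1, 13, 12,  4,  5,  0,  1,  8,  3],
    [ 3,  1, 14, 16,  2,  1,  1,  2,  8,  2],
    [ 0,  1,  8, 12,  6,  2,  0,  7, 13,  1],
    [ 3,  0,  0,  3,  1, 43,  0,  0,  0,  0],
    [ 6,  4,  6, 12,  1,  3,  1,  3, 12,  2],
    [ 3,  0,  8, 11,  1,  3,  0,  9, 14,  1],
    [ 2,  0,  5,  8,  2,  2,  0,  3, 26,  2],
    [ 5,  2,  2,  1,  1,  1,  0,  3, 30,  5],
])

correct = np.diag(cm)                 # correctly classified samples per user
recall = correct / cm.sum(axis=1)     # divided by the actual count per user
precision = correct / cm.sum(axis=0)  # divided by the predicted count per user
accuracy = np.trace(cm) / cm.sum()    # all correct predictions / all samples

print(np.round(precision, 2))
print(np.round(recall, 2))
print(round(accuracy, 2))
```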
Reading List
Additional Notes
https://pdfs.semanticscholar.org/57c5/2c98730c26290b2044ad45924e58cb2fb5cf.pdf
https://arxiv.org/pdf/1412.6572.pdf
https://proceedings.mlr.press/v9/kloft10a/kloft10a.pdf
https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_tramer.pdf
https://arxiv.org/pdf/1704.03453.pdf?source=post_page---------------------------
https://arxiv.org/pdf/1605.07277.pdf
https://arxiv.org/pdf/1704.01155.pdf
Practical
!pip install adversarial-robustness-toolbox
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from matplotlib import pyplot as plt
from art.estimators.classification import SklearnClassifier
from art.attacks.evasion import ZooAttack
from art.utils import load_mnist
import warnings
warnings.filterwarnings('ignore')
To print x_test
x_test
Comment out the n_samples_ lines
#n_samples_train = x_train.shape[0] #number of samples in the training dataset: position 0 of x_train.shape ([60000, 28, 28, 1]), which is 60000
n_features_train = x_train.shape[1] * x_train.shape[2] * x_train.shape[3] #number of training features from x_train.shape ([60000, 28, 28, 1]), which multiplies 28*28*1
#n_samples_test = x_test.shape[0] #number of samples in the testing dataset: position 0 of x_test.shape ([10000, 28, 28, 1]), which is 10000
n_features_test = x_test.shape[1] * x_test.shape[2] * x_test.shape[3]
Typo correction to code "samples"
n_samples_test = x_test.shape[0]
n_samples_train = x_train.shape[0]
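The sample and feature counts are typically used to flatten each image into one row, because scikit-learn estimators expect a 2-D array. A minimal sketch, assuming a small hypothetical array with the same layout as the MNIST data (samples, height, width, channels):

```python
import numpy as np

# Hypothetical stand-in with the MNIST layout; only 6 samples to keep it small
x_train = np.zeros((6, 28, 28, 1))

n_samples_train = x_train.shape[0]
n_features_train = x_train.shape[1] * x_train.shape[2] * x_train.shape[3]

# Flatten each 28x28x1 image into a single row of 784 values
x_train_flat = x_train.reshape(n_samples_train, n_features_train)

print(x_train_flat.shape)
```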
x_test #printing x_test
Add the following code and run it separately
# Instantiate a classifier
from sklearn.ensemble import RandomForestClassifier
#defining the random forest classifier
model = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None,
bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0,
warm_start=False, class_weight=None)
Does NOT produce output
Execute model.fit code
model.fit(x_train, y_train) #fit the model x_train and y_train
Note: Generating the adversarial samples will take time.
x_train_adv
Output
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
x_test_adv
Output
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
Execute first before starting this step
#defining the random forest classifier
model_testing = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None,
bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0,
warm_start=False, class_weight=None)
Next to Execute
#train the model with the normal x_train and y_train
model_testing.fit(x_train, y_train)
My outputs for y_pred, the confusion matrix and the classification report for the random forest classifier.
Results might differ. Please don’t assume that your result is incorrect if the predictions are different.
[7 2 1 0 4 1 9 9 0 9 0 2 9 0 1 7 4 7 2 9 9 6 2 4 4 0 7 4 0 1 3 1 3 4 9 2 7
1 1 1 1 7 4 1 3 0 1 2 4 4 6 0 4 7 2 3 4 1 4 5 7 0 4 2 2 3 1 4 3 2 7 0 2 8
1 9 3 7 7 7 9 6 2 9 5 4 7 3 2 1 3 6 9 3 1 0 1 3 2 4 2 0 7 4 4 4 0 1 9 4 3
1 3 9 7 9 4 4 4 2 3 4 7 6 9 9 0 5 3 0 6 6 8 2 8 1 0 1 6 4 2 7 7 1 4 1 6 2
0 4 4 7 0 0 1 9 2 0 2 3 4 2 5 4 2 3 0 5 1 9 9 1 2 7 1 1 1 8 1 5 1 2 5 0 3
4 2 3 0 1 1 1 4 7 0 5 1 0 3 8]
[[14 0 1 1 1 0 0 0 0 0]
[ 0 28 0 0 0 0 0 0 0 0]
[ 1 3 9 0 1 0 0 1 1 0]
[ 1 0 3 9 0 1 0 2 0 0]
[ 2 0 0 3 18 0 0 0 0 5]
[ 5 0 0 3 2 5 0 3 1 1]
[ 1 1 10 0 0 0 8 0 0 0]
[ 0 3 2 1 1 0 0 12 0 5]
[ 1 0 1 3 0 2 1 0 2 0]
[ 0 0 0 0 9 0 0 3 1 8]]
precision recall f1-score support
0 0.56 0.82 0.67 17
1 0.80 1.00 0.89 28
2 0.35 0.56 0.43 16
3 0.45 0.56 0.50 16
4 0.56 0.64 0.60 28
5 0.62 0.25 0.36 20
6 0.89 0.40 0.55 20
7 0.57 0.50 0.53 24
8 0.40 0.20 0.27 10
9 0.42 0.38 0.40 21
accuracy 0.56 200
macro avg 0.56 0.53 0.52 200
weighted avg 0.59 0.56 0.55 200
Add the code BUT REMOVE min_impurity_split=None
#defining the random forest classifier
model_testing = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None)
Add last set of code
#import the necessary libraries to print the confusion matrix and classification report
from sklearn.metrics import confusion_matrix, classification_report
y_pred = model_testing.predict(x_test_adv) #predict the values using the adversarial samples generated from x_test_adv, and store in y_pred
print(y_pred) #print the predictions
print(confusion_matrix(y_test, y_pred)) #print confusion matrix, use y_pred and y_test
print(classification_report(y_test, y_pred)) #print classification report, use y_pred and y_test
My result
Output might differ. Please don’t assume that your result is incorrect if the predictions are different.
[7 2 1 0 4 1 3 3 2 9 0 0 9 4 1 3 4 7 3 4 9 6 4 4 4 0 7 4 0 1 3 1 3 4 9 3 7
1 3 1 7 7 9 1 3 8 1 6 4 4 6 2 0 3 0 0 4 1 4 7 1 4 4 2 9 9 6 4 3 0 7 0 3 8
1 4 3 1 4 7 7 6 2 7 8 4 7 3 4 1 0 6 1 3 1 6 1 3 6 4 4 0 0 4 4 4 0 2 9 4 8
1 9 4 4 4 4 4 4 2 3 4 7 6 9 4 2 9 8 9 6 6 3 2 8 1 9 1 6 4 6 7 4 1 4 1 8 3
0 6 4 2 0 6 1 9 6 0 2 1 4 4 8 4 4 3 4 5 1 4 9 8 2 3 2 7 1 4 1 1 1 7 8 0 4
4 1 3 0 1 1 1 2 3 0 3 1 6 4 8]
[[13 0 2 0 1 0 0 0 0 1]
[ 0 26 1 0 0 0 0 1 0 0]
[ 1 3 5 4 0 0 2 0 1 0]
[ 1 0 3 10 1 0 0 0 0 1]
[ 0 1 0 1 22 0 1 0 0 3]
[ 3 0 1 6 1 1 1 1 3 3]
[ 2 0 0 0 5 0 13 0 0 0]
[ 0 2 1 1 3 0 0 13 1 3]
[ 0 1 0 0 3 0 0 1 5 0]
[ 0 1 1 2 12 0 0 0 1 4]]
precision recall f1-score support
0 0.65 0.76 0.70 17
1 0.76 0.93 0.84 28
2 0.36 0.31 0.33 16
3 0.42 0.62 0.50 16
4 0.46 0.79 0.58 28
5 1.00 0.05 0.10 20
6 0.76 0.65 0.70 20
7 0.81 0.54 0.65 24
8 0.45 0.50 0.48 10
9 0.27 0.19 0.22 21
accuracy 0.56 200
macro avg 0.59 0.53 0.51 200
weighted avg 0.61 0.56 0.53 200
Reading List
Additional Notes
https://www.ptsecurity.com/ww-en/analytics/web-vulnerabilities-2020/
https://www.cc.gatech.edu/fac/Alex.Orso/papers/halfond.viegas.orso.ISSSE06.pdf
https://ieeexplore.ieee.org/document/8567980
http://users.ics.aalto.fi/mpolla/publications/polla08multinomial.pdf
https://koreascience.kr/article/JAKO201726163355323.pdf
https://www.sciencedirect.com/science/article/pii/S2405959518300493
Practical
Note: A few of the links are DOWN
Link to colab:
https://colab.research.google.com/drive/1qamjzAO3wXkXi0_sWfL7ttNwvFFcdfcH?usp=sharing
The data file is here. Using the Copy feature is easier.
https://drive.google.com/drive/folders/1YPb0-n8ZkYitc1qw0aLQryyJHJzwaXn6?usp=sharing
Change the "From" to "from". Python is case-sensitive
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline
Data to be placed here:
MyDrive/Colab Notebooks/Datasets/Intrusion_Dataset/GeneratedLabelledFlows/
Change the code and Execute to read the dataset
#read dataset
#dataset = pd.read_csv(path_to_dataset, engine='python') #original line
dataset = pd.read_csv(path_to_dataset, encoding='cp1252') #changed line: read the file with the cp1252 encoding
A warning appears, but it is safe to ignore
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:3326: DtypeWarning: Columns (0,1,3,6,84) have mixed types.Specify dtype option on import or set low_memory=False.
exec(code_obj, self.user_global_ns, self.user_ns)
Ignore this instruction in your note:
Note: change the condition == “BENIGN” to !=“BENIGN”.
Just execute the next set of code
My Output:
The 2180 attack labels are the TOTAL of the non-BENIGN rows (1507 + 652 + 21)
BENIGN 5087
Web Attack – Brute Force 1507
Web Attack – XSS 652
Web Attack – Sql Injection 21
Name: Label, dtype: int64
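The attack total can be checked by dropping the BENIGN entry from the counts and summing the rest. A small sketch, rebuilding the counts from my output as a Series (in the notebook you would take them from the Label column's value_counts() instead):

```python
import pandas as pd

# Counts copied from my value_counts() output
label_counts = pd.Series({
    "BENIGN": 5087,
    "Web Attack – Brute Force": 1507,
    "Web Attack – XSS": 652,
    "Web Attack – Sql Injection": 21,
})

# Everything except BENIGN counts as an attack
attack_total = label_counts.drop("BENIGN").sum()

print(attack_total)
```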
Execute 1st
dataset_balanced.to_csv("web_attacks_balanced.csv", index=False)
The dataset_balanced_attack output may be different
Y_train & Y_test output may be different
My Output of cross_val_score
<function sklearn.model_selection._validation.cross_val_score(estimator, X, y=None, *, groups=None, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)>
My output of Source(tree.export_graphviz(decision_tree, out_file=None, feature_names=X.columns))
gini = 0.0
samples = 5086
value = 5086.0
The confusion matrix and classification report might differ. Please don’t assume that your result is incorrect if the values don't match.
precision recall f1-score support
0 0.00 0.00 0.00 0
1 1.00 1.00 1.00 2181
micro avg 1.00 1.00 1.00 2181
macro avg 0.50 0.50 0.50 2181
weighted avg 1.00 1.00 1.00 2181
Reading List
Practical
Get a copy of the following 2 files (KDD_Train.txt & KDD_Test.txt) from:
https://drive.google.com/drive/folders/1YPb0-n8ZkYitc1qw0aLQryyJHJzwaXn6
Link to colab:
https://colab.research.google.com/drive/1j-KHUV0ZdXrctQ6yBBEI5Ydz7J4lP3e5?usp=sharing
Display the statistical information about the training dataset
dfkdd_train.describe()
Add code to remove a column called num_outbound_cmds
dfkdd_train.drop(['num_outbound_cmds'], axis=1, inplace=True)
dfkdd_test.drop(['num_outbound_cmds'], axis=1, inplace=True)
Add to print column
dfkdd_train.columns
Add to print column
dfkdd_test.columns
Click the DOWN arrow to expand the codes
Add these 2 lines of code
from imblearn.over_sampling import RandomOverSampler
from collections import Counter
Change fit_sample() to fit_resample()
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)
print('Original dataset shape {}'.format(Counter(y)))
print('Resampled dataset shape {}'.format(Counter(y_res)))
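To see what random oversampling does to the class counts, here is a plain-Python sketch of the idea: rows from the smaller classes are duplicated at random until every class is as large as the biggest one. This is an illustration only, not imbalanced-learn's implementation; the function random_oversample and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen rows of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_res, y_res = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_res.append(rng.choice(rows))  # copy an existing row of this class
            y_res.append(label)
    return X_res, y_res

# Hypothetical imbalanced data: 4 Normal, 1 DoS, 1 Probe
X = [[0.1], [0.2], [0.3], [0.4], [0.9], [1.0]]
y = ["Normal", "Normal", "Normal", "Normal", "DoS", "Probe"]

X_res, y_res = random_oversample(X, y)
print('Original dataset shape {}'.format(Counter(y)))
print('Resampled dataset shape {}'.format(Counter(y_res)))
```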
I am using (11, 4) as the figure size, but you can use other dimensions
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier();
# fit random forest classifier on the training set
rfc.fit(X_res, y_res);
# extract important features
score = np.round(rfc.feature_importances_,3)
importances = pd.DataFrame({'feature':refclasscol,'importance':score})
importances = importances.sort_values('importance',ascending=False).set_index('feature')
# plot importances
plt.rcParams['figure.figsize'] = (11, 4)
importances.plot.bar();
My importance OUTPUT.
Note: your answer may be different
importance
feature
src_bytes 0.111
dst_host_srv_count 0.086
dst_bytes 0.071
service 0.067
logged_in 0.055
dst_host_serror_rate 0.043
dst_host_same_src_port_rate 0.038
dst_host_diff_srv_rate 0.038
! DO NOT USE THE CODE IN YOUR ASSIGNMENT !
USE THE FOLLOWING CODE INSTEAD
(Note: This process takes about 16 minutes!)
from sklearn.feature_selection import RFE
import itertools
rfc = RandomForestClassifier()
# create the RFE model and select 10 attributes
rfe = RFE(rfc, n_features_to_select=10)
rfe = rfe.fit(X_res, y_res)
# summarize the selection of the attributes
feature_map = [(i, v) for i, v in itertools.zip_longest(rfe.get_support(), refclasscol)]
selected_features = [v for i, v in feature_map if i==True]
My OUTPUT
['src_bytes',
'dst_bytes',
'logged_in',
'count',
'srv_count',
'dst_host_srv_count',
'dst_host_diff_srv_rate',
'dst_host_same_src_port_rate',
'dst_host_serror_rate',
'service']
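The selection step pairs each True/False flag from rfe.get_support() with the column name at the same position and keeps the names flagged True. A minimal sketch of that pairing, assuming a hypothetical five-column mask instead of the real RFE result:

```python
# Hypothetical support mask, shaped like the output of rfe.get_support()
support = [True, False, True, False, True]

# Hypothetical column names in the same order as the mask
refclasscol = ["src_bytes", "duration", "dst_bytes", "land", "service"]

# zip pairs each flag with its column name; keep the names whose flag is True
selected_features = [name for keep, name in zip(support, refclasscol) if keep]

print(selected_features)
```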
To create the sc_testdf, execute the creation codes on the top of the Colab
My OUTPUT
(22544, 41)
USE THIS CODE
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore',sparse=False)
Xresdf = pretrain
newtest = pretest
Xresdfnew = Xresdf[selected_features]
Xresdfnum = Xresdfnew.drop(['service'], axis=1)
Xresdfcat = Xresdfnew[['service']].copy()
Xtest_features = newtest[selected_features]
Xtestdfnum = Xtest_features.drop(['service'], axis=1)
Xtestcat = Xtest_features[['service']].copy()
# Fit train data
enc_fit=enc.fit(Xresdfcat)
enc=enc_fit
# Transform train data
X_train_1hotenc = enc.transform(Xresdfcat)
# Transform test data
X_test_1hotenc = enc.transform(Xtestcat)
X_train = np.concatenate((Xresdfnum.values, X_train_1hotenc), axis=1)
X_test = np.concatenate((Xtestdfnum.values, X_test_1hotenc), axis=1)
y_train = Xresdf[['attack_class']].copy()
c, r = y_train.values.shape
Y_train = y_train.values.reshape(c,)
y_test = newtest[['attack_class']].copy()
c, r = y_test.values.shape
Y_test = y_test.values.reshape(c,)
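The encoder is fitted on the training data only, so a service value that appears only in the test set has no column of its own; with handle_unknown='ignore' such a row is encoded as all zeros. The sketch below reproduces that behaviour with pandas alone (get_dummies plus reindex). It is a swapped-in technique for illustration, not the OneHotEncoder code above, and the service values are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column; "telnet" never appears in the training data
train = pd.DataFrame({"service": ["http", "ftp", "smtp", "http"]})
test = pd.DataFrame({"service": ["http", "telnet"]})

# One column per category seen in the training data
train_1hot = pd.get_dummies(train["service"], dtype=float)

# Align the test columns with the training columns;
# categories unseen in training end up as all-zero rows
test_1hot = pd.get_dummies(test["service"], dtype=float).reindex(
    columns=train_1hot.columns, fill_value=0.0)

print(list(train_1hot.columns))
print(test_1hot.values.tolist())
```

Keeping the train and test matrices on the same columns is what allows them to be concatenated with the numeric features and passed to the same model.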
from sklearn import metrics
models = []
#models.append(('SVM Classifier', SVC_Classifier))
models.append(('Naive Baye Classifier', BNB_Classifier))
models.append(('Decision Tree Classifier', DTC_Classifier))
#models.append(('RandomForest Classifier', RF_Classifier))
models.append(('KNeighborsClassifier', KNN_Classifier))
models.append(('LogisticRegression', LGR_Classifier))
#models.append(('VotingClassifier', VotingClassifier))
for i, v in models:
    scores = cross_val_score(v, X_train, Y_train, cv=10)
    accuracy = metrics.accuracy_score(Y_train, v.predict(X_train))
    confusion_matrix = metrics.confusion_matrix(Y_train, v.predict(X_train))
    classification = metrics.classification_report(Y_train, v.predict(X_train))
    print()
    print('============================== {} {} Model Evaluation =============================='.format(grpclass, i))
    print()
    print ("Cross Validation Mean Score:" "\n", scores.mean())
    print()
    print ("Model Accuracy:" "\n", accuracy)
    print()
    print("Confusion matrix:" "\n", confusion_matrix)
    print()
    print("Classification report:" "\n", classification)
    print()
for i, v in models:
    accuracy = metrics.accuracy_score(Y_test, v.predict(X_test))
    confusion_matrix = metrics.confusion_matrix(Y_test, v.predict(X_test))
    classification = metrics.classification_report(Y_test, v.predict(X_test))
    print()
    print('============================== {} {} Model Test Results =============================='.format(grpclass, i))
    print()
    print ("Model Accuracy:" "\n", accuracy)
    print()
    print("Confusion matrix:" "\n", confusion_matrix)
    print()
    print("Classification report:" "\n", classification)
    print()
Reading List
Additional Notes
https://www.usenix.org/legacy/event/sec08/tech/full_papers/provos/provos.pdf
https://jis-eurasipjournals.springeropen.com/articles/10.1186/s13635-017-0055-6
https://www.av-test.org/en/statistics/malware/
https://arxiv.org/pdf/1703.10926.pdf
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8949524
https://orca.cardiff.ac.uk/id/eprint/29469/2/2012AloseferYPhD.pdf
http://www.malgenomeproject.org/
https://arxiv.org/ftp/arxiv/papers/1607/1607.08166.pdf
Students should watch this video and raise any queries about the Lesson 10 lecture.
Practical
Check that the path in your code matches the location of your data. Note the 's' at the end of 'Malware_Datasets':
/Colab Notebooks/Datasets/Malware_Datasets/data.csv
dataset['legitimate'].value_counts()
X = dataset.drop('legitimate', axis = 1)
Y = dataset['legitimate']
Remember the name used. It is case-sensitive.
I am using the lower-case letter 'f' in feature_selection
feature_selection = ske.ExtraTreesClassifier().fit(X,Y)
My OUTPUT
ExtraTreesClassifier()
ADD
# import the libraries for feature selection
from sklearn.feature_selection import SelectFromModel
import sklearn.ensemble as ske
# Store SelectFromModel in a variable called model
model = SelectFromModel(feature_selection, prefit = True)
My OUTPUT
SelectFromModel(estimator=ExtraTreesClassifier(), prefit=True)
DON'T WORRY. This is just a warning message
/usr/local/lib/python3.7/dist-packages/sklearn/base.py:444: UserWarning: X has feature names, but SelectFromModel was fitted without feature names
f"X has feature names, but {self.__class__.__name__} was fitted without"
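The X_new output that follows comes from applying the selector to X, a call these notes do not show. The sketch below assumes it is the usual transform step and runs on synthetic data. The names X_arr, selector_forest, selector and feat_0 to feat_11 are hypothetical. Passing a NumPy array to transform, rather than a DataFrame, also avoids the feature-name warning above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the malware feature table (assumption)
X_arr, y_arr = make_classification(n_samples=300, n_features=12,
                                   n_informative=4, random_state=0)
feature_names = [f'feat_{i}' for i in range(12)]  # hypothetical column names

# Fit the tree ensemble, then wrap it as an already-fitted selector
selector_forest = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_arr, y_arr)
selector = SelectFromModel(selector_forest, prefit=True)

# Keep only the columns whose importance reaches the default threshold (the mean)
X_new_demo = selector.transform(X_arr)

# get_support() returns a boolean mask over the original columns
selected_names = [name for name, keep in zip(feature_names, selector.get_support()) if keep]

print(X_new_demo.shape)
print(selected_names)
```

Because ExtraTreesClassifier is randomised, the number of selected columns can change between runs unless random_state is fixed.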
My X_new OUTPUT
array([[3.44040000e+04, 2.40000000e+02, 8.22600000e+03, ...,
1.67360000e+04, 3.43726791e+00, 1.60000000e+01],
[3.44040000e+04, 2.40000000e+02, 8.22600000e+03, ...,
1.67360000e+04, 3.46549876e+00, 1.60000000e+01],
[3.44040000e+04, 2.40000000e+02, 8.22600000e+03, ...,
1.67360000e+04, 3.46647369e+00, 1.60000000e+01],
...,
[3.32000000e+02, 2.24000000e+02, 8.45000000e+03, ...,
0.00000000e+00, 5.02069508e+00, 1.60000000e+01],
[3.32000000e+02, 2.24000000e+02, 2.71000000e+02, ...,
3.27680000e+04, 5.26917350e+00, 0.00000000e+00],
[3.32000000e+02, 2.24000000e+02, 7.70000000e+02, ...,
1.34400000e+03, 4.91161452e+00, 0.00000000e+00]])
My X_new shape OUTPUT
(10539, 9)
ADD and CHANGE random_state = 42
# import the library for data split
from sklearn.model_selection import train_test_split
# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, Y_test = train_test_split(X_new, Y, test_size = 0.20, random_state = 42)
My X_train OUTPUT
array([[3.32000000e+02, 2.24000000e+02, 7.83000000e+02, ...,
0.00000000e+00, 5.51383493e+00, 0.00000000e+00],
[3.44040000e+04, 2.40000000e+02, 8.22600000e+03, ...,
1.67360000e+04, 7.98061409e+00, 1.60000000e+01],
[3.44040000e+04, 2.40000000e+02, 8.22600000e+03, ...,
1.67360000e+04, 4.87356955e+00, 1.60000000e+01],
...,
[3.32000000e+02, 2.24000000e+02, 2.71000000e+02, ...,
0.00000000e+00, 6.13220605e+00, 1.80000000e+01],
[3.44040000e+04, 2.40000000e+02, 8.22600000e+03, ...,
1.67360000e+04, 3.44729660e+00, 1.60000000e+01],
[3.32000000e+02, 2.24000000e+02, 2.58000000e+02, ...,
3.30880000e+04, 4.45728744e+00, 0.00000000e+00]])
My y_train OUTPUT
9217 0
2935 1
871 1
1263 1
1618 1
..
5734 0
5191 0
5390 0
860 1
7270 0
Name: legitimate, Length: 8431, dtype: int64
My X_test OUTPUT
array([[3.44040000e+04, 2.40000000e+02, 8.22600000e+03, ...,
1.67360000e+04, 5.18299414e+00, 1.70000000e+01],
[3.32000000e+02, 2.24000000e+02, 3.31670000e+04, ...,
3.27680000e+04, 5.90727483e+00, 1.30000000e+01],
[3.32000000e+02, 2.24000000e+02, 3.31670000e+04, ...,
3.30880000e+04, 7.97187350e+00, 1.50000000e+01],
...,
[3.32000000e+02, 2.24000000e+02, 3.31660000e+04, ...,
0.00000000e+00, 7.45655950e+00, 0.00000000e+00],
[3.32000000e+02, 2.24000000e+02, 2.71000000e+02, ...,
0.00000000e+00, 5.14388114e+00, 1.80000000e+01],
[3.32000000e+02, 2.24000000e+02, 3.31670000e+04, ...,
3.30880000e+04, 7.97187350e+00, 1.50000000e+01]])
My Y_test OUTPUT
518 1
439 1
7061 0
8141 0
8361 0
..
6326 0
10210 0
7474 0
4692 0
3799 0
Name: legitimate, Length: 2108, dtype: int64
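The next two outputs show X_train and X_test with values centred near zero, which is what feature scaling produces. The scaling code is not shown in these notes. The sketch below assumes standardisation with StandardScaler and uses a tiny made-up array; the names X_train_demo, X_test_demo and scaler are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up arrays for illustration (two feature columns)
X_train_demo = np.array([[332.0, 224.0],
                         [34404.0, 240.0],
                         [332.0, 224.0],
                         [34404.0, 240.0]])
X_test_demo = np.array([[332.0, 240.0]])

scaler = StandardScaler()

# Learn mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train_demo)

# Reuse the training statistics on the test data (no second fit)
X_test_scaled = scaler.transform(X_test_demo)

print(X_train_scaled)
print(X_test_scaled)
```

Fitting the scaler on the training set only keeps information from the test set out of the model.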
My X_train OUTPUT
array([[-0.66719497, -0.65670766, -0.56942055, ..., -1.54623436,
0.01357061, -1.50298017],
[ 1.49881226, 1.4646342 , 0.16038005, ..., -0.31072367,
1.31563293, 0.68421049],
[ 1.49881226, 1.4646342 , 0.16038005, ..., -0.31072367,
-0.32438645, 0.68421049],
...,
[-0.66719497, -0.65670766, -0.61962315, ..., -1.54623436,
0.33997103, 0.95760933],
[ 1.49881226, 1.4646342 , 0.16038005, ..., -0.31072367,
-1.07722898, 0.68421049],
[-0.66719497, -0.65670766, -0.62089782, ..., 0.89643878,
-0.54411639, -1.50298017]])
My X_test OUTPUT
array([[ 1.49881226, 1.4646342 , 0.16038005, ..., -0.31072367,
-0.16106007, 0.82090991],
[-0.66719497, -0.65670766, 2.60589354, ..., 0.87281525,
0.22124355, 0.27411224],
[-0.66719497, -0.65670766, 2.60589354, ..., 0.89643878,
1.31101931, 0.54751108],
...,
[-0.66719497, -0.65670766, 2.60579549, ..., -1.54623436,
1.03901646, -1.50298017],
[-0.66719497, -0.65670766, -0.61962315, ..., -1.54623436,
-0.18170544, 0.95760933],
[-0.66719497, -0.65670766, 2.60589354, ..., 0.89643878,
1.31101931, 0.54751108]])
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(max_leaf_nodes=None, random_state=0)
classifier.fit(X_train, y_train)
from graphviz import Source
from sklearn import tree
# feature_names must have one entry per selected column, so take the names from the selector mask
Source(tree.export_graphviz(classifier, out_file=None, feature_names=list(X.columns[model.get_support()]), filled=True))
ADD Code
y_pred = classifier.predict(X_test)
y_pred
My OUTPUT. Your result may be different!
array([1, 1, 0, ..., 0, 0, 0])
ADD Code
from sklearn.metrics import confusion_matrix, classification_report
cm = confusion_matrix(Y_test, y_pred)
print(cm)
print(classification_report(Y_test, y_pred))
My OUTPUT. Your result may be different
[[1376   18]
 [  26  688]]
              precision    recall  f1-score   support

           0       0.98      0.99      0.98      1394
           1       0.97      0.96      0.97       714

    accuracy                           0.98      2108
   macro avg       0.98      0.98      0.98      2108
weighted avg       0.98      0.98      0.98      2108
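As a worked check, the class 1 figures in the report can be reproduced by hand from the confusion matrix. In scikit-learn's layout, rows are true labels and columns are predicted labels. The variable names below are illustrative only.

```python
import numpy as np

# Confusion matrix from the decision tree output above
cm_demo = np.array([[1376, 18],
                    [26, 688]])

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP (class 1 as positive)
tn, fp, fn, tp = cm_demo.ravel()

precision_1 = tp / (tp + fp)            # 688 / 706
recall_1 = tp / (tp + fn)               # 688 / 714
f1_1 = 2 * tp / (2 * tp + fp + fn)      # 1376 / 1420
accuracy = (tp + tn) / cm_demo.sum()    # 2064 / 2108

print(f"precision={precision_1:.2f} recall={recall_1:.2f} "
      f"f1={f1_1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.97 recall=0.96 f1=0.97 accuracy=0.98
```

These match the class 1 row and the accuracy line of the report.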
ADD CODE
from sklearn.neighbors import KNeighborsClassifier
classifier_KNN = KNeighborsClassifier(n_neighbors = 3, metric = 'minkowski', p = 2)
classifier_KNN.fit(X_train, y_train)
print("KNN Model has been trained")
ADD CODE
y_pred_KNN = classifier_KNN.predict(X_test)
y_pred_KNN
My OUTPUT
array([1, 1, 0, ..., 0, 0, 0])
ADD CODE
from sklearn.metrics import confusion_matrix, classification_report
cm = confusion_matrix(Y_test, y_pred_KNN)
print(cm)
print(classification_report(Y_test, y_pred_KNN))
My OUTPUT.
Your output may be different
[[1377   17]
 [  27  687]]
              precision    recall  f1-score   support

           0       0.98      0.99      0.98      1394
           1       0.98      0.96      0.97       714

    accuracy                           0.98      2108
   macro avg       0.98      0.97      0.98      2108
weighted avg       0.98      0.98      0.98      2108
Reading List
Additional Notes
https://www.usenix.org/system/files/conference/cset18/cset18-paper-marchal.pdf
https://ieeexplore.ieee.org/document/6289402
https://dl.acm.org/doi/pdf/10.1145/3025453.3025831
https://www.ripublication.com/ijaer19/ijaerv14n9_15.pdf
https://docs.apwg.org/reports/apwg_trends_report_q1_2020.pdf
https://koreascience.kr/article/JAKO201310554376257.pdf
https://www.usenix.org/system/files/sec19-cidon.pdf
https://arxiv.org/pdf/2009.11116.pdf
https://cs229.stanford.edu/proj2012/ZhangYuan-PhishingDetectionUsingNeuralNetwork.pdf
http://www.neuro.nigmatec.ru/materials/themeid_17/riedmiller93direct.pdf
https://core.ac.uk/reader/30732240
https://ietresearch.onlinelibrary.wiley.com/doi/pdfdirect/10.1049/iet-ifs.2013.0202
Practical
The code and data are based on the following GitHub repository:
ADD Code
%matplotlib inline
Change 'Labels'
dataset['Labels'].value_counts()
Change 'Labels'
features = dataset.drop('Labels', axis =1) #dropping the labels and saving the features
Labels = dataset["Labels"]
Change 'shapem' to 'shape' (the corrected line is shown):
features.shape
from sklearn.model_selection import cross_validate #import additional library for cross validation
scores = cross_validate(Attack_Detection_Model, X_train, Y_train, cv=10, return_train_score=False)
print(scores) #print the scores
My OUTPUT.
Your output may be different
{'fit_time': array([0.00843048, 0.00527406, 0.00514007, 0.00520182, 0.00562763,
       0.00496912, 0.00503826, 0.00524092, 0.00513697, 0.00551486]),
 'score_time': array([0.00237703, 0.00195217, 0.00189662, 0.00204945, 0.00196123,
       0.00186229, 0.00186372, 0.00190568, 0.00188684, 0.00202179]),
 'test_score': array([0.96 , 0.94 , 0.965, 0.97 , 0.905, 0.935, 0.95 , 0.95 , 0.97 ,
       0.96 ])}
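The ten fold scores are easier to compare when summarised as a mean and a spread. A minimal sketch using the test_score values printed above follows; in the notebook, the same array would be scores['test_score'].

```python
import numpy as np

# The ten fold accuracies copied from the output above
test_scores = np.array([0.96, 0.94, 0.965, 0.97, 0.905,
                        0.935, 0.95, 0.95, 0.97, 0.96])

mean_score = test_scores.mean()   # average accuracy across folds
std_score = test_scores.std()     # fold-to-fold variation

print(f"Mean accuracy: {mean_score:.4f} (std {std_score:.4f})")
print(f"Lowest fold: {test_scores.min():.3f}, highest fold: {test_scores.max():.3f}")
```

For these values the mean is 0.9505, with individual folds ranging from 0.905 to 0.97.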
Import graphviz first, then change 'Graph' on the last line to 'graph' (lower-case 'g').
Remember that names are case-sensitive.
import graphviz
# DOT data
dot_data = tree.export_graphviz(Attack_Detection_Model, out_file=None,
                                feature_names=X_train.columns,
                                filled=True)
# Draw graph
graph = graphviz.Source(dot_data, format="png")
graph
# Use the trained classifier to make predictions on the test data
predictions = Attack_Detection_Model.predict(X_test)
print(predictions)
print("Predictions on testing data computed.")
My OUTPUT
[-1 -1 1 ... 1 -1 -1]
Predictions on testing data computed.
from sklearn.metrics import confusion_matrix, classification_report # metrics for evaluating the predictions
print(confusion_matrix(Y_test, predictions))
print(classification_report(Y_test, predictions))
My OUTPUT
[[3462  563]
 [ 306 4724]]
              precision    recall  f1-score   support

          -1       0.92      0.86      0.89      4025
           1       0.89      0.94      0.92      5030

    accuracy                           0.90      9055
   macro avg       0.91      0.90      0.90      9055
weighted avg       0.90      0.90      0.90      9055
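With labels -1 and 1, it helps to know which row of the confusion matrix belongs to which class. scikit-learn sorts the labels, so -1 comes first; passing labels explicitly makes the order visible in the code. The toy vectors below are made up purely to illustrate the layout.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels, using the same -1 / 1 encoding
y_true_demo = [-1, -1, 1, 1, 1]
y_pred_demo = [-1, 1, 1, 1, -1]

# Row i = true label labels[i], column j = predicted label labels[j]
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=[-1, 1])
print(cm_demo)
# prints:
# [[1 1]
#  [1 2]]
```

Read this way, the first row of the output above covers the 4025 samples whose true label is -1, and the second row the 5030 samples whose true label is 1.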