Bryan Solis

Intrusion Detection with Machine Learning

Part I. Introduction

Lauterbach, Chad. Intrusion Prevention Systems (IPS), 6 Mar. 2020 [9]

What is Intrusion Detection?

"Intrusion detection is one major research problem in network security, whose aim is to identify unusual access or attacks to secure internal networks. In literature, intrusion detection systems have been approached by various machine learning techniques. However, there is no a review paper to examine and understand the current status of using machine learning techniques to solve the intrusion detection problems." [1]

Primary Goal & Description

The main goal of this project is to use Machine Learning algorithms to successfully predict the type of attacks based on the KDD’99 dataset.

We will create a model using Neural Networks, which should help us optimize the accuracy of our results, and we will compare it with other existing models such as SVM or Logistic Regression. If time allows, we will create a second model using Random Forest and compare its results with our first model.
To my knowledge, there is currently no Neural Network model for this data set. The biggest challenge I will face is therefore the "Black Box" nature of Neural Networks. In short, a Neural Network is a black box in the sense that, while it can approximate any function, studying its structure won't give you any insight into the structure of the function being approximated [6].
- For example, banks tend not to use Neural Networks to predict whether a customer is creditworthy, because the bank needs to explain why the customer didn't get a certain loan. A Neural Network will only output yes or no, while the bank needs to see the "Reasons Behind" the decision in order to explain it to the customer.

By using a Neural Network, this project aims to develop a predictive model that determines the type of attack being conducted inside an Intrusion Detection System.

McDonald, Conor, Machine Learning Fundamentals. 29 December 2017 [7].

Dataset

KDD’99 has been the most popular data set used for the evaluation of anomaly detection methods. This data set was prepared by Stolfo et al. [4] and is built on the data captured in the DARPA’98 IDS evaluation program [5]. DARPA’98 is about 4 gigabytes of compressed raw (binary) tcpdump data covering 7 weeks of network traffic.

The KDD’99 data set is used to evaluate the proposed model, and it has 494,021 rows and 41 columns.
The attacks fall into 4 different categories [4]:
- Denial of Service Attack (DoS): It is an attack in which the attacker makes some computing or memory resource too busy or too full to handle legitimate requests, or denies legitimate users access to a machine.
- User to Root Attack (U2R): It is a class of exploit in which the attacker starts out with access to a normal user account on the system (perhaps gained by sniffing passwords, a dictionary attack, or social engineering) and is able to exploit some vulnerability to gain root access to the system.
- Remote to Local Attack (R2L): It occurs when an attacker who has the ability to send packets to a machine over a network, but who does not have an account on that machine, exploits some vulnerability to gain local access as a user of that machine.
- Probing Attack: It is an attempt to gather information about a network of computers for the apparent purpose of circumventing its security controls.
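
To make the four categories concrete, here is a minimal sketch of how the individual attack labels in the KDD’99 training data could be grouped. The grouping follows the commonly cited assignment of the 22 training attack names; the function name `categorize` and the "unknown" fallback are assumptions made for illustration, not part of the original project code.

```python
# Commonly cited grouping of the 22 KDD'99 training attack labels.
ATTACK_CATEGORIES = {
    "dos": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    "u2r": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
    "r2l": ["ftp_write", "guess_passwd", "imap", "multihop",
            "phf", "spy", "warezclient", "warezmaster"],
    "probe": ["ipsweep", "nmap", "portsweep", "satan"],
}

# Invert the grouping so each label points to its category.
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in ATTACK_CATEGORIES.items()
    for label in labels
}


def categorize(raw_label: str) -> str:
    """Map a raw label such as 'smurf.' to one of the four categories.

    Hypothetical helper: 'normal' traffic is passed through unchanged and
    labels outside the table are reported as 'unknown'.
    """
    label = raw_label.strip().rstrip(".").lower()
    if label == "normal":
        return "normal"
    return LABEL_TO_CATEGORY.get(label, "unknown")


for example in ["smurf.", "rootkit.", "guess_passwd.", "nmap.", "normal."]:
    print(example, "->", categorize(example))
```

Collapsing the labels this way turns the task into a five-class problem (four attack categories plus normal traffic), which is one possible reading of "predicting the type of attack".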

Direct Link to Download Data Set

Related Work

What has been done before?

With the introduction of Big Data, traditional techniques have become harder to apply at scale. Therefore, many researchers use Big Data techniques to build fast and accurate Intrusion Detection Systems. In this section, we will see how two techniques were used to improve the accuracy of these systems.

Spark Chi SVM. The authors use ChiSqSelector for feature selection and build a model using Support Vector Machine (SVM) [2].
K-Means. The authors use Mini Batch K-means combined with principal component analysis (PCA) [3].

Top Machine Learning Algorithms

Mishra, Alok, Top Machine Learning Algorithms. January 2020. [8]

Part II. Data Exploration & Model Construction

Delivery II.

In this second part of my capstone project, we will explore the dataset, its correlations and distributions, and some important decisions for the model we will implement.

- Correlation Matrix
- New Categories
- Basic Neural Network Model

Correlation Matrix

During the data exploration, I checked the Correlation Matrix, which shows how one set of data may correspond (correlate) to another set. In our case, a few columns are correlated, but more than 70% of the columns do not have a correlation with any other column in our dataset.

- For a Neural Network, this does NOT affect our model. For example, if we had a linearly correlated dataset, we would just need a simple model like linear regression, and even the best CNN would give us a poor result by comparison [10].
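
As a rough illustration of the check described above, the sketch below computes a correlation matrix with pandas and reports the share of columns that have no strong correlation with any other column. The data is synthetic, and both the helper name `weakly_correlated_share` and the 0.5 cutoff are assumptions; the original analysis may have used a different threshold.

```python
import numpy as np
import pandas as pd


def weakly_correlated_share(df: pd.DataFrame, threshold: float = 0.5) -> float:
    """Share of numeric columns whose strongest |correlation| with any
    other column stays below `threshold` (the threshold is an assumption)."""
    corr = df.select_dtypes("number").corr().abs().to_numpy().copy()
    corr = np.nan_to_num(corr)    # constant columns produce NaN correlations
    np.fill_diagonal(corr, 0.0)   # ignore each column's correlation with itself
    strongest = corr.max(axis=1)
    return float((strongest < threshold).mean())


# Synthetic stand-in for the dataset: 'b' depends on 'a', 'c' and 'd' are independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
    "d": rng.normal(size=500),
})

share = weakly_correlated_share(toy)
print(f"{share:.0%} of columns have no strong correlation with another column")
```

On the real dataset, the same function could be applied to the loaded DataFrame to see how the share compares with the figure quoted above.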

KDD 99 Dataset Correlation

Distribution of Categories

Columns with multiple Categories

Keep it OR Drop it?

Data exploration showed that some columns hold multiple categories, and these columns would need to be transformed to numerical values. I had to decide between keeping the columns and encoding their values OR dropping these columns.

Keeping Columns
- Keeping them might have a greater impact on the model, because the Neural Network would get more inputs for its neurons.
- More code to encode the categories of these columns
Dropping Columns
- It might have a negative impact on the final accuracy of the model
- Less code, but fewer columns to compare inside the model

Dataset after keeping the Categories

The information found inside these columns is the following:

- Service has 64 unique categories
- Flag has 11 unique categories
- Protocol Type has 3 unique categories
- After encoding the categories, we ended up with a new dataset that contains 119 columns, which is well over twice as many as the original
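
The effect of encoding on the column count can be seen in a small sketch using one-hot encoding with `pandas.get_dummies`. The rows below are made up, the column names `protocol_type`, `service` and `flag` are assumed to match the standard KDD’99 feature names, and one-hot encoding itself is an assumption about how the categories were encoded.

```python
import pandas as pd

# Toy rows only; the real dataset has far more categories per column.
toy = pd.DataFrame({
    "duration": [0, 2, 0, 5],
    "protocol_type": ["tcp", "udp", "icmp", "tcp"],
    "service": ["http", "domain_u", "ecr_i", "smtp"],
    "flag": ["SF", "SF", "SF", "REJ"],
})

# Each categorical column is replaced by one indicator column per category.
encoded = pd.get_dummies(toy, columns=["protocol_type", "service", "flag"])

print("columns before:", toy.shape[1])
print("columns after: ", encoded.shape[1])
print(list(encoded.columns))
```

Each categorical column contributes as many new columns as it has distinct values, which is why the encoded dataset is so much wider than the original.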

Images of encoded before and after

Basic Model with Neural Network

Neural Network with 3 layers, 25 Epochs and Batch Size of 20

Final Accuracy 98.64%

Neural Network Model with 3 Layers

Neural Network with 5 layers, 25 Epochs and Batch Size of 20

Final Accuracy 99.02%

Neural Network Model with 5 Layers
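
The two configurations above can be sketched with scikit-learn's `MLPClassifier`. This is not the original implementation: the framework, layer widths, and activation used in the project are not stated, so everything below is an assumption except the 25 epochs and the batch size of 20. Here "3 layers" and "5 layers" are read as the number of weight layers (hidden layers plus the output layer), and the data is a synthetic stand-in for the encoded KDD’99 features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the KDD'99 dataset).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Neural networks train more reliably on standardized inputs.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)


def build_mlp(hidden_layers):
    """Hypothetical helper: with the default 'adam' solver, max_iter is the
    number of epochs, so this mirrors 25 epochs with a batch size of 20."""
    return MLPClassifier(hidden_layer_sizes=hidden_layers, batch_size=20,
                         max_iter=25, random_state=0)


# 2 hidden + 1 output = 3 weight layers; 4 hidden + 1 output = 5 weight layers.
shallow = build_mlp((64, 32)).fit(X_train, y_train)
deep = build_mlp((64, 64, 32, 32)).fit(X_train, y_train)

print("3-layer test accuracy:", shallow.score(X_test, y_test))
print("5-layer test accuracy:", deep.score(X_test, y_test))
```

With only 25 epochs scikit-learn may emit a convergence warning; that is expected here, and the accuracies on this toy data say nothing about the figures reported above.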

Part III. New Models, Results and Conclusions

Delivery III.

This is the third and last part of the capstone project, where we will compare two models against our Neural Network model.

- Decision Tree
- Random Forest
- Neural Network
- Results & Conclusions

Decision Tree vs Random Forest vs Neural Network

Classification Report for the three Models: DT vs RF vs NN

Accuracies

- Decision Tree got an accuracy of 77%
- Random Forest got an accuracy of 94%
- Neural Network got an accuracy of 97-99%
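
A comparison like the one above could be set up as in the following sketch, which trains a Decision Tree and a Random Forest on the same split and records their test accuracy. The data is synthetic and the hyperparameters are scikit-learn defaults chosen for illustration, so the numbers it prints are not the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the KDD'99 dataset).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Assumed settings: default hyperparameters, fixed seeds for repeatability.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {accuracies[name]:.2%}")
```

Using the same train/test split for every model keeps the comparison fair; `sklearn.metrics.classification_report` can be called on the same predictions to get per-class precision and recall.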

Why does RF perform better than a Decision Tree?

- Random forest leverages the power of multiple decision trees.
- It does not rely on the feature importance given by a single decision tree.

Why does NN perform better than a Random Forest?

- In our case, it performs better because of the non-linear and complex relationships in our dataset.

Visualizations

Decision Tree

Decision Tree Visualization

Random Forest

Random Forest Tree and Different Estimators' tree

Neural Network

Neural Network 3 layers

Results

Models used and Tested

Accuracies for each Model

Logistic Regression 85%
Decision Tree 77%
Random Forest 93-95%
Neural Network 3 layers 97-98%
Neural Network 5 layers 98-99%

Conclusion

After creating our three different models, we reached the conclusion that the Neural Network model outperforms our Decision Tree model, our Random Forest model, and possibly a Logistic Regression model. Therefore, using this NN model, we can argue that we can predict the type of attack in the KDD-Cup 99 dataset with 98-99% accuracy.

The Neural Network Model outperforms the other models because:

- It had the ability to learn and model the non-linear and complex relationships in our dataset
- It had the ability to take in a lot of inputs and process them to infer hidden as well as complex, non-linear relationships
- It had the ability to learn by itself and produce output that is not limited to the input provided
- It does not impose any restrictions on the input variables, such as how they were distributed or handled