"Intrusion detection is one major research problem in network security, whose aim is to identify unusual access or attacks to secure internal networks. In literature, intrusion detection systems have been approached by various machine learning techniques. However, there is no a review paper to examine and understand the current status of using machine learning techniques to solve the intrusion detection problems." [1]
The main goal of this project is to use Machine Learning algorithms to predict the type of attack using the KDD’99 dataset.
We will create a model using Neural Networks, which should help us optimize the accuracy of our results, and compare it with other existing models such as SVM or Logistic Regression. If we are able to create the model on time, we will create a second model using Random Forest and compare its results with our first model.
To our knowledge, there is currently no Neural Network model for this data set. Therefore, the biggest challenge I will face is the "Black Box" nature of the model. In short, a Neural Network is a black box in the sense that, while it can approximate any function, studying its structure won't give you any insight into the structure of the function being approximated [6].
For example, a bank may avoid using a Neural Network to predict whether a customer is creditworthy, because the bank needs to explain why a customer was denied a loan. A Neural Network outputs only yes or no, whereas the bank needs the "Reasons Behind" the decision in order to explain it to the customer.
By using a Neural Network, this project aims to develop a predictive model to determine the type of attack being conducted within an Intrusion Detection System.
KDD’99 has been the most popular data set used for the evaluation of anomaly detection methods. This data set was prepared by Stolfo et al. [4] and is built on the data captured in the DARPA’98 IDS evaluation program [5]. DARPA’98 is about 4 gigabytes of compressed raw (binary) tcpdump data from 7 weeks of network traffic.
The KDD’99 data set is used to evaluate the proposed model; it has 494,021 rows and 41 columns.
The attacks fall into 4 different categories [4]:
Denial of Service Attack (DoS): It is an attack in which the attacker makes some computing or memory resource too busy or too full to handle legitimate requests, or denies legitimate users access to a machine.
User to Root Attack (U2R): It is a class of exploit in which the attacker starts out with access to a normal user account on the system (perhaps gained by sniffing passwords, a dictionary attack, or social engineering) and is able to exploit some vulnerability to gain root access to the system.
Remote to Local Attack (R2L): It occurs when an attacker who can send packets to a machine over a network, but who does not have an account on that machine, exploits some vulnerability to gain local access as a user of that machine.
Probing Attack: It is an attempt to gather information about a network of computers for the apparent purpose of circumventing its security controls.
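The four categories above can be expressed as a lookup from raw connection labels to a category. The following is a minimal sketch, not the code used in this project: the individual attack names follow the commonly published KDD’99 training label list and are an assumption here, since this write-up does not enumerate them. The helper name to_category is hypothetical.

```python
# Lookup from raw KDD'99 labels to the four attack categories.
# Assumption: attack names follow the commonly published KDD'99 training labels.
ATTACK_CATEGORIES = {
    "DoS": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    "U2R": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
    "R2L": ["ftp_write", "guess_passwd", "imap", "multihop", "phf",
            "spy", "warezclient", "warezmaster"],
    "Probe": ["ipsweep", "nmap", "portsweep", "satan"],
}

# Invert the table so each raw label points at its category.
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in ATTACK_CATEGORIES.items()
    for label in labels
}


def to_category(raw_label: str) -> str:
    """Map a raw label such as 'smurf.' to its category (hypothetical helper)."""
    # Raw labels in the data files often carry a trailing period.
    label = raw_label.strip().rstrip(".").lower()
    if label == "normal":
        return "Normal"
    return LABEL_TO_CATEGORY.get(label, "Unknown")
```

Applying such a function to the label column reduces the many raw attack names to five classes (the four categories plus normal traffic), which is one way to define the prediction target.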
With the introduction of Big Data, it has become more difficult for traditional techniques to handle the volume of data. Therefore, many researchers use Big Data techniques to build fast and accurate Intrusion Detection Systems. In this section, we look at how two techniques were used to improve the accuracy of these systems.
Spark Chi SVM. The authors use ChiSqSelector for feature selection and build a model using a Support Vector Machine (SVM) [2].
K-Means. This approach uses Mini Batch K-means combined with principal component analysis (PCA) [3].
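To make the two approaches concrete, here is a rough single-machine analogue using scikit-learn on synthetic data. This is not the implementation from [2] or [3], which target Big Data frameworks; the data, the number of selected features, and the number of clusters are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for connection features (assumption, not KDD'99 data).
rng = np.random.default_rng(0)
X = rng.random((300, 20))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Approach 1: chi-squared feature selection followed by a linear SVM.
# chi2 needs non-negative inputs, hence the MinMaxScaler in front.
chi_svm = make_pipeline(MinMaxScaler(), SelectKBest(chi2, k=5), LinearSVC())
chi_svm.fit(X, y)
svm_accuracy = chi_svm.score(X, y)  # training accuracy, for illustration only

# Approach 2: PCA for dimensionality reduction, then Mini Batch K-means.
pca_kmeans = make_pipeline(
    PCA(n_components=5),
    MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0),
)
cluster_ids = pca_kmeans.fit_predict(X)  # one cluster id per row
```

The first pipeline is supervised and needs labels; the second is unsupervised and only groups similar connections, so its clusters still have to be interpreted as normal or attack traffic afterwards.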
In the second part of my capstone project, we will explore the dataset, its correlations and distributions, and some important decisions for the model we will implement.
Correlation Matrix
New Categories
Basic Neural Network Model
During the data exploration, I examined the Correlation Matrix, which shows how each column of the data correlates with every other column. In our case, a few columns are correlated, but more than 70% of the columns have no correlation with other columns in our dataset.
For a Neural Network, this does NOT affect our model. For example, if we had a linearly correlated dataset, we would just need a simple model like linear regression; in that case, even the best CNN would be a poor choice [10].
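As an illustration of how such a check might look, the sketch below computes a correlation matrix with pandas and measures the share of columns that have no strongly correlated partner. The toy data, the column names, and the 0.7 cutoff are assumptions for the example, not values taken from this project.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): "b" is built from "a", "c" and "d" are independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "a": base,
    "b": base * 2.0 + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
    "d": rng.normal(size=200),
})

# Absolute Pearson correlation between every pair of columns.
corr = df.corr().abs()

# Mask the diagonal so a column's correlation with itself is ignored.
off_diag = corr.where(~np.eye(len(corr), dtype=bool))

THRESHOLD = 0.7  # hypothetical cutoff for "correlated"
has_strong_partner = off_diag.max() > THRESHOLD

# Fraction of columns with no strongly correlated partner.
uncorrelated_share = 1.0 - has_strong_partner.mean()
```

On real data, constant columns produce undefined (NaN) correlations, so they may need to be dropped or handled before a summary like this is meaningful.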
KDD 99 Dataset Correlation
Columns with multiple Categories
Data exploration showed that some columns contain multiple categories, and these columns would need to be transformed to numerical values. I had to decide between keeping the columns and encoding their values OR dropping these columns.
Keeping Columns
Keeping the columns might have a greater positive impact on the model, because the Neural Network would get more inputs for the neurons inside the model.
Requires more code to encode the categories of these columns
Dropping Columns
It might have a negative impact on the final accuracy of the model
Requires less code, but leaves fewer columns to compare inside the model
These columns contain the following:
Service has 64 unique categories
Flag has 11 unique categories
Protocol Type has 3 unique categories
After encoding the categories, we ended up with a new dataset that contains 119 columns, almost three times as many as the original.
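The encoding step can be sketched with one-hot encoding in pandas. This is a minimal example on four made-up rows, assuming the categorical columns are named protocol_type, service, and flag; it is not the project's actual preprocessing code.

```python
import pandas as pd

# Four made-up rows (assumption) with two numeric and three categorical columns.
df = pd.DataFrame({
    "duration": [0, 2, 0, 5],
    "protocol_type": ["tcp", "udp", "icmp", "tcp"],
    "service": ["http", "domain_u", "ecr_i", "smtp"],
    "flag": ["SF", "SF", "SF", "REJ"],
    "src_bytes": [181, 105, 1032, 0],
})
categorical = ["protocol_type", "service", "flag"]

# One-hot encode: each category becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=categorical, dtype=int)

# Width after encoding = original width - encoded columns + total categories.
expected_width = (
    df.shape[1]
    - len(categorical)
    + sum(df[col].nunique() for col in categorical)
)
```

The same width formula explains why the dataset grows so much: every categorical column is replaced by as many columns as it has unique values.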
Images of the data before and after encoding
Neural Network Model with 3 Layers
Neural Network Model with 5 Layers
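The two network sizes named above could be sketched as follows. This example uses scikit-learn's MLPClassifier on synthetic data and assumes that "layers" refers to hidden layers; the layer widths, the iteration limit, and the data are hypothetical and are not the configuration used in this project.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data with a non-linear target (assumption, not KDD'99 data).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Neural networks usually train better on standardized inputs.
X = StandardScaler().fit_transform(X)

# Hypothetical layer widths: three hidden layers versus five hidden layers.
three_layer = MLPClassifier(
    hidden_layer_sizes=(64, 32, 16), max_iter=300, random_state=0
)
five_layer = MLPClassifier(
    hidden_layer_sizes=(64, 64, 32, 32, 16), max_iter=300, random_state=0
)

three_layer.fit(X, y)
five_layer.fit(X, y)

# Training accuracy only; a held-out test set is needed for a fair comparison.
three_layer_acc = three_layer.score(X, y)
five_layer_acc = five_layer.score(X, y)
```

A deeper network has more capacity but also more parameters to fit, so any gain from the extra layers should be confirmed on data the model has not seen.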
In the third and last part of the capstone project, we will compare two models against our Neural Network model.
Decision Tree
Random Forest
Neural Network
Results & Conclusions
Decision Tree got an accuracy of 77%
Random Forest got an accuracy of 94%
Neural Network got an accuracy of 97-99%
Random forest leverages the power of multiple decision trees.
It does not rely on the feature importance given by a single decision tree.
In our case, it performs better than the single Decision Tree because of the non-linear and complex relationships in our dataset.
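The relationship between a single tree and a forest can be illustrated with scikit-learn. The sketch below trains both on synthetic data with a non-linear target; the data, the split, and the number of trees are assumptions, and the accuracies it produces say nothing about the KDD’99 results reported here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with a non-linear target (assumption, not KDD'99 data).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))
y = ((X[:, 0] ** 2 + X[:, 1] * X[:, 2]) > 0.5).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# One tree versus an ensemble of 100 trees.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    X_train, y_train
)

tree_acc = tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)

# Average the per-tree importances instead of trusting one tree's ranking.
mean_importance = np.mean(
    [t.feature_importances_ for t in forest.estimators_], axis=0
)
```

Each tree in the forest sees a different bootstrap sample and a random subset of features at each split, so averaging their votes and importances tends to be more stable than relying on one tree.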
Decision Tree
Random Forest
Neural Network
Logistic Regression 85%
Decision Tree 77%
Random Forest 93-95%
Neural Network 3 layers 97-98%
Neural Network 5 layers 98-99%
After creating our three different models, we concluded that the Neural Network model outperforms our Decision Tree model, our Random Forest model, and possibly a Logistic Regression model. Therefore, using this Neural Network model, we can argue that we can predict the type of attack in the KDD Cup 99 dataset with 98-99% accuracy.
The Neural Network model outperforms the other models because:
It had the ability to learn and model the non-linear and complex relationships in our dataset
It had the ability to take in many inputs and process them to infer hidden, complex, non-linear relationships
It had the ability to learn by itself and produce the output that is not limited to the input provided
It does not impose any restrictions on the input variables such as how they were distributed or handled
Othman, S.M., Ba-Alwi, F.M., Alsohybe, N.T. et al. Intrusion detection model using machine learning algorithm on Big Data environment. J Big Data 5, 34 (2018). https://doi.org/10.1186/s40537-018-0145-4
McDonald, Conor. “Machine Learning Fundamentals (II): Neural Networks.” Medium, Towards Data Science, 29 Dec. 2017, towardsdatascience.com/machine-learning-fundamentals-ii-neural-networks-f1e7b2cb3eef.
Lauterbach, Chad. “Intrusion Detection, Intrusion Prevention, and Antivirus: The Differences.” Be Structured Technology Group, 6 Mar. 2020, www.bestructured.com/intrusion-detection-intrusion-prevention-and-antivirus-the-differences/
T. Kohonen, "Correlation Matrix Memories," in IEEE Transactions on Computers, vol. C-21, no. 4, pp. 353-359, April 1972, doi: 10.1109/TC.1972.5008975.