Attack Detection Methods For Network Security Issues
PHASE-1
INTRODUCTION
With the development of fifth-generation networks and artificial intelligence technologies, new threats and challenges have emerged for wireless communication systems, especially in cybersecurity. Attack detection methods that leverage the strength of deep learning techniques can be built. Specifically, the plan is to summarize the fundamental problems of network security and attack detection and to introduce several successful related applications that use deep learning structures. Based on a categorization of deep learning methods, primary importance is given to attack detection methods built on different kinds of architectures, such as autoencoders and recurrent neural networks.
Names Of The Attacks Which Need To Be Classified And Predicted:
Brute Force -Web : BruteForceWeb
Brute Force -XSS : BruteForceXSS
SQL Injection : SQLInjection
Brute Force -Web: A brute force attack uses trial and error to guess login credentials or encryption keys, or to find a hidden web page. Hackers work through all possible combinations hoping to guess correctly. These attacks are carried out by 'brute force', meaning they use excessive, forceful attempts to try to 'force' their way into your private account.
Brute Force -XSS – Cross-site scripting (XSS): Cross-site scripting attacks disrupt the interaction between users and the vulnerable application. They are based on client-side code injection: the attacker inserts malicious scripts into a legitimate application to alter its original intention. These attacks are common in web applications written in JavaScript, CSS, VBScript, ActiveX, and Flash.
SQL Injection: SQL injection is a code injection technique that can destroy a database. It is one of the most common web hacking techniques: the placement of malicious code in SQL statements via web page input.
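As a minimal, self-contained Python sketch (the users table and the form input here are hypothetical), the difference between concatenating user input into a query and parameterizing it looks like this:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'secret')")

user_input = "' OR '1'='1"  # malicious web-form input

# Unsafe: the input is concatenated into the SQL statement,
# so the injected OR '1'='1' clause matches every row.
unsafe = "SELECT * FROM users WHERE name = '" + user_input + "'"
print(conn.execute(unsafe).fetchall())  # returns all users

# Safe: a parameterized query treats the input as data, not code.
safe = "SELECT * FROM users WHERE name = ?"
print(conn.execute(safe, (user_input,)).fetchall())  # returns []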
Content Of The Dataset:
This dataset contains 80 features. Each instance holds the information of an IP flow generated by a network device: source and destination IP addresses, ports, inter-arrival times, and the layer 7 (application) protocol used on that flow as the class, among others. Most of the attributes are numeric, but there are also nominal types and a date type for the Timestamp attribute.
Link to the Dataset: https://drive.google.com/file/d/1HDeyt92iaXASdQ_CsuZbwpSX7yCJEkv0/view?usp=sharing
PHASE 2:
Data Pre-processing with EDA:
1. Pandas DataFrame.fillna() is used to replace null values in a DataFrame. Our dataset's CSV file has null values, which are displayed as NaN in the DataFrame. Just as the pandas dropna() method removes null values from a DataFrame, fillna() lets the user replace NaN values with a value of their own choosing.
2. The dataset has many inf values as well. The simplest way to handle infinity values is to first replace them with NaN, df.replace([np.inf, -np.inf], np.nan), and then use the first method to replace the NaN values; a short sketch follows the snapshot below.
Snapshot of Original DataFrame:
0 1000.000000
1 2000.000000
2 3000.000000
3 -4000.000000
4 inf
5 -inf
After replacing infinite values with NaN:
0 1000.0
1 2000.0
2 3000.0
3 -4000.0
4 NaN
5 NaN
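A minimal sketch of these two steps in pandas, reproducing the snapshot above with a toy single-column DataFrame (filling with the column mean is just one possible choice):

import numpy as np
import pandas as pd

df = pd.DataFrame({"col": [1000.0, 2000.0, 3000.0, -4000.0, np.inf, -np.inf]})

# Step 2: replace +/-inf with NaN so they can be handled uniformly.
df = df.replace([np.inf, -np.inf], np.nan)

# Step 1: replace the remaining NaN values, e.g. with the column mean.
df = df.fillna(df["col"].mean())
print(df)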
3. Transforming the prediction target variable (y). These transformers are not intended to be used on features, only on supervised learning targets (class labels). In other words, this transformer should be used to encode target values, i.e. y, and not the input X. Encode target labels with values between 0 and n_classes-1 using the API below:
class sklearn.preprocessing.LabelEncoder
Cyber attack types in the dataset:
I have chosen to encode the text values by assigning a running sequence to each text value, as below:
'Brute Force -Web': 0
'Brute Force -XSS': 1
'SQL Injection': 2
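A minimal sketch of this encoding with scikit-learn; with these three label strings, LabelEncoder's alphabetical ordering happens to produce exactly the mapping above:

from sklearn.preprocessing import LabelEncoder

labels = ["Brute Force -Web", "Brute Force -XSS", "SQL Injection", "Brute Force -Web"]

le = LabelEncoder()
y = le.fit_transform(labels)  # array([0, 1, 2, 0])

# Show the learned class-to-integer mapping.
print(dict(zip(le.classes_, le.transform(le.classes_))))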
Principal Component Analysis:
Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one that still contains most of the information in the large set. Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity: smaller data sets are easier to explore and visualize, and machine learning algorithms can analyze the data much faster without extraneous variables to process. To sum up, the idea of PCA is simple: reduce the number of variables of a data set while preserving as much information as possible.
I have chosen the combined features that have the most variance among them; the remaining features, having the least variance, do not carry much significance. These are the principal components. From Figure 1, it looks like around 5 or 6 of the PCA features explain the majority, almost 95%, of the variance in the data.
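A minimal sketch of inspecting the explained variance with scikit-learn; the random matrix here is only a placeholder for the pre-processed feature matrix (on the real features, Figure 1 shows ~5-6 components reaching 95%):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder standing in for the pre-processed feature matrix.
X = np.random.rand(1000, 80)

# Scale first: PCA is sensitive to feature scale.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components explaining >= 95% of the variance:
print(np.argmax(cumulative >= 0.95) + 1)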
Next, keep only the top 2 PCA features (this value can be changed to any other number of components). We then plot a scatter plot of the data points, colored by class, based on these top components. Fig 2 shows this scattering of data points.
From the data set we see that it has 80 features, so let's reduce it to only 2 principal features and then visualize the scatter plot of these new independent variables; a sketch follows below. The principal components are calculated only from the features; no information from the classes is considered. PCA is therefore an unsupervised method, and it is difficult to interpret the two axes, as they are complex mixtures of the original features.
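Continuing the sketch above, a projection onto the top 2 components and the scatter plot of Fig 2 (the labels y here are placeholders for the encoded class labels):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Placeholder for the encoded class labels (4 classes in the dataset).
y = np.random.randint(0, 4, size=len(X_scaled))

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="viridis", s=5)
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.title("Data points projected onto the top 2 principal components")
plt.show()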
The majority of the data points belong to one class due to extreme class imbalance:
Most dominant class: around 1,048,009 data points
Second class: 362
Third class: 151
Fourth class: 53
PHASE-3
Optimization using Deep Learning Models:
A classification problem may be only a little skewed, as when there is a slight imbalance. Alternatively, the classification problem may have a severe imbalance, where there might be hundreds or thousands of examples in one class and tens of examples in another class in a given training dataset.
• Slight Imbalance: an imbalanced classification problem where the distribution of examples is uneven by a small amount in the training dataset.
• Severe Imbalance: an imbalanced classification problem where the distribution of examples is uneven by a large amount in the training dataset.
A slight imbalance is often not a concern, and the problem can often be treated like a normal classification predictive modeling problem. A severe imbalance of the classes can be challenging to model and may require the use of specialized techniques. The class or classes with abundant examples are called the major or majority classes, whereas the class with few examples (and there is typically just one) is called the minor or minority class.
• Majority Class: The class (or classes) in an imbalanced classification predictive modeling problem that has many examples.
• Minority Class: The class in an imbalanced classification predictive modeling problem that has few examples.
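A quick way to check which regime a dataset falls into is to inspect the per-class counts; in this sketch the CSV path and the 'Label' column name are assumptions:

import pandas as pd

# Hypothetical path; 'Label' as the class column name is also an assumption.
df = pd.read_csv("dataset.csv")
counts = df["Label"].value_counts()
print(counts)

# With the counts reported above (~1,048,009 vs 53), the ratio is roughly
# 20,000:1, which is clearly a severe imbalance.
print("imbalance ratio:", counts.max() / counts.min())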
MODEL ARCHITECTURE FOR FEATURE EXTRACTION:
Sparse, stacked, and variational autoencoders are used to reconstruct the input data.
IMBALANCED DATA WITH FEATURE EXTRACTION:
In this setting, a stacked encoder model with a comparatively small number of learnable parameters (8,923) is trained for just 5 epochs. The number of parameters and epochs is small because, after the feature selection task, the model is trained on a reduced set of only 39 features.
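A minimal Keras sketch of a stacked autoencoder in this spirit; the layer sizes are illustrative and are not the exact architecture behind the 8,923 parameters reported here:

from tensorflow import keras
from tensorflow.keras import layers

n_features = 39  # number of features after feature selection

# Stacked (multi-layer) autoencoder: the encoder compresses the input,
# the decoder reconstructs it.
inputs = keras.Input(shape=(n_features,))
encoded = layers.Dense(32, activation="relu")(inputs)
encoded = layers.Dense(16, activation="relu")(encoded)   # latent space
decoded = layers.Dense(32, activation="relu")(encoded)
decoded = layers.Dense(n_features, activation="linear")(decoded)

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X_train, X_train, epochs=5, batch_size=256)

# The trained encoder then provides the latent features for the classifier.
encoder = keras.Model(inputs, encoded)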
BILSTM: The latent feature space of the autoencoder is so discriminative that the second model, a BiLSTM used as the classifier, converges in just a single epoch.
Because the number of features is small, the architecture of the BiLSTM network needs fewer parameters to converge; the resulting architecture has 264,708 learnable parameters. A deep network with many parameters over a small feature space is likely to overfit, so reducing model capacity becomes quite important to limit overfitting.
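A Keras sketch of a BiLSTM classifier over the latent features, treating each latent vector as a length-1 sequence (a common trick for non-sequential inputs); the sizes are illustrative, not the exact 264,708-parameter network:

from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 16  # size of the autoencoder's latent space (from the sketch above)
n_classes = 4    # 3 attack classes plus the dominant class

model = keras.Sequential([
    keras.Input(shape=(latent_dim,)),
    # Reshape the flat latent vector into a (timesteps, features) sequence.
    layers.Reshape((1, latent_dim)),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(Z_train, y_train, epochs=1)  # Z_train: encoder(X_train)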
IMBALANCED DATA WITH PCA:
In this setting, a stacked encoder model with 62,552 learnable parameters is trained for 30 epochs on 76 features. Compared with the previous model, the number of features is approximately double, so the required number of epochs is comparatively large (30 vs 5).
Autoencoders are learned automatically from data examples, which is a useful property: it is easy to train specialized instances of the algorithm that will perform well on a specific type of input. This does not require any new engineering, just appropriate training data.
BILSTM:
The number of features is larger, so the architecture of the BiLSTM needs more parameters to converge; the resulting architecture has 857,604 learnable parameters.
SMOTE ALGORITHM:
New synthetic data samples are first generated using the SMOTE algorithm to overcome the data imbalance problem. A stacked encoder model with a large number of learnable parameters (62,552) is trained for 40 epochs. The number of parameters and epochs is large because, after data generation, we are left with almost 4 times the number of data samples (from roughly 1,000,000 to 4,000,000).
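A minimal sketch of the oversampling step with imbalanced-learn's SMOTE (default settings; the toy data from make_classification stands in for the pre-processed flow features):

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy imbalanced data standing in for the pre-processed features and labels.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=3, weights=[0.96, 0.03, 0.01],
                           random_state=42)

# SMOTE synthesizes new minority-class samples by interpolating
# between existing minority samples and their nearest neighbors.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))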
(Figure: training of the stacked encoder model.)
The number of features remains unchanged even after the artificial data is generated, so the architecture of the BiLSTM network is the same as before; it has 264,708 learnable parameters.
References:
Network Attacks Detection Methods Based on Deep Learning Techniques: A Survey - Yirui Wu, Dabao Wei, Jun Feng, 28 Aug 2020, https://www.hindawi.com/journals/scn/2020/8872923/
A Survey of Machine and Deep Learning Methods for Internet of Things (IoT) Security - M. A. Al-Garadi, A. Mohamed, A. Al-Ali, X. Du, M. Guizani, 2018, https://arxiv.org/abs/1807.11023
SMOTE for Imbalanced Classification with Python - Jason Brownlee, 17 Jan 2020, https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
Principal Component Analysis (PCA) with Python Examples — Tutorial - Saniya Parveez, Roberto Iriondo, 8 Jan 2021, https://pub.towardsai.net/principal-component-analysis-pca-with-python-examples-tutorial-67a917bae9aa
A Stacked Autoencoder-Based Deep Neural Network for Achieving Gearbox Fault Diagnosis - Guigang Liu, Huaiqian Bao, Baokun Han, 25 July 2018, https://www.hindawi.com/journals/mpe/2018/5105709/
Sparse, Stacked and Variational Autoencoder - Venkata Krishna Jonnalagadda, 6 Dec 2018, https://link.medium.com/toCyh2j0Zfb
Stacked Autoencoders for the P300 Component Detection - Lukas Vareka, Pavel Mautner, 30 May 2017, https://www.frontiersin.org/articles/10.3389/fnins.2017.00302/full
Applied Deep Learning - Part 3: Autoencoders - Arden Dertat, 3 Oct 2017, https://link.medium.com/OBWLEkv0Zfb
Densely Connected Bidirectional LSTM with Applications to Sentence Classification - Zixiang Ding, Rui Xia, Jianfei Yu, Xiang Li, Jian Yang, 3 Feb 2018, https://arxiv.org/abs/1802.00889
SMOTE: Synthetic Minority Over-sampling Technique - N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, 9 June 2011, https://arxiv.org/abs/1106.1813
GITHUB Repository:
https://github.com/bharadwajburra/Capstone-606