Design: The model consists of two parts, namely a denoising autoencoder (DAE) and a LightGBM classifier. The network security datasets used to build the IDS are high dimensional and contain noise and corruptions. We propose to use the DAE to reduce the dimensionality of the dataset. The Gaussian noise added at the input layer of the DAE forces the network to remove those distortions and acts as partial regularization on the encoder side, which extracts the useful hidden patterns. It outperforms other feature extraction techniques such as PCA and the traditional AE, as well as traditional feature selection techniques. The LightGBM classifier has various regularization strategies that give high prediction performance with fast training speed. It outperforms the traditional ML classifiers, namely SVM, KNN, NB, MLP, LG and DT, which give lower prediction performance on the high-dimensional network traffic used for the IDS model. The advantages of both the DAE and LightGBM are combined to give a robust hybrid NIDS that can detect various types of intrusions.
The proposed model is built and evaluated using nine standard public benchmark datasets. The processed data, the extracted features and sample source code for one of the datasets (ISCX-Tor2016) are given below:
Data processing:
In this stage, useless values (e.g. NaN and Infinity) and redundant records are removed from the dataset. The categorical feature values are converted to numerical values through one-hot encoding, and the values are standardized/normalized using normalization techniques. The processed feature values are given in the file Torprocessed.csv.
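A minimal sketch of this processing step using pandas and scikit-learn is shown below. It is illustrative, not the exact pipeline from Torcodes.ipynb; in particular, the name of the class column ("label") is an assumption and should be adjusted to the actual dataset schema.

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    # Load the raw dataset (file name from step 4 of the instructions).
    df = pd.read_csv("tornew.csv")

    # Remove useless values: treat Infinity as NaN, then drop those rows.
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    df.dropna(inplace=True)

    # Remove redundant (duplicate) records.
    df.drop_duplicates(inplace=True)

    # One-hot encode the categorical feature columns ("label" is assumed
    # to be the class column and is left out of the encoding).
    categorical_cols = df.select_dtypes(include="object").columns.drop("label", errors="ignore")
    df = pd.get_dummies(df, columns=list(categorical_cols))

    # Normalize the feature values to the range [0, 1].
    feature_cols = df.columns.drop("label")
    df[feature_cols] = MinMaxScaler().fit_transform(df[feature_cols])

    df.to_csv("Torprocessed.csv", index=False)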
Feature Extraction:
After processing, the feature values are given as input to the DAE to extract the low-dimensional hidden features. The dataset contains 26 high-dimensional features. We extract 18 hidden patterns in a lower-dimensional manifold, and the extracted features are shown in the file Torfeatures.csv.
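A minimal Keras sketch of such a DAE (26 inputs, one hidden layer of 18 nodes, Gaussian noise corruption at the input, as described in step 10 below) is given here. The noise level, activations and training settings are assumed values for illustration, not the tuned values of the final model.

    import pandas as pd
    from tensorflow.keras.layers import Input, GaussianNoise, Dense
    from tensorflow.keras.models import Model

    # Load the processed features ("label" as the class column is an assumption).
    df = pd.read_csv("Torprocessed.csv")
    X = df.drop(columns=["label"]).to_numpy().astype("float32")

    n_features, n_hidden = X.shape[1], 18  # 26 input features -> 18 hidden features

    # Denoising autoencoder: corrupt the input with Gaussian noise and train
    # the network to reconstruct the clean, uncorrupted input.
    inp = Input(shape=(n_features,))
    noisy = GaussianNoise(0.1)(inp)        # stddev 0.1 is an assumed value
    encoded = Dense(n_hidden, activation="relu")(noisy)
    decoded = Dense(n_features, activation="sigmoid")(encoded)

    autoencoder = Model(inp, decoded)
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(X, X, epochs=50, batch_size=256, validation_split=0.1)

    # The trained encoder alone maps the 26 features to the 18 hidden features.
    encoder = Model(inp, encoded)
    pd.DataFrame(encoder.predict(X)).to_csv("Torfeatures.csv", index=False)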
Classification:
The extracted features are fed as input to the LightGBM model to classify the input records, and remarkable prediction performance is obtained. The source code for all the data processing, feature extraction and classification tasks is given in the file Torcodes.ipynb.
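A minimal classification sketch with LightGBM follows; the hyperparameters shown are illustrative defaults rather than the tuned values (see step 12 below), and reading the labels from Torprocessed.csv is an assumption.

    import pandas as pd
    from lightgbm import LGBMClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Extracted 18-dimensional features plus the class labels.
    X = pd.read_csv("Torfeatures.csv")
    y = pd.read_csv("Torprocessed.csv")["label"]

    # 80%/20% train/test split, as in step 8 of the instructions.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)

    clf = LGBMClassifier(n_estimators=200, learning_rate=0.1)  # illustrative values
    clf.fit(X_train, y_train)

    print(classification_report(y_test, clf.predict(X_test)))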
Platform:
The proposed NIDS model is built on the Python platform (Anaconda distribution) and its supporting libraries.
Instructions for use:
1. The Anaconda software can be downloaded from the website https://docs.anaconda.com/anaconda/install/windows/, and the Jupyter Notebook can be used as the code editor (to write and edit the source code).
2. The default working environment can be set up, and the files are stored there.
3. The other supporting libraries, such as NumPy, SciPy, LightGBM, Keras with the TensorFlow backend, scikit-learn and Pandas, can be pre-installed on the platform using the pip command (e.g. pip install scikit-learn).
4. The ISCX-Tor2016 dataset can be downloaded from the website https://www.unb.ca/cic/datasets/tor.html. It is saved as tornew.csv.
5. The file contains 26 features with 152,030 samples and an output label that represents two classes, namely Tor and non-Tor.
6. The file is imported into the platform as tornew.csv.
7. The imported CSV file is processed as described in the Data processing section, and the results (processed data) are shown in the file Torprocessed.csv.
8. The data is divided into an 80% training set and a 20% test set. The training set is used to train the model and the test set is used to measure the model performance (see the evaluation sketch after this list).
9. The processed feature values are passed to the DAE to extract the 18 low-dimensional hidden patterns from the 26 high-dimensional feature values.
10. The DAE used here contains a single hidden layer with 18 nodes to carry out the feature extraction task. The extracted features are shown in the file Torfeatures.csv. The number of hidden nodes/patterns (here, 18) depends on the input dataset and the model performance.
11. Finally, the LightGBM classifier is used to classify the input records using the extracted feature values, and the prediction performance is measured using the standard quality metrics, namely accuracy, precision, recall and F1-score. The training time of the classification model is also recorded.
12. The hyperparameters of the DAE and LightGBM can be fine-tuned to obtain higher prediction performance.
13. The entire source code for the model is given in the file Torcodes.ipynb.
14. The source code can then be executed. Since ML training is a stochastic process, the experiments can be repeated 25 to 30 times and the average performance reported.
15. Note that the proposed hybrid model outperforms the other traditional feature selection and classification methods in the existing models chosen for comparison with our model.
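Steps 8, 11 and 14 can be combined into a small evaluation loop such as the sketch below. The metrics follow the instructions; the number of repetitions, the weighted averaging for precision/recall/F1 and the label column name are assumptions.

    import time
    import numpy as np
    import pandas as pd
    from lightgbm import LGBMClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import (accuracy_score, precision_score,
                                 recall_score, f1_score)

    X = pd.read_csv("Torfeatures.csv")
    y = pd.read_csv("Torprocessed.csv")["label"]   # label column name is assumed

    runs, results, train_times = 30, [], []
    for seed in range(runs):
        # Step 8: 80%/20% train/test split (a fresh split per run).
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y)

        clf = LGBMClassifier(random_state=seed)
        start = time.time()
        clf.fit(X_tr, y_tr)                        # step 11: train and time the model
        train_times.append(time.time() - start)

        pred = clf.predict(X_te)
        results.append([accuracy_score(y_te, pred),
                        precision_score(y_te, pred, average="weighted"),
                        recall_score(y_te, pred, average="weighted"),
                        f1_score(y_te, pred, average="weighted")])

    # Step 14: report the average performance over the repeated experiments.
    acc, prec, rec, f1 = np.mean(results, axis=0)
    print(f"Accuracy={acc:.4f}  Precision={prec:.4f}  Recall={rec:.4f}  F1={f1:.4f}")
    print(f"Mean training time: {np.mean(train_times):.2f} s")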
PS: Related files can be downloaded from HERE.