Biweekly Updates

Biweekly Update (Oct 10 - Oct 24)

In the last two weeks, I went through some previous papers and related work. Several works perform botnet detection by analyzing network flows with machine learning and deep learning methods. Delplace et al. [1] showed that, among the machine learning algorithms compared in their paper, Random Forest performs best for botnet detection on the CTU-13 dataset. Other works use deep neural networks, such as LSTMs and RNNs, for botnet detection. McDermott et al. [2] proposed a BiLSTM-based detection method, and their results show that the BiLSTM works well, achieving high precision and low loss.

I will first try to do the prediction with an LSTM network. I studied the LSTM architecture in more detail over the last two weeks and ran some initial training and testing with it. Next, I will feed the network traffic dataset to the LSTM model for training and testing to evaluate its performance. I will use the CTU-13 dataset for training, as it is a complete dataset containing 13 captures of different botnet samples. After that, I will try some other deep neural networks and compare the results to find the best one. I will spend the next two weeks on the programming work.
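As a rough starting point, the sketch below shows the kind of single-layer LSTM classifier I plan to try, assuming each sample is a short sequence of numeric NetFlow features with a binary botnet/normal label; the sequence length, feature count, and layer size are placeholders rather than final choices.

    import tensorflow as tf

    # Placeholder input shape: sequences of 10 time steps, each with 12
    # numeric NetFlow features (duration, bytes, packets, ...).
    SEQ_LEN, N_FEATURES = 10, 12

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(SEQ_LEN, N_FEATURES)),
        tf.keras.layers.LSTM(64),                        # single LSTM layer to start
        tf.keras.layers.Dense(1, activation="sigmoid"),  # botnet probability
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])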

[1] Delplace A, Hermoso S, Anandita K. Cyber Attack Detection thanks to Machine Learning Algorithms. arXiv preprint arXiv:2001.06309. 2020 Jan 17.

[2] McDermott CD, Majdani F, Petrovski AV. Botnet detection in the Internet of Things using deep learning approaches. In: 2018 International Joint Conference on Neural Networks (IJCNN); 2018 Jul 8. p. 1-8. IEEE.


Biweekly Update (Nov 7 - Nov 20)

I worked on programming and implementation in the last two weeks. I finished the data processing for the CTU-13 dataset and dropped the features that contributed little to training or contained too many NULL values. The dataset is very unbalanced: only 1.7% of the records are labeled as “botnet”. This clearly affects the final result, biasing the model toward predicting low values and mixing up the labels. I tried assigning weights to the different labels, but that did not work well, so I reorganized the dataset and dropped many rows to raise the proportion of “botnet” records.
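The row-dropping step amounts to random undersampling of the majority class. Below is a minimal sketch of that idea, assuming the processed CTU-13 data sits in a pandas DataFrame with a binary label column (1 = botnet); the target botnet fraction is only illustrative.

    import pandas as pd

    def undersample(df: pd.DataFrame, target_botnet_frac: float = 0.2,
                    seed: int = 42) -> pd.DataFrame:
        """Randomly drop normal rows until botnet rows reach the target fraction."""
        botnet = df[df["label"] == 1]
        normal = df[df["label"] == 0]
        # Number of normal rows that keeps botnet at the desired fraction.
        n_normal = int(len(botnet) * (1 - target_botnet_frac) / target_botnet_frac)
        normal = normal.sample(n=min(n_normal, len(normal)), random_state=seed)
        # Concatenate and shuffle so the classes are interleaved again.
        return pd.concat([botnet, normal]).sample(frac=1, random_state=seed)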

I finished modeling the RNN, LSTM, and GRU in TensorFlow. According to Goldberg's work [1], deeper RNNs may work better than shallower ones in some cases, so in these models I stack several recurrent (GRU, LSTM) layers. I used CuDNNGRU and CuDNNLSTM from the Keras library instead of the classic layers; these are the cuDNN-backed versions of GRU and LSTM, which make training and inference faster. In further work, I will examine the performance difference between the cuDNN and classic implementations. A sketch of the stacked architecture is given below, and Figure 1 shows the metrics of the GRU model.
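Below is a sketch of the stacked GRU variant, assuming the same fixed-length flow-feature sequences as before. The sketch uses the standard tf.keras.layers.GRU, which in TensorFlow 2 automatically uses the cuDNN kernel on a GPU with the default arguments; the explicit CuDNNGRU/CuDNNLSTM layers I used come from the older Keras/TF 1.x API.

    import tensorflow as tf

    SEQ_LEN, N_FEATURES = 10, 12   # placeholder input shape, as before

    # Stacked recurrent layers: every layer except the last returns the full
    # sequence so the next recurrent layer has something to consume.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(SEQ_LEN, N_FEATURES)),
        tf.keras.layers.GRU(64, return_sequences=True),
        tf.keras.layers.GRU(64, return_sequences=True),
        tf.keras.layers.GRU(32),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy",
                           tf.keras.metrics.Precision(),
                           tf.keras.metrics.Recall()])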

Figure 1. GRU metrics

The confusion matrix is shown below:

Figure 2. Confusion matrix

The confusion matrix shows the true negative, false positive, false negative, and true positive predictions of the model. As can be seen from the figure, the model correctly predicted most botnet flows.
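For reference, the counts in Figure 2 come from thresholding the predicted probabilities at 0.5; a small helper like the one sketched below (the model and test-set arguments are placeholders) reproduces them with scikit-learn.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    def report_confusion(model, X_test, y_test, threshold=0.5):
        """Print TN/FP/FN/TP counts from the model's predicted probabilities."""
        y_prob = np.asarray(model.predict(X_test)).ravel()
        y_pred = (y_prob >= threshold).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
        print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")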

I then organized the model into a real-time system. I used pyshark to capture packets in real time, parsed them to get the data I need, and organized them into the format of the CTU-13 dataset. As in CTU-13, I consider bidirectional NetFlows. I set the time window to 5 s: all packets within a 5 s window that share the same source, destination, and protocol are grouped together. These aggregated flows are sent to the models for prediction. If the predicted value is smaller than 0.5, the system treats the flow as a normal NetFlow; otherwise, it treats it as a botnet NetFlow, reports the source and destination, and blocks the source IP address. A simplified sketch of the capture-and-aggregate loop is given below, and Figure 3 shows the running process.
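The sketch below covers only the capture-and-aggregate step. It assumes pyshark is installed; the flow_to_features helper and the trained model in the commented caller section are placeholders, and the bidirectional pairing and IP-blocking steps of the real system are omitted.

    import time
    from collections import defaultdict
    import pyshark

    WINDOW = 5  # seconds, same window as the real-time system

    def capture_flows(interface="eth0"):
        """Group live packets into 5-second flows keyed by (src, dst, protocol)."""
        flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
        window_start = time.time()
        capture = pyshark.LiveCapture(interface=interface)

        for pkt in capture.sniff_continuously():
            if not hasattr(pkt, "ip"):
                continue  # only IP traffic is aggregated
            key = (pkt.ip.src, pkt.ip.dst, pkt.transport_layer)
            flows[key]["packets"] += 1
            flows[key]["bytes"] += int(pkt.length)

            if time.time() - window_start >= WINDOW:
                yield dict(flows)       # hand the finished window to the caller
                flows.clear()
                window_start = time.time()

    # Caller side (flow_to_features and model are placeholders):
    # for window in capture_flows("eth0"):
    #     for key, stats in window.items():
    #         prob = model.predict(flow_to_features(key, stats))[0][0]
    #         if prob >= 0.5:
    #             print("botnet NetFlow:", key)  # the real system also blocks the source IP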

Figure 3. Real-time system

Currently, all the packets I capture are classified as normal NetFlow. I think there are two main possible reasons:

  1. The CTU-13 dataset is too old and no longer represents real-world botnets, so the system can't classify the NetFlows correctly.

  2. Firewalls and security software are powerful enough now that botnets can't reach my personal computer.

In the next two weeks, I will try to find and use a newer dataset to see whether it makes a difference. I will also make some modifications to the models and the system to see how they perform.

[1] Goldberg Y. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research. 2016 Nov 20;57:345-420.