Changes in Proposal
This project is now individual work. After spending some time reviewing the topic, I decided to make some changes to my proposal. I analyzed the CTU-13 dataset [1] and found that it contains a large amount of data, enough to train deep learning algorithms. My overall goal is to build a prediction model that can detect botnets, and I think deep learning models are more appropriate for this task. I therefore decided not to use random forests or other classical machine learning models and to focus only on deep learning methods such as RNN, LSTM, and GRU.
Data Processing
The CTU-13 dataset includes botnet traffic captures stored in pcap files. The complete pcap files containing all the background, normal, and botnet traffic are not available for privacy reasons; only the pcap files of the botnet captures are available. The botnet pcap file is shown as follows:
Meanwhile, CTU-13 also includes labeled bidirectional NetFlow files (the traffic flowing in both directions between two hosts is treated as one flow). I downloaded the cleaned NetFlow data from Kaggle [2], in which all data types are set correctly and there are no missing values or duplicate records. The cleaned NetFlow file is stored in CSV and structured as follows:
As can be seen in the images, the bidirectional NetFlow data include several attributes and labels. The labels cover normal traffic flows and different types of botnet flows, so the labeled data can be used to train deep learning models. One problem I ran into when converting these data into tensors is that PyTorch tensors do not accept strings as input. Dsouza posted a tutorial on botnet detection with CTU-13 using machine learning [3], and I followed its approach to data processing. I used integers to represent the string values of Proto, State, and Labels. I used 80% of the dataset as training data and 20% as testing data.
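The encoding and split described above can be sketched in plain Python as follows. This is only a minimal illustration, not the exact processing code: the helper names (`build_vocab`, `encode_rows`, `train_test_split`), the `Dur` and `Label` column names, and all row values are hypothetical and made up for the example; only the idea of mapping the string columns to integers and splitting 80/20 comes from the text.

```python
import random

def build_vocab(values):
    """Map each distinct string to an integer ID, in order of first appearance."""
    vocab = {}
    for v in values:
        if v not in vocab:
            vocab[v] = len(vocab)
    return vocab

def encode_rows(rows, categorical_cols):
    """Return integer-encoded copies of the rows plus the per-column vocabularies."""
    vocabs = {col: build_vocab(r[col] for r in rows) for col in categorical_cols}
    encoded = []
    for r in rows:
        out = dict(r)  # copy so the original row is left untouched
        for col in categorical_cols:
            out[col] = vocabs[col][r[col]]
        encoded.append(out)
    return encoded, vocabs

def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle with a fixed seed, then cut into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy rows standing in for the NetFlow CSV (values are made up).
protos = ["tcp", "udp", "icmp"]
states = ["CON", "INT", "FSPA_FSPA"]
labels = ["Normal", "Botnet"]
rows = [
    {"Dur": 0.5 * i, "Proto": protos[i % 3], "State": states[i % 3], "Label": labels[i % 2]}
    for i in range(10)
]

encoded, vocabs = encode_rows(rows, ["Proto", "State", "Label"])
train, test = train_test_split(encoded)
print(vocabs["Proto"], len(train), len(test))
```

Once every column is numeric, each row can be turned into a list of numbers and passed to a tensor constructor. Keeping the vocabularies around makes it possible to map predicted label IDs back to their original strings.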
Modeling
I’m currently working on the programming and training of the RNN and LSTM models. Unlike feedforward ML or DL models, an RNN is recurrent: the output of a layer at one step is also used as part of its input at the next step. An RNN applies the same operation to every element of the input sequence, and each operation depends on the previous results. In other words, an RNN can “remember” what has been calculated before. The structure of RNN is shown as follows:
LSTM is a variant of RNN. It contains an input gate, a forget gate, and an output gate. The input gate decides what information is added to the memory. The forget gate decides what information is discarded from the memory. The output gate decides what information is output from the current state. LSTM mitigates the vanishing gradient problem of the plain RNN. The structure of LSTM is described as follows:
I expect these models to perform well on the NetFlow dataset for botnet prediction. I use PyTorch to implement these DL models.
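The recurrence described above can be sketched in a few lines of plain Python. This is an illustrative Elman-style update, h_t = tanh(W_x x_t + W_h h_{t-1} + b), not the PyTorch model used in the project; the function names, the weights, and the toy input sequence are all hypothetical.

```python
import math

def rnn_step(x, h_prev, W_x, W_h, b):
    """One recurrent step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    h = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_x[i][j] * x[j] for j in range(len(x)))
        total += sum(W_h[i][k] * h_prev[k] for k in range(len(h_prev)))
        h.append(math.tanh(total))
    return h

def rnn_forward(sequence, W_x, W_h, b):
    """Run the same step over every element, carrying the hidden state along."""
    h = [0.0] * len(b)  # initial hidden state
    states = []
    for x in sequence:
        h = rnn_step(x, h, W_x, W_h, b)  # same weights reused at every step
        states.append(h)
    return states

# Hypothetical tiny weights: 2 input features, 2 hidden units.
W_x = [[0.5, -0.3], [0.1, 0.8]]
W_h = [[0.2, 0.0], [-0.4, 0.3]]
b = [0.0, 0.1]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

states = rnn_forward(sequence, W_x, W_h, b)
for t, h in enumerate(states):
    print(t, h)
```

Because the hidden state is carried from step to step, the last input produces a different state here than it would if it were fed to the network on its own, which is the "memory" effect mentioned above.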
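The three gates can be made concrete with a single-unit LSTM step written in plain Python. This is a simplified sketch of the standard LSTM update with scalar state, not the PyTorch implementation used in the project; the parameter dictionary `p` and its values are hypothetical and were chosen so that the forget gate stays open and the input gate stays closed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM with scalar input, hidden and cell state."""
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate memory
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new
    h = o * math.tanh(c)     # expose part of the memory as the output
    return h, c

# Hypothetical parameters: forget gate nearly 1, input gate nearly 0.
p = {
    "w_i": 0.0, "u_i": 0.0, "b_i": -10.0,
    "w_f": 0.0, "u_f": 0.0, "b_f": 10.0,
    "w_o": 0.0, "u_o": 0.0, "b_o": 0.0,
    "w_g": 0.0, "u_g": 0.0, "b_g": 0.0,
}

h, c = 0.0, 1.0  # start with some content already stored in the memory cell
for _ in range(50):
    h, c = lstm_step(0.5, h, c, p)
print(h, c)
```

With these settings the cell state is almost unchanged after 50 steps, because it is updated additively through the forget gate rather than being squashed at every step. This is the mechanism that lets an LSTM carry information (and gradients) across long sequences better than the plain RNN.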
Further Work
I’m still working on the programming and training of the RNN and LSTM models. I will also try other DL models on CTU-13 and then organize all the models into one system. I expect to finish the programming and training of all models in the next two weeks. By the end of November, I will deliver a system that allows users to compare the performance of different models in botnet detection.
References
[1] Garcia S, Grill M, Stiborek J, Zunino A. An empirical comparison of botnet detection methods. Computers & Security. 2014 Sep 1;45:100-23.
[2] D'hooge, StrGenIx, “CTU-13”, Kaggle, 2022, CTU-13 | Kaggle.
[3] Dsouza, Melisha, “Build botnet detectors using machine learning algorithms in Python [Tutorial]”, <packt>hub, 2018, Build botnet detectors using machine learning algorithms in Python [Tutorial] | Packt Hub (packtpub.com).
[4] “What’s the difference between CNN and RNN?”, TELUS international, 2021, What's the Difference Between CNN and RNN? (telusinternational.com).
[5] “Long short-term memory”, Wikipedia, Long short-term memory - Wikipedia.