As the pandemic worsens, watch as our hero describes how to deal with imbalanced datasets. Time is now only measured in epochs.
I run the dataset through machine learning algorithms while attempting to socially distance myself from family and friends. I also set up my initial neural network and fulminate about a certain research paper not reporting enough hyperparameters.
Reproducibility now!
This video discusses the research involved with the UNSW-NB15 dataset. Topics include feature selection, machine learning algorithms already used on the dataset, and reproducibility. Watch me attempt to say "Network Intrusion Detection System".
Source Time-To-Live:
https://searchnetworking.techtarget.com/definition/time-to-live
"Time-to-live (TTL) is a value in an Internet Protocol (IP) packet that tells a network router whether or not the packet has been in the network too long and should be discarded."
Percent of bad traffic sttl greater than 50: 89.73%
Percent of bad traffic sttl less than 50: 0.71%
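The two percentages above do not sum to 100%, so one plausible reading (an assumption on my part, not something stated here) is that each is the share of bad traffic within its own sttl group. Under that reading, the calculation can be sketched as follows. The function name `attack_rate_by_sttl` and the toy records are hypothetical, and the `(sttl, label)` pairs assume a label of 1 for bad traffic and 0 for normal traffic.

```python
def attack_rate_by_sttl(records, threshold=50):
    """Return the percent of bad traffic above and below an sttl threshold.

    `records` is an iterable of (sttl, label) pairs, where label is 1 for
    bad traffic and 0 for normal traffic (an assumed encoding).
    A value exactly equal to the threshold is counted in the "below" group.
    """
    counts = {"above": [0, 0], "below": [0, 0]}  # [bad records, all records]
    for sttl, label in records:
        group = "above" if sttl > threshold else "below"
        counts[group][0] += label
        counts[group][1] += 1
    return {
        group: (100.0 * bad / total if total else float("nan"))
        for group, (bad, total) in counts.items()
    }


# Toy records invented for illustration only; these are not UNSW-NB15 rows.
toy_records = [
    (254, 1), (254, 1), (254, 1), (62, 0),  # sttl above 50
    (31, 0), (31, 0), (31, 0), (31, 1),     # sttl below 50
]
rates = attack_rate_by_sttl(toy_records)
print(rates)
```

On the real data, the same grouping would be applied to the sttl and label columns instead of the toy list.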
The purpose of this project is to use machine learning algorithms to identify malware attacks based on the UNSW-NB15 dataset. This dataset combines real network traffic with synthetic malware attacks and was created by the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS). It has 49 features, including the label. They consist largely of network data such as IP addresses, connection times, and number of connections.
The biggest challenge I will face in this project is domain knowledge. I have no experience dealing with cybersecurity issues, so this is an opportunity to expand my skill set and knowledge base. While I have a basic understanding of the way networks work, some of the features in this dataset mean nothing to me at the start of this project. And while the dataset is daunting at first glance, the lack of domain knowledge is mitigated by two factors. First, this subject is already well researched. There will be an opportunity to read the works of those who have come before me. This will give me insight into what has been successful in the past, and possibly allow me to build on those methods. Second, this boils down to a classification problem, which is a problem that I am trained to solve as a data scientist.
The primary goal of this project is to build multiple machine learning models for this classification problem, optimize them, and compare their performance. Additionally, I will explain the differences in their performance. If time allows, a secondary goal is to build a neural network and include it in my comparison of results. Research on neural networks applied to this dataset is available, so while I expect to get a neural network working during the course of this project, I am not sure how well I will be able to optimize it. The last goal, which I am not sure I will achieve but want to aim for, is to go beyond replicating results from existing research and make some improvement that advances what is possible in this space.
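Comparing models on an imbalanced classification problem usually means looking past accuracy. Here is a minimal sketch of such a comparison, assuming binary labels with 1 for an attack and 0 for normal traffic. The helper `binary_metrics`, the model names, and the toy predictions are all hypothetical and are not results from this project.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = attack)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy imbalanced labels: 8 normal records, 2 attacks (invented for illustration).
y_true = [0] * 8 + [1] * 2

# Hypothetical predictions from two stand-in "models".
predictions = {
    "always_normal": [0] * 10,                 # never flags an attack
    "hypothetical_model": [0] * 7 + [1, 1, 1], # one false alarm, both attacks caught
}

results = {name: binary_metrics(y_true, y_pred)
           for name, y_pred in predictions.items()}
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

In this toy case the model that never flags an attack still reaches 80% accuracy, which is why recall and F1 belong in the comparison alongside it.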
The timeline of this project is outlined by the course syllabus. EDA and research will be completed by March 1st. Model construction will last until the end of March. Execution and interpretation of the model results will be done by May 1st.