Q:
Great presentation, Peter!
I love how detailed it was and it never dragged. One question I had was what part of your project do you find the most interesting thus far?
A:
Sure! The most interesting part so far has been building the real-time system, since I need to capture the same type of data as those datasets and then feed it into the model. I've had to review the datasets' papers to figure out how they did that. It has been an exciting process in which I learned how to combine independent packets into something meaningful.
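To give a flavor of what combining packets might look like, here is a minimal sketch, assuming Scapy for capture; the (src, dst) flow key and the 10-second window are illustrative choices of mine, not necessarily what the project uses:

```python
from collections import defaultdict
from scapy.all import IP, sniff  # needs scapy and capture privileges

# Aggregate individual packets that share a (src, dst) pair
# into a single flow record with packet and byte counts.
flows = defaultdict(lambda: {"packets": 0, "bytes": 0})

def handle(pkt):
    if IP in pkt:
        key = (pkt[IP].src, pkt[IP].dst)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += len(pkt)

# Capture IP traffic for a 10-second window; the per-flow records
# can then be turned into the same features the datasets provide.
sniff(filter="ip", prn=handle, timeout=10, store=False)
for key, stats in flows.items():
    print(key, stats)
```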
Q:
Hi Peter, great presentation! I was wondering how you decided certain attributes were unnecessary or meaningless. Do you think they might have affected your results if you had not removed them?
A:
Thank you for your question! Many features in the datasets, such as a specific IP address, port, or capture time, are meaningless for detection: it's nearly impossible to see the same values in real-world traffic because they are too specific. We want more general features instead, such as the total packet count, the total size of all packets, and the number of packets sharing the same source IP address. This over-specific data did affect my results: I hadn't removed it from the CTU-13 dataset, so I got a low precision rate. After switching to a more general dataset, UNSW-NB15, the precision rate increased significantly and reached nearly 98%.
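As a concrete sketch of this kind of feature pruning with pandas (the file and column names below are illustrative, not necessarily the datasets' exact headers):

```python
import pandas as pd

df = pd.read_csv("ctu13_flows.csv")  # hypothetical file name

# Drop columns that are too specific to generalize to live traffic.
over_specific = ["StartTime", "SrcAddr", "Sport", "DstAddr", "Dport"]
df = df.drop(columns=over_specific, errors="ignore")

# What remains are general features such as packet and byte totals.
print(df.columns.tolist())
```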
Q:
Excellent presentation, Peter!
I wanted to know what algorithm you used to clean the data, especially for the second dataset, which contained other attack scenarios.
A:
Thank you. I removed all NaN values and meaningless features from the CTU-13 dataset by converting the CSV files into dataframes. Then I converted all strings, such as proto, IP address, and time, into integers or floats, since NumPy cannot take strings as input: I mapped each proto to a specific integer and saved the mapping in a dictionary, and I used label 1 to represent botnet flows and 0 for normal netflows. For the second dataset, UNSW-NB15, I didn't have to do much cleaning, since it had already been cleaned: there is no IP or time in that dataset, and all attack scenarios are already labeled 0 or 1. All I did there was convert the proto values into integers and save them in a dictionary.
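A rough sketch of that cleaning pipeline in pandas, assuming CTU-13-style headers like Proto and Label (the exact file name, column names, and label strings are assumptions on my part):

```python
import pandas as pd

df = pd.read_csv("ctu13_flows.csv").dropna()  # hypothetical file; drop NaN rows

# Map each protocol string to an integer and keep the mapping,
# so live captures can be encoded the same way later.
proto_map = {p: i for i, p in enumerate(df["Proto"].unique())}
df["Proto"] = df["Proto"].map(proto_map)

# Collapse the textual labels to 1 for botnet and 0 for normal flows.
df["Label"] = df["Label"].str.contains("Botnet", case=False).astype(int)
```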
Q:
Hello Peter,
Very well-structured presentation. I like the fact that the slides are on point and there's a balance between text and pictures.
I have been following your project's progress throughout the term, and I have to congratulate you on the good work.
I also appreciate that you've talked about possible improvements and future work, and about the scope for further improvement.
Great work!
A:
Thank you so much for your interest in my project! There are still many weaknesses in it, so I wanted to point them out for future work. I believe it's very important to focus not only on this project itself but also on up-to-date work by others and on further improvements.
Q:
Great presentation! Those were some impressive accuracy scores. I look forward to the report!
A:
Thank you! The accuracy scores are actually not very high compared with some recent work. Accuracy measures overall correctness across both classes, but I think precision is a more important metric for evaluating this model, because it directly shows what fraction of the flows flagged as attacks really are attacks. As I mentioned in the slides, there are still many improvements that could be made to raise both precision and accuracy.
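For anyone who wants to see the difference between the metrics concretely, here is a tiny example with scikit-learn (the labels are made up purely for illustration):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels for illustration only: 1 = attack, 0 = normal.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

print("accuracy :", accuracy_score(y_true, y_pred))   # correct / total
print("precision:", precision_score(y_true, y_pred))  # flagged attacks that are real
print("recall   :", recall_score(y_true, y_pred))     # real attacks that were flagged
```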
Q:
Which features affect the result most? Also, for IP addresses and port numbers, although the exact values do not matter, the patterns may. E.g., if one computer talks to many, it may indicate port scanning, and if many talk to one, it is likely DDoS?
A:
Yes, the patterns are essential for detection. In the UNSW-NB15 dataset there are no exact IP or port values, only pattern features that count, within a particular time window, the number of packets sharing the same source IP, the same destination IP, the same source IP and destination port, and so on. Compared with the CTU-13 model, which used exact IP and port values as features, the precision rate of the UNSW-NB15 model increased by more than 60%.
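To sketch how such a windowed pattern feature could be derived from raw flow records with pandas (UNSW-NB15 ships these counts precomputed, e.g. ct_src_ltm, so the data and the 100-second window below are purely illustrative):

```python
import pandas as pd

# Tiny illustrative flow table; real records would come from the capture.
flows = pd.DataFrame({
    "time":  pd.to_datetime(["2015-01-01 00:00:01", "2015-01-01 00:00:02",
                             "2015-01-01 00:00:03", "2015-01-01 00:02:00"]),
    "srcip": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1"],
})

# Within a rolling 100-second window, count flows sharing the same
# source IP; analogous counts work for dst IP, port pairs, etc.
flows = flows.set_index("time").sort_index()
flows["ct_same_src"] = (
    flows.groupby("srcip")["srcip"]
         .transform(lambda s: s.rolling("100s").count())
)
print(flows)
```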
I also designed an experiment to find the most important features. The results showed that the pattern-related features affect the result most: after removing them from the training dataset, precision decreased by 1% and recall by 7%. Other features, such as packet count, packet size, and mean packet size, do not significantly impact the result.
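In sketch form, that experiment is a drop-column comparison. Here is a self-contained version of the idea; the random forest, the synthetic data, and the column names are illustrative stand-ins, not the project's actual model or features:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real experiment used UNSW-NB15.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((500, 4)),
                 columns=["ct_src_ltm", "ct_dst_ltm", "tot_pkts", "tot_bytes"])
y = (X["ct_src_ltm"] > 0.5).astype(int)  # toy labeling rule

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def evaluate(cols):
    """Train on the given columns and report precision and recall."""
    model = RandomForestClassifier(random_state=0)
    model.fit(X_train[cols], y_train)
    pred = model.predict(X_test[cols])
    return precision_score(y_test, pred), recall_score(y_test, pred)

print("with pattern features   :", evaluate(list(X.columns)))
print("without pattern features:", evaluate(["tot_pkts", "tot_bytes"]))
```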