DDoS Attack Classification

CSC 466 Project - Matthew Beingessner

Project Report

466projectreport.pdf

Project Demo/Presentation Video

Midterm Update

Something I underestimated in my project proposal was the extent to which imbalanced data affects model performance. My dataset [1] is composed of five classes for DDoS attack classification, including Normal (89.6% of total data), UDP-Flood (9.3%), SMURF (0.6%), SIDDOS (0.3%), and HTTP_Flood (0.2%). The imbalance is dramatic, and since my project’s main purpose is to try to improve SMURF classification rates beyond those accomplished in [3], I knew I had to do something differently to address this imbalance.

SMOTE (Synthetic Minority Over-sampling Technique) [2] was the first technique I tried. SMOTE works by synthesizing new examples for the minority classes, and while this sounded promising at first, it turned out to be impractical for my situation, both because of the large size of the dataset (~400 MB) and the sheer extent of its imbalance. For instance, with only 0.2% of the data corresponding to HTTP_Floods and 89.6% corresponding to normal network behaviour, scaling the smaller classes up to the size of the majority would have produced gigabytes of data. Although not impossible, this proved impractical given my limited computing resources. Additionally, my goals align with those of [3] in that I want to build something applicable to real life, where I’d imagine data distributions are, on average, even more skewed than they are in [1].
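A rough back-of-the-envelope calculation illustrates the size problem. This sketch assumes synthetic SMOTE rows take about the same space as original rows and that every minority class is oversampled up to the size of the Normal class; the ~400 MB figure and class percentages are the ones reported above.

```python
# Why full oversampling is impractical here: balancing every class up to
# the majority class size multiplies the row count substantially.
DATASET_MB = 400.0  # approximate size of the original dataset [1]
class_pct = {
    "Normal": 89.6,
    "UDP-Flood": 9.3,
    "SMURF": 0.6,
    "SIDDOS": 0.3,
    "HTTP_Flood": 0.2,
}

# After oversampling, each of the 5 classes is as large as "Normal",
# so the balanced dataset is (5 * 0.896) = 4.48x the original rows.
majority_share = max(class_pct.values()) / 100.0
balanced_fraction = len(class_pct) * majority_share
balanced_mb = DATASET_MB * balanced_fraction

print(f"Resampled size: ~{balanced_mb:.0f} MB ({balanced_fraction:.2f}x original)")
```

Under these assumptions the balanced dataset lands around 1.8 GB, and undersampling the majority instead would discard most of the data, which is why a reweighting approach looked more attractive.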

Instead, I’ve chosen to address the imbalance in a more computationally efficient and realistic way: cost-sensitive learning. This fits the problem because misclassifying a negative as a positive (i.e. flagging normal traffic as a DDoS) is less costly than misclassifying a positive as a negative (i.e. letting a DDoS through undetected), so errors on the rare attack classes can simply be weighted more heavily during training. I’ve thus refined my experimentation techniques to a more specific list than the one in my proposal, based on what will work best for efficiency and imbalance: logistic classification, decision trees, and random forests. I’d still like to try SVMs and neural networks if time permits, but I’ll mostly focus on the simpler approaches.
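As a minimal sketch of what cost-sensitive learning looks like in practice (not my actual pipeline), scikit-learn’s class_weight="balanced" option reweights the loss inversely to class frequency, so no data is physically resampled. The synthetic two-class data below is a hypothetical stand-in for the real network logs:

```python
# Cost-sensitive logistic classification on imbalanced data:
# class_weight="balanced" scales each class's errors by n_samples / (n_classes * count).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Synthetic stand-in for the skewed logs (two classes for brevity;
# the real dataset has five).
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X_tr, y_tr)

# Minority-class recall is the metric that matters for rare attacks.
print("plain   :", recall_score(y_te, plain.predict(X_te)))
print("weighted:", recall_score(y_te, weighted.predict(X_te)))
```

The same class_weight argument is accepted by scikit-learn’s decision tree and random forest classifiers, which is part of why those models made the shortlist.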

References

[1] J. Van Steyn, “DDOS Attack Network Logs.” 07-Apr-2020.

[2] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” J. Artif. Intell. Res., vol. 16, no. 1, pp. 321–357, 2002.

[3] M. Irenee, X. Hei, Y. Wang, W. Ji, and X. Jiang, “Network flow analytics: Multi-class classification of DDoS attacks based on OKNN,” in 2020 International Conference on Networking and Network Applications (NaNA), 2020.


Project Proposal

Problem

DDoS attacks flood targets with an excessive amount of requests to overload systems and prevent legitimate requests from going through. Companies fear them because they can cause the loss of customers, revenue, reputation, and more.

Previous Work

The paper that inspired this project performs DDoS attack type classification with Optimized K-Nearest Neighbors (OKNN). Unlike most studies, the authors use real network data, and they achieve high classification accuracy for the most part. However, they were only able to classify “Smurf” attacks at a rate of 67%. Interestingly, they are unsure why this discrepancy exists, and I’d like to find out. That said, my primary objective is to increase the Smurf attack classification rate, whether or not I come to understand the cause.

My Approach

I plan to experiment with several techniques, such as decision trees, support vector machines, and neural networks. I’ll compare the results from all of these and see whether I can achieve a rate higher than 67%. This new information, along with some extra analysis and visualization, might help shed light on why the original paper’s authors struggled to classify this particular attack type.
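A simple comparison harness for this plan could score each candidate model on per-class recall, since the target metric is the per-attack detection rate (like the 67% Smurf figure) rather than overall accuracy, which the huge Normal class would dominate. Everything below is a hypothetical sketch with synthetic multi-class data standing in for the real logs:

```python
# Compare candidate models on per-class recall over imbalanced multi-class data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Three synthetic classes with skewed frequencies (placeholder for the
# five-class dataset).
X, y = make_classification(n_samples=4000, n_features=12, n_informative=6,
                           n_classes=3, weights=[0.90, 0.07, 0.03],
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

models = {
    "decision tree": DecisionTreeClassifier(class_weight="balanced", random_state=1),
    "random forest": RandomForestClassifier(class_weight="balanced", random_state=1),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # average=None returns one recall value per class, so a rare class
    # with poor recall can't hide behind a high overall accuracy.
    per_class = recall_score(y_te, model.predict(X_te), average=None)
    print(name, [round(r, 2) for r in per_class])
```

Tracking recall per class directly surfaces the kind of single-class weakness the original paper observed for Smurf attacks.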

Schedule

October 9th - 23rd

  • Improve understanding of previous work

  • Learn/review prerequisite material regarding computer networks and data mining

October 23rd - November 6th

  • Start report

  • Prepare midterm presentation

November 6th - November 20th

  • Finish most of report

  • Prepare final presentation

November 20th - December 5th

  • Finish report and presentation


References

M. Irenee, X. Hei, Y. Wang, W. Ji and X. Jiang, "Network Flow Analytics: Multi-Class Classification of DDoS Attacks Based on OKNN," 2020 International Conference on Networking and Network Applications (NaNA), 2020, pp. 271-276, doi: 10.1109/NaNA51271.2020.00053.