Malware is a serious and ongoing problem, with bad actors constantly developing and releasing new variants of malicious software. In order to protect computer systems and networks from damage, it is essential to quickly identify and classify this malware. This project seeks to compare the accuracy of various machine learning algorithms in classifying malware using the UNSW-NB15 data set created by Nour Moustafa and Jill Slay at the University of New South Wales, Canberra, Australia [1,2].
Data and files for this project can be found at: https://github.com/pvan1/DATA-606-Capstone
Malicious software in various forms has long threatened computer systems and networks; in the ongoing struggle with cybersecurity researchers, bad actors continually adapt and develop new forms of malware [3]. Malware can cause serious damage to systems, resulting in loss of confidentiality, integrity, and availability. According to Verizon’s 2020 Data Breach Investigations Report, some form of malware was involved in 17% of data breaches experienced by organizations [4]. Protecting against malware requires the ability to quickly and accurately identify and classify it.
Historically, most malware detection systems have been signature-based [3]. This type of classification requires a specific example of malware to be analyzed by researchers, who create a signature file that can be used to identify that particular malware in the future [3]. A major drawback of signature-based classification is that it requires the malware to be analyzed in advance, meaning that novel malware which has never been seen before will go unidentified [3].
A second approach to malware classification is behavior- or anomaly-based. This technique involves training a machine learning model on baseline, normal system behavior. Having learned the baseline, the classifier can recognize abnormal behavior and flag it as suspicious [7]. The advantage of this method is the ability to detect previously unseen malware [7].
Whether using a signature- or anomaly-based approach, the analysis can be static, dynamic, or hybrid [7]. Static analysis involves investigating executable files without running them, for example by examining the file’s header and strings [3]. Dynamic analysis is conducted by executing a program and examining its behavior, including API calls, DLL usage, registry activity, or network traffic [3]. Static and dynamic methods can also be combined for a hybrid approach.
Souri and Hosseini conducted a survey of malware detection approaches and found many examples of signature- and anomaly-based detectors using dynamic, static, and hybrid analysis and performing classification with a variety of machine learning algorithms [6]. Among the signature-based approaches, they found methods using dynamic analysis with K-means, Decision Trees, SVMs, and Naive Bayes on datasets of DLL calls, API calls, and n-grams; these methods ranged in accuracy from 88% to 99% [6]. They also found accuracies of 86% to 99% among the anomaly-based approaches surveyed, which used algorithms such as Random Forests, graph search, Decision Trees, and Logistic Regression, among others [6]. The majority of approaches, especially the anomaly-based ones, relied on dynamic analysis rather than static or hybrid analysis [6].
In one experiment, Sharma et al. used opcodes to classify malware, first testing several feature selection methods to determine which features contributed most to the classification, and then evaluating those features with several machine learning algorithms [5]. Their experiment achieved 100% accuracy using Random Forests, Logistic Model Trees, Naive Bayes Trees, and J48 Graft [5].
This project seeks to investigate a dynamic approach to an anomaly-based malware classification system. The data set used for this exploration consists of network traffic data and classification labels, collected and prepared by Nour Moustafa and Jill Slay at the University of New South Wales [1,2]. Because the data has been pre-processed, little cleaning should be necessary, but some transformations may be needed for better compatibility with machine learning algorithms. The proposed methodology is to compare several feature selection algorithms to identify the most useful features of the finalized dataset, and then to use the best features to compare the accuracy of various machine learning classification algorithms, such as k-nearest neighbors, random forests, and neural networks [actual models TBD].
We'll begin this project with an exploration of the dataset. A quick look at the first few rows will give an impression of what we've got.
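A minimal sketch of this first look, assuming the training split is available locally as a CSV (the file name below is an assumption based on the names used in the published data set):

```python
import pandas as pd

# Load the UNSW-NB15 training split; adjust the path/file name as needed.
train = pd.read_csv("UNSW_NB15_training-set.csv")

# Shape and a peek at the first few rows.
print(train.shape)
print(train.head())
```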
Many of the features are truncated in this view, but we can see that in addition to the numerical values representing aspects of network traffic flow, we have several categorical features that we'll want to deal with later, as well as what appears to be a binary label feature and an attack category that we'll likely use for multiclass classification.
Taking a look at the distribution of labels and classes, we find that the training set contains about 56,000 instances of normal traffic and 120,000 instances of malicious traffic.
We can see further how that malicious traffic breaks down into specific categories of attack, with the majority being generic attacks and exploits and very few worms.
Here we can see the full description of the dataset. We have 175,341 non-null instances in this training set. We can also see the full listing of the network traffic features, including a number of generated features, as described in [1].
Since none of these features seems to have an intrinsic order to its values, we'll probably want to expand those categorical features into one-hot vectors for later use in ML. Let's first have a look at the unique values of each feature.
There are 13 services, 9 states and 133 unique protocols. Expanding each of these features into separate binary features for each value will increase our number of features to around 200. That's probably not going to be too bad, but we'll investigate some feature reduction later on.
We'll calculate the Pearson correlation coefficient of each feature with the binary label to get a basic idea, for now, of how these features relate to the target, and visualize a few of the stronger correlations.
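A sketch of that calculation with pandas, continuing from the load above and assuming the binary target column is named label as in the published data set:

```python
# Pearson correlation of each numeric feature with the binary label,
# sorted by absolute strength.
numeric = train.select_dtypes(include="number")
corr_with_label = numeric.corr(method="pearson")["label"].drop("label")
print(corr_with_label.abs().sort_values(ascending=False).head(10))
```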
For now it looks like we have a few features that are reasonably correlated with the label; let's investigate a little further.
We'll next explore several feature selection methods, starting with a closer look at feature correlation.
To better visualize the correlation between features we can check a heat map of the correlation matrix. We can see that there are a number of features that are moderately or strongly correlated with each other, suggesting redundant information that we may want to remove before training any models on this data.
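One way to produce such a heat map, sketched here with seaborn; the figure size and color map are illustrative choices rather than the report's exact settings:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Heat map of the feature-to-feature Pearson correlation matrix.
corr = numeric.corr(method="pearson")
plt.figure(figsize=(14, 12))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Feature correlation matrix")
plt.tight_layout()
plt.show()
```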
The following features have a Pearson Correlation Coefficient of 0.7 or greater with another feature: ct_dst_ltm, ct_dst_src_ltm, ct_ftp_cmd, ct_src_dport_ltm, ct_src_ltm, ct_srv_dst, ct_srv_src, dbytes, dloss, dtcpb, dttl, dwin, sbytes, sinpkt, sloss, stcpb, synack, tcprtt.
We'll experiment with removing those features to improve model accuracy.
We also calculate the covariance of the features. We're looking to see if any features vary together and are linearly correlated [10]. Checking the covariance matrix heat map reveals that two features, stcpb and dtcpb, have a strong positive covariance. These features represent the source and destination TCP sequence numbers, so they may be providing similar information, and it may not be necessary to include both in the final data set.
Next, we continue investigating feature selection by calculating the information gain, or mutual information, of each feature with the binarized version of the attack category. First we make the calculation with only the initial numerical features (not including the expanded, one-hot, categorical features). We can see a few standouts: sbytes, sload, and smean. These features represent the number of bytes transmitted from the source, the number of bits per second, and the mean packet size transmitted from the source to the destination [1]. We also see a few features that apparently contribute little to the classification, including is_ftp_login, ct_ftp_cmd, and is_sm_ips_ports. These features indicate whether the session was FTP and whether the source and destination used the same IP address and port [1].
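A sketch of that calculation with scikit-learn's mutual_info_classif; the integer encoding of attack_cat below is an assumption standing in for the binarized target described above:

```python
from sklearn.feature_selection import mutual_info_classif

# Original numeric features only (no expanded categoricals), with the
# id and binary label columns dropped if present.
X_num = train.select_dtypes(include="number").drop(columns=["id", "label"], errors="ignore")
y_attack = train["attack_cat"].astype("category").cat.codes

# Mutual information (information gain) of each feature with the attack category.
mi = mutual_info_classif(X_num, y_attack, random_state=42)
mi_scores = pd.Series(mi, index=X_num.columns).sort_values(ascending=False)
print(mi_scores.head(10))
```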
Including the expanded categorical features reveals that most of those features provide very little or no information gain on the attack category, suggesting that the service, state, and protocol features may not be useful. Note that the values on this graph are on a logarithmic scale to show more features.
For comparison we look at the ANOVA (analysis of variance) F-value of features calculated against the binarized attack category. Again, this test includes only the original numeric features and not the expanded categories.
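The equivalent sketch with f_classif, reusing X_num and y_attack from the mutual-information step:

```python
from sklearn.feature_selection import f_classif

# ANOVA F-value of each original numeric feature against the attack category.
f_vals, _ = f_classif(X_num, y_attack)
f_scores = pd.Series(f_vals, index=X_num.columns).sort_values(ascending=False)
print(f_scores.head(10))
```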
Continuing the comparison, we check the F-values for all of the expanded categorical features. We see that many features have an F-value near 10, while a few are in the tens of thousands.
We'll try a couple more methods to reduce the number of features so that we can experiment with model accuracy. First we'll use a simple variance threshold method to select only those features with high variance; in this case the threshold is set to 0.8. The rationale for using this method is the assumption that features with little or no variance will not provide much information [8]. The following features were found to have high variance in this data set.
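A sketch of that selection with scikit-learn's VarianceThreshold, again reusing the numeric feature frame from above:

```python
from sklearn.feature_selection import VarianceThreshold

# Keep only the features whose variance exceeds the 0.8 threshold.
selector = VarianceThreshold(threshold=0.8)
selector.fit(X_num)
high_variance_features = list(X_num.columns[selector.get_support()])
print(high_variance_features)
```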
Another way we can measure the variance of the features is with the Median Absolute Deviation. This statistic is said to be similar to standard deviation, but more robust to outliers [9]. Again, the assumption is that features with higher variability will provide more information. We see a few features with a high median absolute deviation, and the majority with very small or zero MAD.
Visualizing makes it a little clearer. The values are shown in log scale.
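The MAD itself can be computed directly in pandas as the median of each feature's absolute deviations from its own median; a short sketch:

```python
# Median Absolute Deviation of each numeric feature; larger values indicate
# greater spread while remaining robust to outliers.
mad_scores = (X_num - X_num.median()).abs().median().sort_values(ascending=False)
print(mad_scores.head(10))
```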
As we move on to ML modeling, we'll collect the top features suggested by each feature selection method for experimentation.
Training a model on each subset of features will allow us to compare accuracy of classification, and hopefully arrive at the best possible model.
Before moving on to an exploration of ML models, a classmate suggested taking a look at the Recursive Feature Elimination algorithm. It seemed a worthwhile suggestion, and Scikit-Learn's Recursive Feature Elimination with Cross Validation function was used to select the best number of features [11, 12]. The algorithm identified 165 features as the optimal subset, but there are a couple of issues with that assessment. First, the graph is 'spiky'; if we smoothed it out, the peak would likely come somewhere between 50 and 75 features, plateauing after that. Second, 165 features is the majority of the input features and doesn't really help in terms of feature reduction. For those reasons we may want to take a smaller optimal set, but for now we'll leave it as is.
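A sketch of that selection with scikit-learn's RFECV; the estimator, step size, and scoring below are assumptions rather than the exact configuration used to produce the graph described above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Expand the categorical features so RFECV sees the full ~200-column matrix.
X_full = pd.get_dummies(train.drop(columns=["id", "attack_cat", "label"], errors="ignore"),
                        columns=["proto", "service", "state"])

estimator = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rfecv = RFECV(estimator, step=5, cv=3, scoring="accuracy", n_jobs=-1)
rfecv.fit(X_full, y_attack)

print("Optimal number of features:", rfecv.n_features_)
rfe_features = list(X_full.columns[rfecv.support_])
```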
Let's start to prepare the dataset for ML. We'll transform the categorical features into separate binary features and separate the labels and attack categories from the predictor features. We'll also apply a Standard Scaler from the Scikit Learn library, since our features are on different scales and so have a wide range of values.
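A sketch of that preparation, continuing from the expanded matrix built for the RFECV step:

```python
from sklearn.preprocessing import StandardScaler

# Separate the binary label and the attack category from the predictors.
y_label = train["label"]
y_attack = train["attack_cat"].astype("category").cat.codes

# Standardize every feature to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X_full), columns=X_full.columns)
print(X_scaled.describe().T[["mean", "std"]].head())
```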
Feature statistics before standard scaling.
Feature statistics after standard scaling.
Now that we have scaled the data and identified some likely feature subsets, let's start exploring some models.
We start with a fairly simple model, Naive Bayes. This model has no hyper-parameters to tune, so it was simply trained on each subset of features (using all instances in the training set) and evaluated with 5-fold cross validation. At this stage predictions are made on the training set, using Scikit-Learn's cross_val_predict function with 5 folds. A classification report was generated and confusion matrix plotted for each feature subset, and these were compared. The confusion matrix plot is normalized to account for the skewed data set. Because there are 10 target classes to predict, we also look at the F1 score of each class and the weighted average of F1 scores of all classes.
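A sketch of that evaluation for a single subset; info_gain_features is a hypothetical list holding the top features chosen by mutual information above:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report, confusion_matrix

# 5-fold cross-validated predictions of the attack category on one feature subset.
X_subset = X_scaled[info_gain_features]
y_pred = cross_val_predict(GaussianNB(), X_subset, y_attack, cv=5)

# Per-class precision/recall/F1 plus a row-normalized confusion matrix.
print(classification_report(y_attack, y_pred))
cm = confusion_matrix(y_attack, y_pred, normalize="true")
```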
The features selected based on Information Gain produced an accuracy of 50% with a standard deviation of 0.05 and a weighted average f1-score of the target classes of 0.55. The model seems to do well predicting Generic malware and Shellcode, but tends to misclassify others as Backdoor or Shellcode.
The ANOVA feature set shows similar results. Accuracy = 51%, Standard Deviation = 0.09. Average weighted f1-score of classes is 0.56. The confusion matrix is also similar, showing most misclassifications in the Shellcode and Backdoor labels.
The features selected with a simple Variance Threshold produce significantly worse results. Accuracy = 29%, Std Dev = 0.078. Average weighted f1-score of classes is 0.35. Almost everything is misclassified as Shellcode here.
The Median Absolute Deviation feature subset shows similar results. Accuracy = 50%, Std Dev = 0.06. Average f1-score = 0.56. Again we see most instances misclassified as Shellcode, although it does well on Generic instances, unlike the variance threshold set.
The features selected by correlation coefficient don't produce impressive results either. Accuracy = 40%, Std Dev = 0.02. Average f1-score = 0.42. Generic and Shellcode fare well, while most others are misclassified under these two labels.
The Recursive Feature Elimination optimal set also fares poorly. Accuracy = 23%, Std Dev = 0.03. Average f1-score = 0.18. It seems to do well with Shellcode and Worms, but nearly all Generic instances are misclassified as Normal.
A comparison of the individual class F1 scores for each feature subset. Most models seem to accurately predict Generic attacks and Normal traffic, but struggle on other classes. These scores have not been normalized.
The accuracy and weighted average of class F1 scores for each subset of features. As seen previously, Info Gain, ANOVA and MAD perform best with accuracy ~50% and average F1 ~0.55.
Recursive feature elimination performs particularly poorly here, most likely because its feature set is simply too large for this simple model. There are variants of Naive Bayes that might perform better on this data set [11, 13] if an alternate scaling algorithm were used to map values between 0 and 1, but in general we judge that this model is not powerful enough for this data set, so we'll move on for now.
The various feature subsets are compared on the K nearest neighbors model. First the Info Gain subset is tested with several k-values, keeping scikit-learn's default values for the other hyper-parameters (metric='minkowski', weights='uniform'). After training, predictions were made with cross_val_predict to evaluate each model on the training set.
3 neighbors produces Accuracy = 73% and weighted average f1-scores = 0.75. The confusion matrix looks pretty good here, most of the misclassifications appear to be in the Analysis, Backdoor and DoS labels.
5 neighbors looks similar to 3; Accuracy = 75%, average f1-scores = 0.76. Misclassifications are more concentrated here, mostly under Exploits.
7 neighbors looks very similar to 5. Accuracy = 75%, average f1 = 0.76. Most labels seem accurately predicted except Analysis, Backdoor, DoS and Worms. Mostly misclassified as DoS or Exploits.
11 neighbors is much the same as 5 and 7. Accuracy = 75%, average f1-scores = 0.76. Misclassifications look similar, with a slightly higher likelihood of being mislabeled Exploit.
At this point it was decided to use a grid search to find an optimal set of hyper-parameter values and use those values to evaluate the remaining feature subsets.
Scikit-Learn's cross-validated grid-search was used to select the values for k, distance and weight, using 3 folds on the Info Gain feature subset and all training instances.
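A sketch of that search; the exact parameter grid is an assumption, chosen to cover the values discussed above:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# 3-fold cross-validated grid search over k, distance metric, and weighting
# on the Info Gain feature subset.
param_grid = {
    "n_neighbors": [3, 5, 7, 11],
    "metric": ["minkowski", "manhattan"],
    "weights": ["uniform", "distance"],
}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3,
                    scoring="accuracy", n_jobs=-1)
grid.fit(X_scaled[info_gain_features], y_attack)
print(grid.best_score_, grid.best_params_)
```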
The best model returned by the grid-search, with a score of 75% accuracy, uses k=11, metric='manhattan', and weights='uniform'.
Changing the distance metric to 'manhattan' doesn't have much effect on the 11 neighbors model with the Info Gain features. Accuracy = 76%, average f1-score = 0.76.
The ANOVA features fare slightly worse. Accuracy = 69%, average f1 = 0.69. In addition to Analysis, Backdoor and DoS, Reconnaissance, Shellcode and Worms are now less likely to be classified correctly.
The Variance Threshold features show Accuracy = 76%, average f1-score = 0.76, but Analysis, Backdoor and DoS are still likely to be misclassified as Exploits, and Worms as Exploits or Fuzzers.
The MAD features produce Accuracy = 75%, average f1-score = 0.76. Seems better at predicting some labels, but worse at Worms. Analysis, DoS, and Backdoor remain a challenge.
Correlation coefficient selected features look about the same as MAD. Accuracy = 75%, average f1-score = 0.75. Analysis, DoS, and Backdoor are consistently split between DoS and Exploits, and Worms are usually misclassified as Exploits.
The Recursive Feature Elimination set gives Accuracy = 75%, average f1-score = 0.75. Analysis, Backdoor, DoS and Worms now more likely to be classified as Exploits.
Comparison of individual class F1 scores for each feature subset. All of the tested feature subsets produce similar results. Analysis, Backdoor, Shellcode and Worms are least likely to be classified correctly. The best results are found on Generic attacks and Normal traffic. These values are not normalized.
Again we see that the features selected don't seem to make much difference in terms of model accuracy or weighted average F1 scores, although we know that the model performs fair to poor on about half of the target classes.
Despite the optimal model returned by the grid-search (k=11, distance=manhattan, weights=uniform), the confusion matrix seems to show better results from the first model (k=3, distance=minkowski, weights=uniform).
Let's move on to another model for now.
This exploration uses Scikit-Learn's Linear Support Vector Classification class [15]. A cross-validated grid-search, with 5 folds, suggests the best model uses class_weight='balanced', loss='squared_hinge', dual=False and regularization parameter C=3. This model produced an accuracy score of 68% on the Info Gain feature subset. Each feature subset was evaluated on this model using cross-validated predictions (cross_val_predict) on the training data, with 5 folds.
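A sketch of that evaluation with the grid-searched settings, reusing the hypothetical info_gain_features subset from earlier:

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

# Linear SVC with the settings suggested by the grid search, evaluated with
# 5-fold cross-validated predictions on one feature subset.
svc = LinearSVC(C=3, loss="squared_hinge", dual=False,
                class_weight="balanced", max_iter=15000)
y_pred = cross_val_predict(svc, X_scaled[info_gain_features], y_attack, cv=5)
print(classification_report(y_attack, y_pred))
```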
This model seems to perform poorly on the Info Gain feature set. Accuracy = 66%, average f1-score = 0.69, but the confusion matrix is worse: almost all instances are misclassified as Normal traffic, while Normal is misclassified as Shellcode.
The ANOVA set gives Accuracy = 64%, average f1-score = 0.67, but the confusion matrix looks much better. Analysis and Backdoor are misclassified as DoS, and Reconnaissance as Shellcode, but the others look ok.
The Variance Threshold features don't look great here. Accuracy = 62%, average f1-score = 0.62. Generic attacks and Normal traffic look good. Worms, Fuzzers and Exploits are moderate. Most misclassified as Generic.
The MAD feature set looks the best here: Accuracy = 68%, average f1-score = 0.70. The confusion matrix shows most classes fare well, with Analysis and Backdoor misclassified as DoS, as usual.
Features selected by Correlation coefficient look similar to the ANOVA features. Accuracy = 60%, average f1-score = 0.64. Analysis and Backdoor are still misclassified as Exploits; Reconnaissance and Exploits also don't do well here.
The data subset using Recursive Feature Elimination and all training instances failed to converge after 15,000 iterations and ~5 hours run time. It's likely that the linear kernel classifier is not well suited to this size data set (~175,000 instances, 165 features). It may be worth exploring other kernels.
Comparison of individual class F1 scores for each feature subset. As with KNN, all feature subsets produce similar results, with the Variance Threshold set being notably worse. As we've seen previously, Generic attacks and Normal traffic fare best, with Analysis, Backdoor, Shellcode and Worms faring worst. These values are not normalized.
The features chosen with this model do not seem to make much difference, other than the model's seeming inability to handle a large number of features. Based on Accuracy, Weighted Average F1 scores and confusion matrix, the MAD features seem to come out ahead here.
So far we notice a trend in the individual class F1 score plots that Generic attacks and Normal traffic are well predicted, while Analysis, Backdoors, Shellcode and Worms fare poorly. This is due to the skewed data set and the fact that these plots are not normalized.
A series of Random Forest classifiers were trained on each of the previously identified feature subsets, using all training instances. As before, several cross-validated grid-searches were conducted to determine good values for the hyper-parameters. The hyper-parameters were tuned using the full training set, including all features. Based on the search results, the model used has the following parameter values: RandomForestClassifier(n_estimators=750, max_leaf_nodes=None, class_weight='balanced_subsample', criterion='gini', min_samples_leaf=2, min_samples_split=12). Performing 3-fold cross-validated predictions on the training set with this model resulted in 74% accuracy, and a weighted average f1-score of 0.77.
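A sketch of that model and evaluation with the hyper-parameter values reported above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

# Random Forest with the grid-searched values, evaluated with 3-fold
# cross-validated predictions on the full (scaled) feature set.
rf = RandomForestClassifier(n_estimators=750, max_leaf_nodes=None,
                            class_weight="balanced_subsample", criterion="gini",
                            min_samples_leaf=2, min_samples_split=12, n_jobs=-1)
y_pred = cross_val_predict(rf, X_scaled, y_attack, cv=3)
print(classification_report(y_attack, y_pred))
```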
The results on the full feature set look pretty good. Most of the misclassifications are Analysis and Backdoor attacks, and misclassified instances are most frequently labelled DoS attacks. Overall accuracy is 74%, and weighted average f1-score is 0.77.
Using the same model with only the features identified by Info Gain produces very similar results to the full feature set: 74% Accuracy and 0.77 weighted average f1-score.
And, again, with the ANOVA feature subset the confusion matrix looks much the same, but this time the results are slightly lower. 68% accuracy and 0.71 weighted average f1-score.
Training on the Variance Threshold features produces a similar outcome, Analysis and Backdoor attacks are most likely to be misclassified (usually as DoS). 73% accuracy and 0.76 weighted average f1-score.
The Median Absolute Deviation feature subset also achieves 74% accuracy and 0.77 weighted average f1-score.
Using the features with the highest correlation coefficient also returns 73% accuracy and 0.76 weighted average f1-score, but this time we see misclassifications of Analysis attacks split more evenly between Backdoor and DoS than we have before.
Finally, we see similar results using the best features from Recursive Feature Elimination. 74% accuracy and 0.77 weighted average f1-score. We see a return of DoS as the most frequently applied incorrect label.
This Random Forest model seems to perform pretty well. Using 3-fold cross-validated predictions on the training set yields an average 73% accuracy and 0.76 weighted average f1-score, measured on various feature subsets. Using the full set of features performs slightly better than some feature subsets.
Of the under-represented classes, this model seems to have no trouble identifying Worms and Shellcode attacks, but still struggles with Analysis and Backdoor attacks. Analysis attacks fare worst here.
Comparison of individual class F1 scores for each feature subset. As we noticed in the confusion matrices, this model still does not do well with the under-represented Analysis and Backdoor, but does better with Shellcode and Worms. These values are not normalized.
Comparison of the accuracy and weighted average F1 scores for each feature set. As with some of the other models, the exact features used in training do not seem to make much difference in cross-validated prediction accuracy. This model performs about equally with our K Nearest Neighbors model, and slightly better than Linear SVC.
To begin our exploration of ANNs, we'll look at a simple Sequential model with fully connected layers.
For the initial test we'll use the Info Gain feature subset (20 features). The first model is based on an example in Hands-On Machine Learning [16], using 2 layers with 300 and 100 nodes, ReLU activation functions, and because we're doing multi-class classification, Softmax activation on the output layer. Recognizing that these layers likely have too many nodes for a data set with only 20 features, a second model was trained with only 20 and 15 nodes in its two hidden layers. These models were trained for 30 epochs with an 80/20 training/validation split.
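A sketch of the first of those two networks; the Adam optimizer here is an assumption, since the optimizer is only fixed later during tuning:

```python
from tensorflow import keras

# Two hidden layers of 300 and 100 ReLU units and a 10-class softmax output,
# trained on the 20 Info Gain features with an 80/20 train/validation split.
model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])

history = model.fit(X_scaled[info_gain_features].values, y_attack.values,
                    epochs=30, validation_split=0.2)
```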
This model achieves about 78% accuracy on the training set, which seems good. But there's something strange happening, because the model performs better on the validation data than it did on the training data.
Reducing the number of nodes in the hidden layers produces roughly similar results. The accuracy looks good, but the model is performing better on the validation data than the training data.
Investigation of this phenomenon suggests that the problem may be the result of validation data that is not representative of the full data set. On our traditional ML models we used k-fold cross-validation rather than a training/validation hold-out split. Let's implement cross-validation here as well and see if that produces results more like what we would expect.
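A sketch of that cross-validation, wrapping the smaller (20/15 node) network in a manual StratifiedKFold loop; build_model is a hypothetical helper returning a freshly compiled copy of the network:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow import keras

def build_model(n_inputs=20, n_classes=10):
    # Fresh copy of the smaller network for each fold.
    m = keras.Sequential([
        keras.layers.Input(shape=(n_inputs,)),
        keras.layers.Dense(20, activation="relu"),
        keras.layers.Dense(15, activation="relu"),
        keras.layers.Dense(n_classes, activation="softmax"),
    ])
    m.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])
    return m

X_arr = X_scaled[info_gain_features].values
y_arr = y_attack.values
scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                          random_state=42).split(X_arr, y_arr):
    fold_model = build_model()
    fold_model.fit(X_arr[train_idx], y_arr[train_idx], epochs=30, verbose=0)
    _, acc = fold_model.evaluate(X_arr[val_idx], y_arr[val_idx], verbose=0)
    scores.append(acc)

print("Mean CV accuracy:", np.mean(scores))
```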
For the first round of hyperparameter tuning, a combination of cross-validated randomized search and grid search was used [17, 18]. Due to time and processing constraints, each hyperparameter was tuned separately. In this initial round of searching, the hyperparameters investigated were the optimization algorithm, learning rate, batch size, and number of epochs. The hyperparameters kept fixed during this testing were the loss function (sparse categorical cross-entropy, because we're performing multiclass classification with integer labels), the activation function (ReLU for hidden layers, Softmax for output), the number of hidden layers (four), and the nodes per layer (starting just below the number of inputs and decreasing to the number of output nodes).
In order to make a direct comparison to the previous exploration of traditional ML models, an ANN was trained for each of the five feature subsets identified during initial exploration (info gain, ANOVA, variance threshold, MAD, and correlation), as well as the optimal features identified by the Random Forest classifier. Because these feature subsets are of different sizes, two networks were tuned separately for the two feature sizes.
Tuning of the optimal feature subset suggested the Adamax optimizer with a learning rate of 0.002, trained using a batch size of 20 for 20 epochs. A model with these values was trained on all training instances, and cross-validated predictions were made with 10 folds.
The classification report from these predictions shows an accuracy of 81% and a weighted average f1-score of 0.78. As we would expect, the less-represented classes are not predicted as well as those with an abundance of instances.
Tuning for these feature subsets also suggests the Adamax optimizer with a learning rate of 0.002. This time the model was trained with a batch size of 40 for 50 epochs. Again a model was trained with these values using all training instances, and 10-fold cross-validated predictions were made.
Classification report for cross-validation predictions of the model trained on the Info Gain features. 80% accuracy and 0.77 weighted average f1-score. Standard Scaled and un-scaled confusion matrices for these predictions are shown at right.
Classification report for the ANOVA feature predictions shows 76% accuracy and 0.74 weighted average f1-score. Confusion matrices shown at right. With these features the model seems unable to classify Backdoor and Shellcode attacks.
The Variance Threshold features also show 75% accuracy and 0.74 weighted average f1-score. The confusion matrix at right shows the model trained with these features also cannot classify Backdoor attacks.
The Median Absolute Deviation features produce 80% accuracy and a 0.77 weighted average f1-score. The scaled confusion matrix to the right appears similar to that of the Info Gain features and the optimal Random Forest features.
The features selected by Correlation Coefficient achieve 79% accuracy and 0.77 weighted average f1-score. As with the ANOVA and Variance Threshold feature sets, this one seems unable to classify Backdoor attacks.
As has been the case with the traditional ML models we've explored, the six feature subsets used here produce very similar results. The optimal features identified by the Random Forest classifier seem to slightly outperform the other feature sets chosen. At this point the Sequential Neural Network model produces general results slightly better than those of the Random Forest and K-Nearest Neighbor classifiers, although it underperforms the Random Forest on DoS, Shellcode and Worm attacks. Hopefully with further tuning we can achieve better than 80% accuracy.
After individually hand-tuning each hyperparameter through grid-searching, the decision was made to use Optuna [20] to tune multiple hyperparameters together in the hope that better results could be obtained.
A model using the Random Forest optimum feature set was tuned first; the loss function was set to sparse_categorical_crossentropy, and the optimizer function to Adamax. The model was defined with four dense layers, having 48, 36, 24, and 16 nodes respectively. Each dense layer was followed by a drop-out layer. Activation function, drop-out rate, weight-constraint, learning rate, batch size and number of epochs were tuned together for a total of 100 trials. For these trials the model was fit to a 70/30 stratified train/validation split of the optimum features data set. This resulted in an accuracy of 80% with the following hyperparameter values and importances:
Checking the hyperparameter importances reveals that in this model the drop-out rate is by far the most important contributor to the model's accuracy.
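For reference, a sketch of how a study like this can be set up with Optuna; the search ranges are assumptions, and X_opt is a hypothetical frame holding the Random Forest optimum feature subset:

```python
import optuna
from sklearn.model_selection import train_test_split
from tensorflow import keras

# 70/30 stratified train/validation split of the optimum-feature data.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_opt, y_attack, test_size=0.3, stratify=y_attack, random_state=42)

def objective(trial):
    # Hyperparameters tuned together, as described above.
    activation = trial.suggest_categorical("activation", ["relu", "elu", "tanh"])
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    max_norm = trial.suggest_int("weight_constraint", 1, 5)
    lr = trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [20, 40, 80])
    epochs = trial.suggest_int("epochs", 10, 50)

    # Four dense layers (48/36/24/16 nodes), each followed by a drop-out layer.
    model = keras.Sequential([keras.layers.Input(shape=(X_opt.shape[1],))])
    for units in (48, 36, 24, 16):
        model.add(keras.layers.Dense(units, activation=activation,
                                     kernel_constraint=keras.constraints.MaxNorm(max_norm)))
        model.add(keras.layers.Dropout(dropout))
    model.add(keras.layers.Dense(10, activation="softmax"))
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=keras.optimizers.Adamax(learning_rate=lr),
                  metrics=["accuracy"])

    model.fit(X_tr, y_tr, batch_size=batch_size, epochs=epochs, verbose=0)
    _, acc = model.evaluate(X_val, y_val, verbose=0)
    return acc

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_value, study.best_params)
```

The importances reported here can then be inspected with optuna.importance.get_param_importances(study) or Optuna's built-in visualization helpers.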
Next, a test was made on the number of layers and neurons per layer. Using the hyperparameter values listed above, a new study was made for 20 trials. This study found that a model using these values had a validation accuracy of 81% when using four layers with 100 neurons per layer. These results are similar to what was achieved through hand-tuning with grid-search.
For a strict comparison to the other models, Optuna was also used to optimize models trained on the other feature subsets (Info Gain, ANOVA, Variance Threshold, MAD, and Correlation). We start with a model with four dense layers, with 20, 18, 16, and 12 neurons respectively, each followed by a drop-out layer. The loss function is set to sparse_categorical_crossentropy and the optimizer to Adamax. A study of 20 trials on this model produced 79.7% accuracy with the following hyperparameter values and importances:
Again, we see the drop-out rate is by far the most important hyperparameter.
Similar to the network trained on the Random Forest optimum features, this one seems to be hovering right around 80% accuracy. This is also consistent with the 10-fold cross-validation conducted above, using hyperparameter values found via grid-search, even with a different activation function. Due to several classes being underrepresented in the data set, this level of accuracy may be the best that's possible. Interestingly, this level of accuracy seems to be achievable regardless of which subset of features is used. A test was made to see if class weighting would improve the results, but the result was a slightly lower accuracy.
Again, another study was conducted to determine if the number of layers and neurons per layer would impact the results. The hyperparameter values reported above were used for this test. After 20 trials, a validation accuracy of 81% was returned for a model using four layers and 46 neurons per layer.
Finally, a study was made with pruning enabled to tune all hyperparameters together. This study used the sparse_categorical_crossentropy loss function, and all other hyperparameters, including the optimizer, were tuned. The model was fit to a 70/30 stratified train/validation split of the Info Gain feature set. 100 trials were run, and an accuracy of 81.35% was found with the following hyperparameter values and importances:
Once again, we see the drop-out rate is the largest contributor.
We can also visualize the intermediate values of the trials performed.
These hyperparameter values will be used to create and train models for the final evaluation of each feature subset.
Finally, Optuna was used to optimize a model trained on the full feature data set. As with the Random Forest optimum feature model, the loss function was set to sparse_categorical_crossentropy and the optimizer to Adamax. The model was fit to a 70/30 stratified train/validation split of the full feature data set. A study of 100 trials produced 81.64% validation accuracy with the following hyperparameter values and importances:
Here, the learning rate, followed by the drop-out rate, is the most important hyperparameter.
Plotting the intermediate values gives us an idea how each branch of the study performed.
The best parameter values found will be used to train a model for final evaluation of the full feature data set.
This section presents the final hyperparameter values and network architectures for each ML model. A copy of each model was trained on each of the six feature subsets, using all training instances. Each model was then used to make predictions on the corresponding test set. The traditional models were coded in Python with Scikit-Learn, and the Neural Networks with TensorFlow/Keras. Any parameters not shown used default values.
GaussianNB - no parameters
KNeighborsClassifier(n_neighbors=11, weights='uniform', metric='manhattan')
LinearSVC(C=3, loss='squared_hinge', max_iter=15000, dual=False, class_weight='balanced')
RandomForestClassifier(n_estimators=750, max_leaf_nodes=None, class_weight='balanced_subsample', criterion='gini', min_samples_leaf=2, min_samples_split=12)
Below are the results of the test predictions made using the final models. The best results were achieved using the random forest-identified optimum feature set with a Sequential Artificial Neural Network. The hand-tuned model and that tuned with Optuna performed similarly, with the Optuna model winning out slightly with 75.3% accuracy on the test set. These results are in spite of the fact that the data set was skewed, with several target classes being under-represented.
Final metrics for predictions made on the test set for each model and feature subset. These plots show the Accuracy, as well as weighted average Precision, Recall, and F1 score for each combination of model and feature subset.
Confusion matrices for test predictions made on each model and feature subset. These plots give us more information about the performance of each model and feature subset on individual target classes.
Excepting the Naive Bayes model, which is not well suited to this data set, the highest accuracy achieved on nearly every model was with the optimum features identified by the Random Forest.
The final prediction accuracy for each model and feature subset are shown at left. The results for models trained on all data features were not included in the plot above because the full feature set was only used with the Random Forest and ANN model tuned with Optuna.
Machine Learning algorithms are capable of learning network traffic data and classifying types of activity. Normalized confusion matrices of predictions suggest that even those classes which were under-represented and predicted poorly could be correctly classified given enough examples.
While a Sequential Artificial Neural Network produced the best results, some traditional ML algorithms (specifically K-Nearest Neighbors) produced classifications nearly as accurate as the ANN.
Many of the identified features of network traffic and values of categorical features seem to offer little contribution to accurate classification of activity type. The 50 most important features identified by the Random Forest produced better accuracy than either the full feature set or the smaller sets tested on most models; interestingly, the features selected by Information Gain and Correlation Coefficient performed almost as well as the 50 most important features on the K-Nearest Neighbors model.
Using a hyper-parameter tuning tool such as Optuna [20] can produce better models in less time than hand-tuning with grid-search.
Collect more data - this data set was skewed in the number of instances of each target class, and a model cannot learn behavior it hasn't seen. The models tested performed reasonably well with the data available, and I believe collecting more examples of the under-represented classes will allow these models to achieve a very high accuracy.
Explore ensemble methods - Some of the model/feature set combinations tested performed better than others on the under-represented classes. Combining these models in an ensemble would likely produce better results even without having to collect more data.
During this project I have accomplished my goal of comparing the effectiveness of various ML algorithms in learning network traffic behavior and classifying type of activity. I have also achieved my secondary goal of increasing my knowledge and experience of Machine Learning methods, specifically Artificial Neural Networks, and I learned how to optimize models using the Optuna library.
[1] N. Moustafa and J. Slay, “UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set),” 2015 Military Communications and Information Systems Conference (MilCIS), 2015.
[2] N. Moustafa and J. Slay, “The evaluation of Network Anomaly Detection Systems: Statistical analysis of the UNSW-NB15 data set and the comparison with the KDD99 data set,” Information Security Journal: A Global Perspective, vol. 25, no. 1-3, pp. 18–31, 2016.
[3] D. Gibert, C. Mateu, and J. Planes, “The rise of machine learning for detection and classification of malware: Research developments, trends and challenges,” Journal of Network and Computer Applications, vol. 153, p. 102526, 2020.
[4] DBIR Team, “2020 DBIR Summary of Findings,” Verizon Enterprise. [Online]. Available: https://enterprise.verizon.com/resources/reports/dbir/2020/summary-of-findings/. [Accessed: 07-Feb-2021].
[5] S. Sharma, C. Rama Krishna, and S. K. Sahay, “Detection of Advanced Malware by Machine Learning Techniques,” Advances in Intelligent Systems and Computing, pp. 333–342, 2018.
[6] A. Souri and R. Hosseini, “A state-of-the-art survey of malware detection approaches using data mining techniques,” Human-centric Computing and Information Sciences, vol. 8, no. 1, 2018.
[7] N. Idika and A. Mathur, “A Survey of Malware Detection Techniques,” Purdue University, Mar. 2007.
[8] aman1608, “Feature Selection Techniques in Machine Learning,” Analytics Vidhya, 02-Dec-2020. [Online]. Available: https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/. [Accessed: 28-Feb-2021].
[9] “Median absolute deviation,” Wikipedia, 09-Dec-2020. [Online]. Available: https://en.wikipedia.org/wiki/Median_absolute_deviation. [Accessed: 28-Feb-2021].
[10] M. Aggarwal, “Covariance & Correlation,” Medium, 02-Jan-2018. [Online]. Available: https://medium.com/@thecodingcookie/covariance-correlation-def860c4d4ab. [Accessed: 28-Feb-2021].
[11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay, “Scikit-learn: Machine Learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011. [Online]. Available: https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html. [Accessed: 14-Mar-2021].
[12] Scikit-Learn, “sklearn.feature_selection.RFECV,” scikit. [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html. [Accessed: 14-Mar-2021].
[13] Scikit-Learn, “Naive Bayes,” scikit. [Online]. Available: https://scikit-learn.org/stable/modules/naive_bayes.html. [Accessed: 14-Mar-2021].
[14] Scikit-Learn, “sklearn.model_selection.GridSearchCV,” scikit. [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html. [Accessed: 14-Mar-2021].
[15] Scikit-Learn, “sklearn.svm.LinearSVC,” scikit. [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html. [Accessed: 14-Mar-2021].
[16] A. Géron, “Chapter 10: Introduction to Artificial Neural Networks with Keras,” in Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: concepts, tools, and techniques to build intelligent systems, 2nd ed., Sebastopol, CA: O'Reilly Media, Inc., 2019, p. 299.
[17] A. Géron, “Introduction to Artificial Neural Networks with Keras: Fine-Tuning Neural Network Hyperparameters,” in Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: concepts, tools, and techniques to build intelligent systems, 2nd ed., Sebastopol, CA: O'Reilly Media, Inc., 2019, pp. 320–327.
[18] J. Brownlee, “How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras,” Machine Learning Mastery, 27-Aug-2020. [Online]. Available: https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/. [Accessed: 11-Apr-2021].
[19] C. Versloot, “K-fold Cross Validation with TensorFlow and Keras,” MachineCurve, 18-Feb-2020. [Online]. Available: https://www.machinecurve.com/index.php/2020/02/18/how-to-use-k-fold-cross-validation-with-keras/. [Accessed: 11-Apr-2021].
[20] “A hyperparameter optimization framework,” Optuna. [Online]. Available: https://optuna.org/. [Accessed: 25-Apr-2021].