The Data
We used the IoT-23 dataset, created by the Stratosphere Laboratory at CTU University in Prague, Czech Republic. The dataset comprises 23 captures of real IoT network traffic collected between 2018 and 2019 and was published in January 2020. The devices whose traffic was captured include an Amazon Echo, a Philips Hue smart bulb, and a Somfy smart door lock. Because the full dataset is vast (over 21 gigabytes), we worked with a sample of it. The sample represented roughly 20% of the total data: over 8 million rows of network traffic and 20 columns ranging from IP addresses to connection durations. Of those 8 million rows, 91% were labeled benign and 9% malicious.
Data Engineering
While the dataset was mostly clean, it had some minor issues. We first converted and formatted the data types of the various features to prepare them for the models we intended to use. We dropped columns that were primarily null; for columns with fewer nulls, we imputed the missing values with the column mean or median. We also normalized a subset of our features and encoded categorical variables as numeric values. Finally, we split our sample 70/30 into training and testing sets.
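The cleaning pipeline above can be sketched with pandas and scikit-learn. This is a minimal illustration on a tiny made-up DataFrame, not our actual pipeline; the column names (`duration`, `orig_bytes`, `proto`, `label`) and the 80% null-drop threshold are assumptions for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical miniature stand-in for the IoT-23 sample; column names
# and values are invented for illustration only.
df = pd.DataFrame({
    "duration": [0.5, None, 2.1, 0.3, None, 1.7],
    "orig_bytes": [100, 250, None, 80, 120, 300],
    "proto": ["tcp", "udp", "tcp", "tcp", "udp", "tcp"],
    "mostly_null": [None, None, None, 1.0, None, None],
    "label": ["Benign", "Malicious", "Benign", "Benign", "Malicious", "Benign"],
})

# Drop columns that are primarily null (here: more than 80% missing).
df = df.loc[:, df.isna().mean() <= 0.8]

# Impute remaining nulls with the column median.
for col in ["duration", "orig_bytes"]:
    df[col] = df[col].fillna(df[col].median())

# Encode the categorical feature numerically and binarize the target.
df["proto"] = df["proto"].astype("category").cat.codes
df["label"] = (df["label"] == "Malicious").astype(int)

# Normalize the numeric features to the [0, 1] range.
scaler = MinMaxScaler()
df[["duration", "orig_bytes"]] = scaler.fit_transform(
    df[["duration", "orig_bytes"]]
)

# 70/30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="label"), df["label"], test_size=0.3, random_state=42
)
```

On the real 8-million-row sample the same operations apply, just with the drop threshold and imputation strategy chosen per column.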
Model Selection
Since our target variable was a binary label (malicious or benign), we compared several classification algorithms: a Logistic Regression model with over/under sampling, Random Forests, and XGBoost. Our main goal was to minimize false negatives, since wrongly classifying malicious traffic as benign can have serious adverse effects, so our models had a slight bias toward false positives. XGBoost was the best-performing model, followed by Random Forests and then Logistic Regression. The XGBoost model misclassified only 12 of nearly 2.5 million rows as benign and achieved an accuracy of 99.99%.
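The evaluation loop can be sketched as follows. This is a toy sketch on synthetic data with the document's 91/9 class ratio, not our actual experiment: `class_weight="balanced"` stands in for the over/under sampling we used, and we show scikit-learn models only (xgboost's `XGBClassifier` follows the same `fit`/`predict` API but may not be installed everywhere).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the engineered features: ~91% benign (0),
# ~9% malicious (1), with the malicious class shifted so the toy
# models have signal to learn.
rng = np.random.default_rng(0)
n = 2000
y = (rng.random(n) < 0.09).astype(int)
X = rng.normal(size=(n, 4)) + y[:, None] * 3.0

# 70/30 split, mirroring the write-up.
split = int(0.7 * n)
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

# class_weight="balanced" approximates over/under sampling by
# reweighting the minority (malicious) class, pushing the model
# away from false negatives.
models = {
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
    print(f"{name}: false negatives={fn}, false positives={fp}")
```

Ranking models by false negatives first, and overall accuracy second, is what led us to prefer XGBoost in the actual project.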