In this module, we will implement a Support Vector Machine for network intrusion detection in a new Google Colab notebook. Our dataset contains many attributes that help us decide whether a network connection is malicious or benign. The attributes include: duration, protocol_type, service, flag, src_bytes, dst_bytes, land, wrong_fragment, urgent, num_shells, num_access_files, num_outbound_cmds, is_guest_login, num_file_creations, num_root, and many more.
Copy and paste the following link to open Google Colab:
https://colab.research.google.com/notebooks/welcome.ipynb
Then click File --> New notebook.
Click the file name at the top of the notebook and change it to NID Classification SVM.ipynb.
Next, click Runtime --> Change runtime type and set the Hardware accelerator to GPU (it will run faster than CPU).
# importing necessary packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, f1_score, accuracy_score
%matplotlib inline
Download the dataset from Kaggle:
https://www.kaggle.com/sampadab17/network-intrusion-detection
Then upload the CSV file to the Google Colab session storage. Note that this storage is temporary: uploaded files are deleted when the runtime is recycled. Now, read the dataset into the notebook:
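If you prefer to upload programmatically rather than through the Files sidebar, here is a minimal sketch using Colab's built-in files helper (this assumes you have already downloaded NID_Train_data.csv from Kaggle):
# upload a local file into the Colab session storage (/content)
from google.colab import files
uploaded = files.upload()  # opens a file picker; select NID_Train_data.csv
print(list(uploaded.keys()))  # confirm the uploaded file name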
# reading the dataset
dataset = pd.read_csv("NID_Train_data.csv")
dataset.head()
We drop all observations containing missing values. As we have a good amount of data, dropping a few observations won't hurt further analysis.
# dropping observations with missing values if any
dataset = dataset.dropna()
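As a quick sanity check, you can count the missing values and inspect the remaining data; a minimal sketch using the same dataset variable:
# total number of missing values (zero means dropna removed nothing)
print(dataset.isnull().sum().sum())
print(dataset.shape)  # rows and columns remaining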
As this is a supervised classification problem, we separate the data into the dependent variable (y) and the independent variables (feature matrix X).
# separating dependent and independent variables
X = dataset.drop("class", axis=1)
y = dataset["class"]
From the independent variables, we drop some uninformative columns, such as "num_outbound_cmds" and "is_host_login", which carry almost no variation in this dataset.
# dropping uninformative columns
X.drop(['num_outbound_cmds','is_host_login'], axis=1, inplace=True)
X.head()
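One way to verify that these columns are uninformative is to count how many distinct values they actually take; a minimal check on the original dataset:
# columns with (almost) a single unique value carry no signal for classification
print(dataset[['num_outbound_cmds', 'is_host_login']].nunique())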
Since the y variable is categorical, we must encode it as a binary variable with two classes, class 1 (abnormal) and class 0 (normal), to prepare for further analysis.
# encoding y variable
y = np.where(y == "normal", 0, 1)
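It is also worth checking the class balance after encoding; a minimal sketch:
# check class balance (0 = normal, 1 = abnormal)
unique, counts = np.unique(y, return_counts=True)
print(dict(zip(unique, counts)))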
A few of the independent variables are categorical. Before fitting, we must convert these variables to numerical values, so we perform a dummy-variable (one-hot) conversion.
# transforming the categorical variables into dummy variables
X = pd.concat([pd.get_dummies(X.select_dtypes(include= 'object')), X.select_dtypes(exclude= 'object')], axis = 1)
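To see what pd.get_dummies does, here is a toy illustration using hypothetical values of the protocol_type column:
# toy example: each category becomes its own 0/1 indicator column
demo = pd.DataFrame({"protocol_type": ["tcp", "udp", "icmp"]})
print(pd.get_dummies(demo))
# -> columns protocol_type_icmp, protocol_type_tcp, protocol_type_udp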
In the model-development process, we split the data into training and test sets: the training data will be used to train the model, and the test data will be used for model evaluation.
# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)
As the independent variables are on different scales, we standardize them so that all variables are on the same scale (zero mean, unit variance). Note that the scaler is fitted on the training data only and then applied to the test data, which avoids leaking test information into training.
# standardizing the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
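A quick check that the scaling behaved as expected; a minimal sketch over the first few features:
# scaled training features should have ~0 mean and unit variance
print(X_train_scaled.mean(axis=0)[:5].round(3))
print(X_train_scaled.std(axis=0)[:5].round(3))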
Now, to build the model, we first create an SVC (support vector classifier) object, which is used for support vector machine classification:
# SVC MODEL
from sklearn.svm import SVC
svc = SVC(random_state = 42)
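By default, SVC uses the RBF kernel with C=1.0. As an illustrative sketch (the hyperparameter values below are hypothetical, not tuned for this dataset), you could experiment with other settings:
# alternative configurations to experiment with (illustrative values)
svc_linear = SVC(kernel="linear", random_state=42)
svc_rbf_tuned = SVC(kernel="rbf", C=10, gamma="scale", random_state=42)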
In this step, the model is fitted on the scaled training data:
# fitting the model
svc.fit(X_train_scaled, y_train)
We predict on the scaled test data
# predicting
y_pred = svc.predict(X_test_scaled)
After the prediction, we are ready to evaluate the model's performance. First, we look at the confusion matrix:
confusion_matrix(y_test, y_pred)
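Since seaborn and matplotlib are already imported, here is a minimal sketch to visualize the confusion matrix as a heatmap (the tick labels assume the 0/1 encoding from earlier):
# heatmap of the confusion matrix (rows: actual class, columns: predicted)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["normal", "abnormal"], yticklabels=["normal", "abnormal"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()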
To better understand the model's performance, we can check the per-class metrics:
print(classification_report(y_test, y_pred))
Finally, we can also check the overall accuracy score and the F1 score. The F1 score, the harmonic mean of precision and recall, is another way to assess the performance of the model:
f1_score(y_test, y_pred)
accuracy_score(y_test, y_pred)