In this module, we will implement a Support Vector Machine for network intrusion detection in a new Google Colab notebook. Our dataset contains many attributes that help us decide whether a network connection is malicious or benign. The attributes include: duration, protocol_type, service, flag, src_bytes, dst_bytes, land, wrong_fragment, urgent, num_shells, num_access_files, num_outbound_cmds, is_guest_login, num_file_creations, num_root, and many more.
Copy and paste the following link to open Google Colab:
https://colab.research.google.com/notebooks/welcome.ipynb
Then click File --> New notebook.
Click the file name at the top of the notebook and change it to NID Classification SVM.ipynb.
Next, click Runtime --> Change runtime type and set the Hardware accelerator to GPU (it will run faster than CPU).
# importing necessary packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, f1_score, accuracy_score
%matplotlib inline
Download the dataset from Kaggle:
https://www.kaggle.com/sampadab17/network-intrusion-detection
Then upload the CSV file to the Google Colab session storage. Note that this storage is temporary: uploaded files are deleted when the runtime is recycled. Now, read the dataset into the notebook:
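If you prefer to upload programmatically rather than through the Files sidebar, here is a minimal sketch using Colab's built-in files helper (this assumes you have already downloaded NID_Train_data.csv from Kaggle):
# upload a local file into the Colab session storage (/content)
from google.colab import files
uploaded = files.upload()  # opens a file picker; select NID_Train_data.csv
print(list(uploaded.keys()))  # confirm the uploaded file name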
# reading the dataset
dataset = pd.read_csv("NID_Train_data.csv")
dataset.head()
We drop all observations containing missing values. As we have a good amount of data, dropping a few observations won't hurt further analysis.
# dropping observations with missing values if any
dataset = dataset.dropna()
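As a quick sanity check, you can count the missing values and inspect the remaining data; a minimal sketch using the same dataset variable:
# total number of missing values (zero means dropna removed nothing)
print(dataset.isnull().sum().sum())
print(dataset.shape)  # rows and columns remaining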
As this is a supervised classification problem, we separate the data into the dependent variable (y) and the independent variables (feature matrix X).
# separating dependent and independent variables
X = dataset.drop("class", axis=1)
y = dataset["class"]
From the independent variables, we drop some uninformative columns, such as "num_outbound_cmds" and "is_host_login", which carry almost no variation in this dataset.
# dropping uninformative columns
X.drop(['num_outbound_cmds','is_host_login'], axis=1, inplace=True)
X.head()
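One way to verify that these columns are uninformative is to count how many distinct values they actually take; a minimal check on the original dataset:
# columns with (almost) a single unique value carry no signal for classification
print(dataset[['num_outbound_cmds', 'is_host_login']].nunique())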
Since the y variable is categorical, we must encode it as a binary variable with two classes, class 1 (abnormal) and class 0 (normal), to prepare for further analysis.
# encoding y variable
y = np.where(y == "normal", 0, 1)
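It is also worth checking the class balance after encoding; a minimal sketch:
# check class balance (0 = normal, 1 = abnormal)
unique, counts = np.unique(y, return_counts=True)
print(dict(zip(unique, counts)))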
A few of the independent variables are categorical. Before fitting, we must convert these variables to numerical values, so we perform a dummy-variable (one-hot) conversion.
# transforming the categorical variables into dummy variables
X = pd.concat([pd.get_dummies(X.select_dtypes(include= 'object')), X.select_dtypes(exclude= 'object')], axis = 1)
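To see what pd.get_dummies does, here is a toy illustration using hypothetical values of the protocol_type column:
# toy example: each category becomes its own 0/1 indicator column
demo = pd.DataFrame({"protocol_type": ["tcp", "udp", "icmp"]})
print(pd.get_dummies(demo))
# -> columns protocol_type_icmp, protocol_type_tcp, protocol_type_udp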
In the model-development process, we split the data into training and test sets: the training data will be used to train the model, and the test data will be used for model evaluation.
# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)
As the independent variables are on different scales, we standardize them so that all variables are on the same scale (zero mean, unit variance). Note that the scaler is fitted on the training data only and then applied to the test data, which avoids leaking test information into training.
# standardizing the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
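A quick check that the scaling behaved as expected; a minimal sketch over the first few features:
# scaled training features should have ~0 mean and unit variance
print(X_train_scaled.mean(axis=0)[:5].round(3))
print(X_train_scaled.std(axis=0)[:5].round(3))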
Now, to build the model, we first create an SVC (support vector classifier) object, which is used for support vector machine classification:
# SVC MODEL
from sklearn.svm import SVC
svc = SVC(random_state = 42)
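By default, SVC uses the RBF kernel with C=1.0. As an illustrative sketch (the hyperparameter values below are hypothetical, not tuned for this dataset), you could experiment with other settings:
# alternative configurations to experiment with (illustrative values)
svc_linear = SVC(kernel="linear", random_state=42)
svc_rbf_tuned = SVC(kernel="rbf", C=10, gamma="scale", random_state=42)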
In this step, the model is fitted on the scaled training data:
# fitting the model
svc.fit(X_train_scaled, y_train)
We predict on the scaled test data
# predicting
y_pred = svc.predict(X_test_scaled)
After the prediction, we are ready to evaluate the model's performance. First, we look at the confusion matrix:
confusion_matrix(y_test, y_pred)
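Since seaborn and matplotlib are already imported, here is a minimal sketch to visualize the confusion matrix as a heatmap (the tick labels assume the 0/1 encoding from earlier):
# heatmap of the confusion matrix (rows: actual class, columns: predicted)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["normal", "abnormal"], yticklabels=["normal", "abnormal"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()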
To better understand the model's performance, we can check the per-class metrics:
print(classification_report(y_test, y_pred))
Finally, we can also check the overall accuracy score and the F1 score. The F1 score, the harmonic mean of precision and recall, is another way to assess the performance of the model:
f1_score(y_test, y_pred)
accuracy_score(y_test, y_pred)