The dataset in this module is used for computer network intrusion detection.
The dataset has the following attributes: duration, protocol type, service, source bytes, destination bytes, flag, land, wrong fragments, urgent, number of hot indicators, number of failed login attempts, logged in, number of compromised conditions, number of root accesses, number of shell prompts, number of outbound commands, hot login, guest login, number of connections to the same host, percentage of connections that have "SYN" errors, percentage of connections that have "REJ" errors, percentage of connections to the same service, percentage of connections to different services, and percentage of connections to different hosts.
In this module, we will implement neural network algorithms for network DoS detection in a new Google Colab notebook. Our dataset contains many different attributes that will help us decide whether a connection is malicious or benign. In the dataset, the attributes include the following:
duration, protocol_type, service, flag, dst_host_srv_rerror_rate, label, and many more.
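Once the CSV file has been loaded (the upload and loading steps are shown later in this module), you can list the attribute names yourself. A minimal sketch, assuming kddcup99.csv contains a header row with the attribute names:
import pandas as pd

# Print every column name and its inferred type.
# Assumes the file has a header row, as the read_csv call
# later in this module does.
dataset = pd.read_csv("kddcup99.csv", sep=',')
print(dataset.columns.tolist())  # 42 columns: 41 features + the label
print(dataset.dtypes)            # shows which columns are strings vs. numbers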
Copy and paste the following link into your browser to open Google Colab: https://colab.research.google.com
Then click File --> New notebook.
Click the notebook name at the top of the page and change the file name to Neural network algorithms for network DoS detection.
Next, click Runtime --> Change runtime type and set Hardware accelerator to GPU, because training will run much faster on a GPU than on a CPU.
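If you want to confirm that the GPU is actually available to the notebook, you can run a quick check in any code cell. A minimal sketch using TensorFlow, which backs Keras in Colab:
import tensorflow as tf

# An empty list means the runtime is still on CPU; one or more
# entries means the GPU accelerator is active.
print(tf.config.list_physical_devices('GPU'))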
In the first code cell, copy and paste the following code to upload the dataset to Google Colab.
from google.colab import files
# Open a file-chooser dialog in the cell output
uploaded = files.upload()
Then click the run button (it looks like a play button) to run this code cell.
After successful execution of that cell, you should see a Choose Files button in the cell output. Click it and upload the DoS dataset (kddcup99.csv) into Google Colab. This is the link for the kddcup99 dataset:
https://drive.google.com/drive/folders/1-V4PZys9AVvHsCletTcNnEBBCM4h13bX?usp=sharing
(we will find the kddcup99.csv file in the archive after unzipping it).
It might take more than 10 minutes to upload this file because the dataset is extremely large (494,020 samples).
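Once the upload finishes, you can verify that the file arrived intact before moving on. A small sketch, assuming the file was uploaded under the name kddcup99.csv:
import os

# Confirm the uploaded file exists and report its size in megabytes.
print(os.path.exists("kddcup99.csv"))
print(os.path.getsize("kddcup99.csv") / 1e6, "MB")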
Next, create a new cell, copy and paste the following code, and run it.
The purpose of this code is to read the data from the CSV file, convert the string-valued columns to integers, and normalize the dataset.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from keras.utils import to_categorical

dataset = pd.read_csv("kddcup99.csv", sep=',')
dataset.head()

X = dataset.iloc[:, 0:41]  # the 41 feature columns
y = dataset.iloc[:, 41]    # the label column
X = X.to_numpy()
y = y.to_numpy()

le = LabelEncoder()
sc = StandardScaler()
# Encode the three string-valued features as integers:
# column 1 = protocol_type, column 2 = service, column 3 = flag
X[:, 1] = le.fit_transform(X[:, 1])
X[:, 2] = le.fit_transform(X[:, 2])
X[:, 3] = le.fit_transform(X[:, 3])
# Normalize every feature to zero mean and unit variance
X = sc.fit_transform(X)

# Encode class values as integers
encoder = LabelEncoder()
encoded_y = encoder.fit_transform(y)
# Convert integers to dummy (one-hot) variables
dummy_y = to_categorical(encoded_y)
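If the encoding steps above are unclear, here is a toy illustration (the values are hypothetical, not taken from the real dataset) of what LabelEncoder and to_categorical do:
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

labels = ['tcp', 'udp', 'icmp', 'tcp']  # hypothetical protocol_type values
enc = LabelEncoder()
ints = enc.fit_transform(labels)        # [1, 2, 0, 1]: classes sorted alphabetically
print(ints)
print(to_categorical(ints))             # one one-hot row per sample, one column per class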
Create a new cell, copy and paste the following code, and run it.
We split the dataset into a training set and a test set with test_size = 0.1, which means the training set is 90% of the whole dataset and the test set is the remaining 10%.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, dummy_y, test_size=0.1)
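You can confirm the 90/10 split by printing the array shapes. A quick sketch; with 494,020 total samples, expect roughly 444,618 rows for training and 49,402 for testing:
# Each X row is one connection record; each y row is a one-hot label.
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)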
After feature scaling, we will build a neural network and train it.
Each epoch means one complete pass of the entire training set through the model; with 444,618 training samples and a batch size of 64, one epoch is about 6,948 weight-update steps.
Create a new cell, copy and paste the following code, and run it.
import keras
from keras.models import Sequential
from keras.layers import Dense

# Build the neural network: 41 input features, two ReLU hidden layers,
# and a softmax output over the 23 label classes
model = Sequential()
model.add(Dense(16, input_dim=41, activation='relu'))
model.add(Dense(12, activation='relu'))
model.add(Dense(23, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
c = model.fit(X_train, y_train, epochs=5, batch_size=64)
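The fit() call returns a history object (stored in c above), so you can review how the loss and accuracy evolved across the five epochs. A minimal sketch (note: on older standalone Keras versions the accuracy key may be 'acc' instead of 'accuracy'):
# c.history maps each metric name to one value per epoch.
for epoch, (loss, acc) in enumerate(zip(c.history['loss'],
                                        c.history['accuracy']), start=1):
    print("epoch {}: loss={:.4f}, accuracy={:.4f}".format(epoch, loss, acc))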
After training, we convert the one-hot encoded labels back to class indices so they match the model's predictions. Finally, we run our prediction on the test set and report the accuracy.
import numpy as np
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)

# Convert the softmax predictions to class labels
pred = list()
for i in range(len(y_pred)):
    pred.append(np.argmax(y_pred[i]))

# Convert the one-hot encoded test labels back to class labels
test = list()
for i in range(len(y_test)):
    test.append(np.argmax(y_test[i]))

accuracy = accuracy_score(test, pred)
print("The accuracy is: {:.2%}".format(accuracy))
As we can see, the accuracy is 99.89%.
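Because the KDD Cup 99 labels are highly imbalanced (a few attack types such as smurf and neptune, plus normal traffic, make up most of the samples), overall accuracy alone can be misleading. A sketch of a per-class breakdown, reusing the test and pred lists from the previous cell and the encoder fitted earlier:
from sklearn.metrics import classification_report

# encoder.classes_ holds the original string labels in encoded order.
print(classification_report(test, pred,
                            labels=list(range(len(encoder.classes_))),
                            target_names=encoder.classes_,
                            zero_division=0))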