This lab uses Python, pandas, seaborn, Matplotlib, PennyLane and scikit-learn.
You may need to install pandas, seaborn, Matplotlib, PennyLane and/or scikit-learn. From the terminal (macOS), console (Linux) or command prompt (Windows), you can see what is installed on your system with the following command:
pip list
If one or more of the above requirements is missing, it can be installed using pip as follows:
pip install pandas
pip install seaborn
pip install matplotlib
pip install pennylane
pip install scikit-learn
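The same check can be scripted instead of reading the pip list output by eye. The sketch below uses only the standard library. The helper name missing_packages is hypothetical, and note that scikit-learn is imported under the module name sklearn.

```python
import importlib.util

# Import names of the lab's requirements (scikit-learn is imported as "sklearn").
REQUIRED = ["pandas", "seaborn", "matplotlib", "pennylane", "sklearn"]

def missing_packages(names):
    """Return the names that cannot be found by the current Python interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
print("Missing:", missing if missing else "none")
```

Anything reported as missing can then be installed with the pip commands above.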
Write the following code to import the required libraries:
import time
import pandas as pd
import matplotlib.pyplot as plt
import seaborn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.preprocessing import normalize
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import pennylane as qml
from pennylane import numpy as np
Download the data set from https://archive.ics.uci.edu/ml/machine-learning-databases/kddcup99-mld/ (this lab uses the 10% subset, kddcup.data_10_percent).
Load the data set:
path = 'dataset/kddcup.data_10_percent/kddcup.data_10_percent'
df = pd.read_csv(path, header=None)
The CSV file for the data set does not have column headers, so add them:
df.columns = [
'duration',
'protocol_type',
'service',
'flag',
'src_bytes',
'dst_bytes',
'land',
'wrong_fragment',
'urgent',
'hot',
'num_failed_logins',
'logged_in',
'num_compromised',
'root_shell',
'su_attempted',
'num_root',
'num_file_creations',
'num_shells',
'num_access_files',
'num_outbound_cmds',
'is_host_login',
'is_guest_login',
'count',
'srv_count',
'serror_rate',
'srv_serror_rate',
'rerror_rate',
'srv_rerror_rate',
'same_srv_rate',
'diff_srv_rate',
'srv_diff_host_rate',
'dst_host_count',
'dst_host_srv_count',
'dst_host_same_srv_rate',
'dst_host_diff_srv_rate',
'dst_host_same_src_port_rate',
'dst_host_srv_diff_host_rate',
'dst_host_serror_rate',
'dst_host_srv_serror_rate',
'dst_host_rerror_rate',
'dst_host_srv_rerror_rate',
'label'
]
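pandas can also assign the names while reading, through the names parameter of read_csv. The following is a minimal sketch on an in-memory sample; the two rows and the shortened column list are made up for illustration and are not taken from the data set.

```python
import io
import pandas as pd

# Two made-up rows standing in for the headerless data file.
sample = io.StringIO("0,tcp,normal.\n0,icmp,smurf.\n")
columns = ["duration", "protocol_type", "label"]  # shortened list for illustration

# header=None says the file has no header row; names supplies the column names.
demo = pd.read_csv(sample, header=None, names=columns)
print(demo.columns.tolist())
```

Either approach gives the same result; assigning df.columns afterwards, as done above, keeps the loading step and the naming step separate.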
Print some summary information and plot the distributions of a few columns before preprocessing. This makes it easier to decide how to preprocess the data:
print("Read {} rows.".format(len(df)))
print('Datapoints:', df.shape[0])
print('Feature number:', df.shape[1])
label = set(df['label'].values)
print('Label types:', label)
print('Label number:', len(label))
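# Draw each bar chart on its own figure so the plots do not overlap.
plt.figure(figsize=(15,7))
df['label'].value_counts().plot(kind="bar")
plt.show()
plt.figure(figsize=(15,7))
df['protocol_type'].value_counts().plot(kind="bar")
plt.show()
plt.figure(figsize=(15,3))
df['service'].value_counts().plot(kind="bar")
plt.show()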
Drop columns that contain NaN values:
df = df.dropna(axis='columns')
Keep only the columns that have more than one unique value, since a constant column carries no information:
df = df[[col for col in df if df[col].nunique() > 1]]
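To see what this filter does, here is a small sketch on a toy DataFrame. The values are made up for illustration; only the filtering expression matches the line above.

```python
import pandas as pd

toy = pd.DataFrame({
    "num_outbound_cmds": [0, 0, 0],      # constant column: one unique value
    "src_bytes": [181, 239, 235],        # varies, so it is kept
    "label": ["normal.", "normal.", "smurf."],
})

# Same expression as in the lab, applied to the toy frame.
kept = toy[[col for col in toy if toy[col].nunique() > 1]]
print(kept.columns.tolist())  # ['src_bytes', 'label']
```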
Print the correlation between pairs of related columns. Each value lies between -1 and 1. A value of 1 means a perfect positive linear relationship, 0.9 is still a strong correlation, and 0.2 is an example of a weak one.
print(df['num_root'].corr(df['num_compromised']))
print(df['srv_serror_rate'].corr(df['serror_rate']))
print(df['srv_count'].corr(df['count']))
print(df['srv_rerror_rate'].corr(df['rerror_rate']))
print(df['dst_host_same_srv_rate'].corr(df['dst_host_srv_count']))
print(df['dst_host_srv_serror_rate'].corr(df['dst_host_serror_rate']))
print(df['dst_host_srv_rerror_rate'].corr(df['dst_host_rerror_rate']))
print(df['dst_host_same_srv_rate'].corr(df['same_srv_rate']))
print(df['dst_host_srv_count'].corr(df['same_srv_rate']))
print(df['dst_host_same_src_port_rate'].corr(df['srv_count']))
print(df['dst_host_serror_rate'].corr(df['serror_rate']))
print(df['dst_host_serror_rate'].corr(df['srv_serror_rate']))
print(df['dst_host_srv_serror_rate'].corr(df['serror_rate']))
print(df['dst_host_srv_serror_rate'].corr(df['srv_serror_rate']))
print(df['dst_host_rerror_rate'].corr(df['rerror_rate']))
print(df['dst_host_rerror_rate'].corr(df['srv_rerror_rate']))
print(df['dst_host_srv_rerror_rate'].corr(df['rerror_rate']))
print(df['dst_host_srv_rerror_rate'].corr(df['srv_rerror_rate']))
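Rather than printing pairs one at a time, the strongly correlated pairs can be listed programmatically. The sketch below is one possible way to do it; the helper name high_corr_pairs, the 0.9 threshold, and the toy data are assumptions for illustration.

```python
from itertools import combinations
import pandas as pd

def high_corr_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = frame.select_dtypes("number").corr()
    return [
        (a, b, float(corr.loc[a, b]))
        for a, b in combinations(corr.columns, 2)   # each pair once
        if abs(corr.loc[a, b]) >= threshold
    ]

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],    # exactly 2 * a, so perfectly correlated with a
    "c": [1.0, -1.0, 1.0, -1.0],  # only weakly correlated with a and b
})
pairs = high_corr_pairs(toy)
print(pairs)
```

On the lab's DataFrame the call would be high_corr_pairs(df); for each reported pair, one of the two columns is a candidate for dropping.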
Print a heatmap to visualize the correlations:
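# Restrict to numeric columns; the text columns have not been mapped to numbers yet.
corr = df.corr(numeric_only=True)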
plt.figure(figsize=(15,12))
seaborn.heatmap(corr)
plt.show()
Drop the columns that are highly correlated with another column:
df.drop('num_root', axis = 1, inplace = True)
df.drop('srv_serror_rate', axis = 1, inplace = True)
df.drop('srv_rerror_rate', axis = 1, inplace=True)
df.drop('dst_host_srv_serror_rate', axis = 1, inplace=True)
df.drop('dst_host_serror_rate', axis = 1, inplace=True)
df.drop('dst_host_rerror_rate', axis = 1, inplace=True)
df.drop('dst_host_srv_rerror_rate', axis = 1, inplace=True)
df.drop('dst_host_same_srv_rate', axis = 1, inplace=True)
We don't need "service" from the data:
df.drop('service', axis = 1, inplace= True)
Map the categorical features to integers. For label_map, "normal." is set to 0 and every attack type to 1. In other words, 1 is "bad" while 0 is "good".
protocol_type_map = {'icmp': 0,'tcp': 1,'udp': 2}
df['protocol_type'] = df['protocol_type'].map(protocol_type_map)
flag_map = {
'SF': 0,
'S0': 1,
'REJ': 2,
'RSTR': 3,
'RSTO': 4,
'SH': 5,
'S1': 6,
'S2': 7,
'RSTOS0': 8,
'S3': 9,
'OTH': 10
}
df['flag'] = df['flag'].map(flag_map)
label_map = {
'normal.': 0,
'back.': 1,
'buffer_overflow.': 1,
'ftp_write.': 1,
'guess_passwd.': 1,
'imap.': 1,
'ipsweep.': 1,
'land.': 1,
'loadmodule.': 1,
'multihop.': 1,
'neptune.': 1,
'nmap.': 1,
'perl.': 1,
'phf.': 1,
'pod.': 1,
'portsweep.': 1,
'rootkit.': 1,
'satan.': 1,
'smurf.': 1,
'spy.': 1,
'teardrop.': 1,
'warezclient.': 1,
'warezmaster.': 1,
}
df['label'] = df['label'].map(label_map)
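One thing to watch with map: any value that has no entry in the dictionary becomes NaN. The sketch below shows this on a made-up Series, together with an equivalent binary encoding that does not require listing every attack name. The sample values and the name partial_map are for illustration only.

```python
import pandas as pd

labels = pd.Series(["normal.", "smurf.", "neptune.", "normal."])

# A dictionary with a missing key: "neptune." has no entry and maps to NaN.
partial_map = {"normal.": 0, "smurf.": 1}
mapped = labels.map(partial_map)
print(mapped.isna().sum())  # 1

# Equivalent binary encoding: 0 for "normal.", 1 for anything else.
binary = (labels != "normal.").astype(int)
print(binary.tolist())  # [0, 1, 1, 0]
```

Checking df['label'].isna().sum() after the mapping step is a quick way to confirm that every label was covered.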
Extract features by selecting all rows and all columns except for the last column, which contains labels of the dataset.
X = df.iloc[:, :-1].values
Extract the labels by selecting all rows and only the last column:
y = df.iloc[:, -1].values
Split the data into training (70%) and test (30%) sets:
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3, random_state=42)
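StandardScaler is imported at the top of the lab but is not applied in the steps that follow. If you want to experiment with scaled features, one common pattern is to fit the scaler on the training split only and reuse it for the test split. The sketch below uses made-up random data; the names X_demo, y_demo and the split variables are hypothetical and separate from the lab's own variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=100.0, scale=25.0, size=(200, 3))  # made-up features
y_demo = rng.integers(0, 2, size=200)                      # made-up labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

scaler = StandardScaler().fit(X_tr)   # statistics come from the training split only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # the test split reuses the training statistics

print(X_tr.shape, X_te.shape)  # (140, 3) (60, 3)
```

Fitting on the training split only keeps information about the test data out of the preprocessing step.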
Write the classical model, a support vector machine built with scikit-learn's SVC, and time its training and prediction:
svmc_model = SVC(gamma = 'scale')
start_time = time.time()
svmc_model.fit(X_train, Y_train)
end_time = time.time()
print("SVM Training time: ", end_time-start_time)
start_time = time.time()
prediction = svmc_model.predict(X_test)
end_time = time.time()
print("SVM Testing time:", end_time-start_time)
print("SVM Train score is:", svmc_model.score(X_train, Y_train))
print("SVM Test score is:", svmc_model.score(X_test, Y_test))
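The score method of SVC reports mean accuracy, which is the same quantity that accuracy_score computes from explicit predictions. The sketch below checks this on made-up, well-separated toy data; the names X_demo, y_demo and model are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two made-up clusters, one per class.
X_demo = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

model = SVC(gamma="scale").fit(X_demo, y_demo)

via_score = model.score(X_demo, y_demo)
via_metric = accuracy_score(y_demo, model.predict(X_demo))
print(via_score == via_metric)  # True
```

This means the scores printed here are directly comparable with an accuracy computed through accuracy_score.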
Write the quantum circuit, which is based on the work from the "Quantum Support Vector Machines" paper by Havlicek et al. More information is given in the pre-lab. Start by creating a three-qubit simulator device:
dev = qml.device('default.qubit', wires=3)
def feature_map(x):
    for i in range(3):
        qml.Hadamard(wires=i)
        qml.RY(x[i], wires=i)
    qml.CNOT(wires=[0, 1])
    qml.CNOT(wires=[1, 2])
    qml.RZ(x[0] * x[1], wires=2)
    qml.CNOT(wires=[1, 2])
    qml.RY(x[1] * x[2], wires=2)
    qml.CNOT(wires=[0, 1])
    qml.CNOT(wires=[1, 2])
    qml.RZ(x[0] * x[2], wires=2)
    qml.CNOT(wires=[1, 2])
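Write the kernel function. The circuit must run as a QNode on the device and evaluates one pair of samples. scikit-learn calls a custom kernel with two arrays of samples and expects the matrix of pairwise kernel values, so kernel wraps the circuit accordingly. More information is given in the pre-lab.
@qml.qnode(dev)
def kernel_circuit(x1, x2):
    feature_map(x1)
    qml.adjoint(feature_map)(x2)
    return qml.expval(qml.PauliZ(2))
def kernel(A, B):
    # One circuit evaluation per pair (row of A, row of B).
    return np.array([[float(kernel_circuit(a, b)) for b in B] for a in A])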
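To make explicit what the circuit returns for one pair of samples, write U(x) for the unitary applied by feature_map(x). This notation is introduced here for illustration and is not taken from the pre-lab. The circuit applies U(x1), then the adjoint of U(x2), to the initial state |000> and measures Pauli-Z on the third qubit (wire 2), so:

```latex
k(x_1, x_2) = \langle 000 |\, U(x_1)^{\dagger}\, U(x_2)\, Z_2\, U(x_2)^{\dagger}\, U(x_1)\, | 000 \rangle
```

For identical inputs, U(x)^{\dagger} U(x) = I, so the state stays |000> and k(x, x) = <000| Z_2 |000> = 1.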
Define the quantum SVM model:
svm = SVC(kernel=kernel, C=1.0)
Train and test the model:
svm.fit(X_train, Y_train)
y_pred = svm.predict(X_test)
Compute the accuracy:
accuracy = accuracy_score(Y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
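With a custom kernel function, scikit-learn needs a kernel value for every pair of samples, so the number of circuit evaluations grows quadratically with the size of the training set. The sketch below estimates that cost and draws a random subsample to keep the run time manageable. Both helper names are hypothetical, and the count assumes the kernel fills the whole matrix without exploiting symmetry.

```python
import numpy as np

def kernel_evaluations(n_train, n_test):
    """Circuit runs for a pairwise kernel: an n_train x n_train matrix for
    fitting plus an n_test x n_train matrix for prediction."""
    return n_train * n_train + n_test * n_train

def subsample(X, y, n, seed=42):
    """Pick n rows at random, without replacement (hypothetical helper)."""
    idx = np.random.default_rng(seed).choice(len(X), size=n, replace=False)
    return X[idx], y[idx]

print(kernel_evaluations(200, 50))  # 50000

# Hypothetical usage with the lab's arrays:
# X_small, Y_small = subsample(X_train, Y_train, 200)
```

Training the quantum model on a subsample of a few hundred rows is one way to get a result in reasonable time on a simulator; the subsample size is a trade-off between run time and accuracy.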