In this module, we will implement deep learning for malware classification and protection in a new Google Colab notebook. Our dataset contains many attributes that help us decide whether an Android application is malicious or benign. The dataset's attributes include the following: transact, onServiceConnected, bindService, attachInterface, ServiceConnection, SET_PREFERRED_APPLICATIONS, WRITE_SECURE_SETTINGS, send_sms, getBinder, get_accounts, recieve_sms, getcallingUid, use_credentials, manage_accounts, keyspec, duration, service, class, and many more.
Hands-on Lab using Google Colab: Android Malware Detection with a Deep Neural Network and 5-Fold Cross-Validation
Copy and paste the following link to open Google Colab:
https://colab.research.google.com/notebooks/welcome.ipynb
Then click File --> New notebook
Click the file name field at the top of the notebook and rename the file to Android Malware Detection.
Next, click Runtime --> Change runtime type and set the Hardware accelerator to GPU (training will run faster than on a CPU).
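Optionally, you can verify that a GPU is actually attached by running the following in a code cell (TensorFlow comes pre-installed on Colab):
import tensorflow as tf
# An empty list means no GPU is attached; re-check the runtime type if so
print(tf.config.list_physical_devices('GPU'))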
When you copy the code, you may get indentation errors. To avoid this, open the following link, which contains the same code with proper indentation: https://colab.research.google.com/drive/1bYwBo8ZCIxg4reMTX6E5D1nIJiQQAm_V
In the first code cell, copy and paste the following code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
le = LabelEncoder()
Then click the run button (it looks like a play button) to run this code cell.
Now we have to upload the dataset to Google Colab:
from google.colab import files
uploaded = files.upload()
After the cell runs successfully, a Choose Files button appears. Click it to upload the dataset into Google Colab from your local folder. Download the dataset first if you have not already: drebin.csv
It might take a few minutes to upload the file.
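Once the upload finishes, you can optionally confirm it succeeded: files.upload() returns a dictionary keyed by file name, so listing its keys should show the dataset:
print(list(uploaded.keys()))  # should print ['drebin.csv']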
Next, create a new cell, copy and paste the following code, and run it. This code loads the data and displays the first five observations:
data = pd.read_csv('drebin.csv')
data.head()
To check the data shape, run the following code. It returns (15031, 216), meaning the data contains 15031 observations (rows) and 216 attributes (columns).
data.shape
The following code fills missing values with np.nan to ease further work, removes observations containing '?', and resets the index:
data.fillna(value=np.nan, inplace=True)
data = data[(data.astype(str) != '?').all(axis=1)]
data = data.reset_index(drop = True)
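As a quick optional sanity check, you can confirm that no missing values remain and see the shape after filtering:
print(data.isnull().sum().sum())  # total count of remaining missing values
print(data.shape)                 # shape after removing rows containing '?'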
The dataset contains both the set of independent variables and the dependent variable (whether the Android application is malware or benign). The following code separates the dependent variable and displays its first five observations:
y = data.iloc[:, -1]  # select the last column (the class label)
y.head()
Since the dependent variable is in character format, it needs to be transformed into numerical form (a NumPy array). The following code does that using the LabelEncoder created earlier:
y = le.fit_transform(y)
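If you want to see how the labels were encoded, you can inspect the classes learned by the LabelEncoder and count each encoded class (an optional check):
print(le.classes_)                       # the original class labels, in encoding order
print(np.unique(y, return_counts=True))  # the encoded labels and their frequencies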
The following code separates the set of independent variables and converts it into NumPy array form:
X = data.iloc[:,:-1]
X = np.array(X)
The following code imports the logistic regression model and the metrics module for evaluation:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
The following code defines a function that takes the training data (Xtrain, ytrain) and test data (Xtest, ytest) and returns the F-beta score, true positive rate (TPR), and false positive rate (FPR). The F-beta score generalizes the F1 score: F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall), so with beta = 10 recall is weighted much more heavily than precision, which suits malware detection, where missing malware (a false negative) is costlier than a false alarm.
def mal_detection_LR(Xtrain, Xtest, ytrain, ytest):
    LR_classifier = LogisticRegression(max_iter=1000)  # raise the iteration limit so the solver can converge
    LR_classifier.fit(Xtrain, ytrain)  # Train the model
    pred_prob = LR_classifier.predict_proba(Xtest)[:, 1]  # Predict probabilities for the test data
    pred = (pred_prob > 0.5)  # Apply threshold to convert probabilities into binary predictions
    fbeta = metrics.fbeta_score(ytest, pred, average='weighted', beta=10)  # Calculate F-beta score
    tn, fp, fn, tp = metrics.confusion_matrix(ytest, pred).ravel()  # Calculate confusion matrix
    tpr = tp / (tp + fn)  # Calculate true positive rate (TPR)
    fpr = fp / (fp + tn)  # Calculate false positive rate (FPR)
    return fbeta, tpr, fpr
The purpose of this step is to perform 5-fold cross-validation with stratified sampling, applying the function defined above. This code produces results for each of the five folds:
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cvscores_LR = []
for train, test in kfold.split(X, y):
    fbeta, tpr, fpr = mal_detection_LR(X[train], X[test], y[train], y[test])
    cvscores_LR.append((fbeta, tpr, fpr))
To access the results for each fold, type:
cvscores_LR
The following code formats the results and displays them as a table:
pd.DataFrame(cvscores_LR, columns=['F-beta Score', 'TPR', 'FPR'])
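To summarize performance across the five folds, you can also average the per-fold scores (an optional extra step):
pd.DataFrame(cvscores_LR, columns=['F-beta Score', 'TPR', 'FPR']).mean()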
The following code defines a Support Vector Machine (SVM) classifier function, analogous to the logistic regression one above, and runs the same 5-fold cross-validation:
from sklearn.svm import SVC

def mal_detection_SVM(Xtrain, Xtest, ytrain, ytest):
    SVM_classifier = SVC(kernel='rbf', probability=True)
    SVM_classifier.fit(Xtrain, ytrain)
    pred_prob = SVM_classifier.predict_proba(Xtest)[:, 1]
    pred = (pred_prob > 0.5)
    fbeta = metrics.fbeta_score(ytest, pred, average='weighted', beta=10)
    tn, fp, fn, tp = metrics.confusion_matrix(ytest, pred).ravel()
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return fbeta, tpr, fpr

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cvscores_SVM = []
for train, test in kfold.split(X, y):
    cvscores_SVM.append(mal_detection_SVM(Xtrain=X[train], Xtest=X[test], ytrain=y[train], ytest=y[test]))
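Note that SVC with probability=True fits an extra internal cross-validation (Platt scaling) to calibrate probabilities, so this cell takes noticeably longer to run than logistic regression. As before, the per-fold SVM results can be displayed as a table:
pd.DataFrame(cvscores_SVM, columns=['F-beta Score', 'TPR', 'FPR'])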
To build a neural network, the following code imports the necessary packages:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping
A neural network with one input layer, three hidden layers, and an output layer is developed in the following code.
def mal_detection_NN(Xtrain, Xtest, ytrain, ytest, num_epoch):
    Xtrain = Xtrain.astype('float32')
    Xtest = Xtest.astype('float32')
    ytrain = ytrain.astype('float32')
    ytest = ytest.astype('float32')
    mal_classifier = Sequential()
    # first hidden layer
    mal_classifier.add(Dense(units=25, activation='relu',
                             activity_regularizer=regularizers.l2(0.000001), input_dim=215))
    mal_classifier.add(Dropout(0.2))
    # 2nd hidden layer
    mal_classifier.add(Dense(units=15, activation='relu',
                             activity_regularizer=regularizers.l2(0.000001)))
    mal_classifier.add(Dropout(0.2))
    # 3rd hidden layer
    mal_classifier.add(Dense(units=5, activation='relu',
                             activity_regularizer=regularizers.l2(0.000001)))
    # output layer
    mal_classifier.add(Dense(units=1, activation='sigmoid'))
    # defining optimizer
    adam = tf.keras.optimizers.Adam()
    # compiling model
    mal_classifier.compile(optimizer=adam, loss='binary_crossentropy', metrics=['accuracy'])
    # defining callback to stop training early, monitoring validation loss
    monitor = EarlyStopping(monitor='val_loss', min_delta=1e-4, patience=5, verbose=0, mode='auto',
                            restore_best_weights=True)
    callbacks_list = [monitor]
    # fitting the training data
    history = mal_classifier.fit(Xtrain, ytrain, validation_data=(Xtest, ytest), callbacks=callbacks_list,
                                 epochs=num_epoch, batch_size=64)
    # predicting on test data
    pred_prob = mal_classifier.predict(Xtest)
    pred = (pred_prob > 0.5)
    # evaluating the model
    fbeta = metrics.fbeta_score(ytest, pred, average='weighted', beta=10)
    tn, fp, fn, tp = metrics.confusion_matrix(ytest, pred).ravel()
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # summarize history for loss
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')
    plt.show()
    return fbeta, tpr, fpr
This step performs the stratified 5-fold cross-validation with the neural network classifier defined above:
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cvscores_NN = []
for train, test in kfold.split(X, y):
    cvscores_NN.append(mal_detection_NN(Xtrain=X[train], Xtest=X[test], ytrain=y[train],
                                        ytest=y[test], num_epoch=100))
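The neural network's per-fold results can be tabulated the same way:
pd.DataFrame(cvscores_NN, columns=['F-beta Score', 'TPR', 'FPR'])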
The following code compares the results of the three classifiers above (LR, SVM, and NN) in terms of F-beta score:
fbeta_nn = pd.DataFrame([a for a, b, c in cvscores_NN], columns=['Neural Network'])
fbeta_svm = pd.DataFrame([a for a, b, c in cvscores_SVM], columns=['SVM'])
fbeta_lr = pd.DataFrame([a for a, b, c in cvscores_LR], columns=['LR'])
fbeta_all = pd.concat([fbeta_nn, fbeta_svm, fbeta_lr], axis=1)
fbeta_all
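For a quick visual comparison, you can also box-plot the per-fold F-beta scores and look at each classifier's average (an optional extra step):
fbeta_all.plot.box()
plt.ylabel('F-beta Score')
plt.show()
fbeta_all.mean()  # average F-beta score per classifier across the five folds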