In this module, we will implement deep learning for malware classification and protection in a new Google Colab notebook. Our dataset contains many attributes that help us decide whether an Android application is malicious or benign. The dataset's attributes include the following: transact, onServiceConnected, bindService, attachInterface, ServiceConnection, SET_PREFERRED_APPLICATIONS, WRITE_SECURE_SETTINGS, send_sms, getBinder, get_accounts, recieve_sms, getcallingUid, use_credentials, manage_accounts, keyspec, duration, service, class, and many more.
Hands-on Lab using Google Colab: Android Malware Detection with a Deep Neural Network and 5-Fold Cross-Validation
Copy and paste the following link to open Google Colab:
https://colab.research.google.com/notebooks/welcome.ipynb
Then click File --> New notebook
Click the file name field at the top of the notebook and rename the file to Android Malware Detection.
Next, click Runtime --> Change runtime type and set the Hardware accelerator to GPU (training will run faster than on a CPU).
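Optionally, you can verify that a GPU is actually attached by running the following in a code cell (TensorFlow comes pre-installed on Colab):
import tensorflow as tf
# An empty list means no GPU is attached; re-check the runtime type if so
print(tf.config.list_physical_devices('GPU'))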
When you copy the code, you may get indentation errors. To avoid this, open the following link, which contains the same code with proper indentation: https://colab.research.google.com/drive/1bYwBo8ZCIxg4reMTX6E5D1nIJiQQAm_V
In the first code cell, copy and paste the following code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
le = LabelEncoder()
Then click the run button (it looks like a play button) to run this code cell.
Now we have to upload the dataset to Google Colab:
from google.colab import files
uploaded = files.upload()
After the cell runs successfully, a Choose Files button appears. Click it to upload the dataset into Google Colab from your local folder. Download the dataset first if you have not already: drebin.csv
It might take a few minutes to upload the file.
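Once the upload finishes, you can optionally confirm it succeeded: files.upload() returns a dictionary keyed by file name, so listing its keys should show the dataset:
print(list(uploaded.keys()))  # should print ['drebin.csv']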
Next, create a new cell, copy and paste the following code, and run it. This code loads the data and displays the first five observations:
data = pd.read_csv('drebin.csv')
data.head()
To check the data shape, run the following code. It returns (15031, 216), meaning the data contains 15031 observations (rows) and 216 attributes (columns).
data.shape
The following code fills missing values with np.nan to ease further work, removes observations containing '?', and resets the index:
data.fillna(value=np.nan, inplace=True)
data = data[(data.astype(str) != '?').all(axis=1)]
data = data.reset_index(drop = True)
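As a quick optional sanity check, you can confirm that no missing values remain and see the shape after filtering:
print(data.isnull().sum().sum())  # total count of remaining missing values
print(data.shape)                 # shape after removing rows containing '?'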
The dataset contains both the set of independent variables and the dependent variable (whether the Android application is malware or benign). The following code separates the dependent variable and displays its first five observations:
y = data.iloc[:, -1]  # select the last column (the class label)
y.head()
Since the dependent variable is in character format, it needs to be transformed into numerical form (a NumPy array). The following code does that using the LabelEncoder created earlier:
y = le.fit_transform(y)
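If you want to see how the labels were encoded, you can inspect the classes learned by the LabelEncoder and count each encoded class (an optional check):
print(le.classes_)                       # the original class labels, in encoding order
print(np.unique(y, return_counts=True))  # the encoded labels and their frequencies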
The following code separates the set of independent variables and converts it into NumPy array form:
X = data.iloc[:,:-1]
X = np.array(X)
The following code imports the logistic regression model and the metrics module for evaluation:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
The following code defines a function that takes the training data (Xtrain, ytrain) and test data (Xtest, ytest) and returns the F-beta score, true positive rate (TPR), and false positive rate (FPR). The F-beta score generalizes the F1 score: F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall), so with beta = 10 recall is weighted much more heavily than precision, which suits malware detection, where missing malware (a false negative) is costlier than a false alarm.
def mal_detection_LR(Xtrain, Xtest, ytrain, ytest):
    LR_classifier = LogisticRegression(max_iter=1000)  # raise the iteration limit so the solver can converge
    LR_classifier.fit(Xtrain, ytrain)  # Train the model
    pred_prob = LR_classifier.predict_proba(Xtest)[:, 1]  # Predict probabilities for the test data
    pred = (pred_prob > 0.5)  # Apply threshold to convert probabilities into binary predictions
    fbeta = metrics.fbeta_score(ytest, pred, average='weighted', beta=10)  # Calculate F-beta score
    tn, fp, fn, tp = metrics.confusion_matrix(ytest, pred).ravel()  # Calculate confusion matrix
    tpr = tp / (tp + fn)  # Calculate true positive rate (TPR)
    fpr = fp / (fp + tn)  # Calculate false positive rate (FPR)
    return fbeta, tpr, fpr
The purpose of this step is to perform 5-fold cross-validation with stratified sampling, applying the function defined above. This code produces results for each of the five folds:
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cvscores_LR = []
for train, test in kfold.split(X, y):
    fbeta, tpr, fpr = mal_detection_LR(X[train], X[test], y[train], y[test])
    cvscores_LR.append((fbeta, tpr, fpr))
To access the results for each fold, type:
cvscores_LR
The following code formats the results and displays them as a table:
pd.DataFrame(cvscores_LR, columns=['F-beta Score', 'TPR', 'FPR'])
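To summarize performance across the five folds, you can also average the per-fold scores (an optional extra step):
pd.DataFrame(cvscores_LR, columns=['F-beta Score', 'TPR', 'FPR']).mean()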
The following code defines a Support Vector Machine (SVM) classifier function, analogous to the logistic regression one above, and runs the same 5-fold cross-validation:
from sklearn.svm import SVC

def mal_detection_SVM(Xtrain, Xtest, ytrain, ytest):
    SVM_classifier = SVC(kernel='rbf', probability=True)
    SVM_classifier.fit(Xtrain, ytrain)
    pred_prob = SVM_classifier.predict_proba(Xtest)[:, 1]
    pred = (pred_prob > 0.5)
    fbeta = metrics.fbeta_score(ytest, pred, average='weighted', beta=10)
    tn, fp, fn, tp = metrics.confusion_matrix(ytest, pred).ravel()
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return fbeta, tpr, fpr

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cvscores_SVM = []
for train, test in kfold.split(X, y):
    cvscores_SVM.append(mal_detection_SVM(Xtrain=X[train], Xtest=X[test], ytrain=y[train], ytest=y[test]))
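Note that SVC with probability=True fits an extra internal cross-validation (Platt scaling) to calibrate probabilities, so this cell takes noticeably longer to run than logistic regression. As before, the per-fold SVM results can be displayed as a table:
pd.DataFrame(cvscores_SVM, columns=['F-beta Score', 'TPR', 'FPR'])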
To build a neural network, the following code imports the necessary packages:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping
A neural network with one input layer, three hidden layers, and an output layer is developed in the following code.
def mal_detection_NN(Xtrain, Xtest, ytrain, ytest, num_epoch):
    Xtrain = Xtrain.astype('float32')
    Xtest = Xtest.astype('float32')
    ytrain = ytrain.astype('float32')
    ytest = ytest.astype('float32')
    mal_classifier = Sequential()
    # first hidden layer
    mal_classifier.add(Dense(units=25, activation='relu',
                             activity_regularizer=regularizers.l2(0.000001), input_dim=215))
    mal_classifier.add(Dropout(0.2))
    # 2nd hidden layer
    mal_classifier.add(Dense(units=15, activation='relu',
                             activity_regularizer=regularizers.l2(0.000001)))
    mal_classifier.add(Dropout(0.2))
    # 3rd hidden layer
    mal_classifier.add(Dense(units=5, activation='relu',
                             activity_regularizer=regularizers.l2(0.000001)))
    # output layer
    mal_classifier.add(Dense(units=1, activation='sigmoid'))
    # defining optimizer
    adam = tf.keras.optimizers.Adam()
    # compiling model
    mal_classifier.compile(optimizer=adam, loss='binary_crossentropy', metrics=['accuracy'])
    # defining callback to stop training early, monitoring validation loss
    monitor = EarlyStopping(monitor='val_loss', min_delta=1e-4, patience=5, verbose=0, mode='auto',
                            restore_best_weights=True)
    callbacks_list = [monitor]
    # fitting the training data
    history = mal_classifier.fit(Xtrain, ytrain, validation_data=(Xtest, ytest), callbacks=callbacks_list,
                                 epochs=num_epoch, batch_size=64)
    # predicting on test data
    pred_prob = mal_classifier.predict(Xtest)
    pred = (pred_prob > 0.5)
    # evaluating the model
    fbeta = metrics.fbeta_score(ytest, pred, average='weighted', beta=10)
    tn, fp, fn, tp = metrics.confusion_matrix(ytest, pred).ravel()
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # summarize history for loss
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')
    plt.show()
    return fbeta, tpr, fpr
This step performs the stratified 5-fold cross-validation with the neural network classifier defined above:
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cvscores_NN = []
for train, test in kfold.split(X, y):
    cvscores_NN.append(mal_detection_NN(Xtrain=X[train], Xtest=X[test], ytrain=y[train],
                                        ytest=y[test], num_epoch=100))
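The neural network's per-fold results can be tabulated the same way:
pd.DataFrame(cvscores_NN, columns=['F-beta Score', 'TPR', 'FPR'])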
The following code compares the results of the three classifiers above (LR, SVM, and NN) in terms of F-beta score:
fbeta_nn = pd.DataFrame([a for a, b, c in cvscores_NN], columns=['Neural Network'])
fbeta_svm = pd.DataFrame([a for a, b, c in cvscores_SVM], columns=['SVM'])
fbeta_lr = pd.DataFrame([a for a, b, c in cvscores_LR], columns=['LR'])
fbeta_all = pd.concat([fbeta_nn, fbeta_svm, fbeta_lr], axis=1)
fbeta_all
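For a quick visual comparison, you can also box-plot the per-fold F-beta scores and look at each classifier's average (an optional extra step):
fbeta_all.plot.box()
plt.ylabel('F-beta Score')
plt.show()
fbeta_all.mean()  # average F-beta score per classifier across the five folds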