Hands-on Lab Practice

The dataset in this module is used for credit card fraud detection.

The dataset has these attributes: the transaction time, 28 features generated with principal component analysis (PCA), and the transaction amount.

In this module, we will implement logistic regression for financial fraud prediction in a new Google Colab notebook. The dataset contains several attributes that help us decide whether a transaction is fraudulent or legitimate. The attributes are the following:

TIME: The number of seconds elapsed between this transaction and the first transaction in the Credit Card Fraud dataset.

V (1-28): V1 through V28 are the result of a PCA dimensionality reduction, applied to protect user identities and sensitive features.

AMOUNT: The transaction amount.

CLASS: The label we want to predict: 1 for a fraudulent transaction and 0 otherwise.
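To give a feel for what a PCA transformation does, here is a minimal sketch on synthetic data. It is not how the dataset's authors produced V1 to V28 (the original features are confidential); the array raw_features and the choice of 3 components are assumptions made only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for confidential transaction features (hypothetical data).
rng = np.random.default_rng(0)
raw_features = rng.normal(size=(500, 6))
raw_features[:, 1] += 0.8 * raw_features[:, 0]  # make two columns correlated

# Project the data onto 3 principal components.
# The resulting columns play the same role as V1, V2, ... in the dataset.
pca = PCA(n_components=3)
components = pca.fit_transform(raw_features)

print(components.shape)               # (500, 3)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```

The transformed columns no longer correspond to any single original feature, which is why PCA output can be shared without exposing the raw fields.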

Copy and paste the following link into your browser to open Google Colab:

https://colab.research.google.com/notebooks/welcome.ipynb

Then click File --> New notebook

Click the notebook title (the area marked with a red box in the screenshot) and change the file name to Logistic Regression for financial fraud prediction.ipynb

Next, click Runtime, then Change runtime type, and set Hardware accelerator to GPU. This step is optional: the scikit-learn code in this lab runs on the CPU, so a GPU will not make it faster.

In the first code cell, copy and paste the following code to upload the dataset to Google Colab:

from google.colab import files

uploaded = files.upload()

Then click the Run button (it looks like a play button) to run this code cell.

After the cell runs successfully, the output should look like the following picture. Next, click Choose Files and upload the credit card fraud dataset (creditcard.csv) to Google Colab.

The dataset can be downloaded from the link below. You need a Gmail account to sign in, and you must agree to the terms and conditions.

https://www.kaggle.com/mlg-ulb/creditcardfraud/download

The upload might take about 10 minutes because the dataset is large (284,807 transactions).

Next, create a new cell, then copy, paste, and run the following code.

This code reads the data from the CSV file and splits it into a training set and a test set. With test_size = 0.25, the test set is 25% of the whole dataset and the training set is the remaining 75%.

import numpy as np

import pandas as pd

dataset = pd.read_csv("creditcard.csv",sep = ',')

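# Features: columns 0-29 (Time, V1-V28, Amount)

X = dataset.iloc[:,0:30]

# Label: column 30 (Class)

y = dataset.iloc[:,30]

# Show how many samples each class has

print(y.value_counts())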

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.25,random_state = 0)
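Because fraud cases are rare, a purely random split can give the training and test sets slightly different fraud rates. One common option, not used in the lab code above, is the stratify argument of train_test_split, which keeps the class proportions the same in both sets. The following is a minimal sketch on synthetic data; X_demo and y_demo are made-up stand-ins, not the credit card dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10,000 rows, 1% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10_000, 5))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:100] = 1  # exactly 100 "fraud" rows

# stratify=y_demo keeps the 1% fraud rate in both the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

print(len(X_tr), len(X_te))    # 7500 2500
print(y_tr.sum(), y_te.sum())  # 75 25
```

To try this in the lab, you could add stratify = y to the train_test_split call; expect the printed metrics to change slightly.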

Next, we do some feature scaling. Create a new cell, copy and paste the following code and run it.

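The purpose of this step is to standardize each feature to zero mean and unit variance, so that features on very different scales (such as Time and Amount) become comparable. It can also reduce the training time of the algorithm. Note that the scaler is fitted on the training set only and then applied to the test set, so no information from the test set leaks into training.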

#Feature Scaling

from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()

X_train = sc_X.fit_transform(X_train)

X_test = sc_X.transform(X_test)
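To check what StandardScaler actually does, the following sketch scales two synthetic columns on very different scales. The arrays train_demo and test_demo are hypothetical and only loosely imitate the Time and Amount columns.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two columns on very different scales (hypothetical values).
train_demo = np.column_stack(
    [rng.uniform(0, 170_000, 1000), rng.exponential(90, 1000)]
)
test_demo = np.column_stack(
    [rng.uniform(0, 170_000, 300), rng.exponential(90, 300)]
)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_demo)  # learns mean and std from the training data
test_scaled = scaler.transform(test_demo)        # reuses the training mean and std

print(train_scaled.mean(axis=0))  # each column mean is very close to 0
print(train_scaled.std(axis=0))   # each column standard deviation is very close to 1
print(test_scaled.mean(axis=0))   # close to 0, but generally not exactly 0
```

The test set is transformed with the statistics learned from the training set, which is why its mean is only approximately zero.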

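After feature scaling, we use the logistic regression algorithm to train a model and predict the labels of the test set. The code labels a transaction as fraud when its predicted fraud probability is greater than the threshold of 0.1, instead of the default 0.5.

Finally, we calculate the accuracy and build the confusion matrix.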

Copy and paste the following code and then run it. The result should be similar to the following picture.

from sklearn.metrics import accuracy_score, confusion_matrix, recall_score, roc_auc_score, precision_score

#Fitting Logistic Regression to the Training set

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state = 0,solver='lbfgs')

classifier.fit(X_train, y_train)

#predicting the test set result

threshold = 0.1

y_pred = np.where(classifier.predict_proba(X_test)[:,1]>threshold,1,0)

#print(y_pred)

ff=pd.DataFrame(data=[accuracy_score(y_test, y_pred), recall_score(y_test, y_pred),

precision_score(y_test, y_pred), roc_auc_score(y_test, y_pred)],

index=["accuracy", "recall", "precision", "roc_auc_score"])

print(ff)

#Making the confusion Matrix

from sklearn.metrics import confusion_matrix

matrix = confusion_matrix(y_test,y_pred)

accuracy = np.trace(matrix) / float(np.sum(matrix))

print("Confusion Matrix")

print(matrix)

print("The accuracy is: {:.2%}".format(accuracy))
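The code above uses a threshold of 0.1 on the predicted fraud probability. To see how the threshold affects the results, here is a minimal sketch that compares thresholds of 0.5 and 0.1 on synthetic imbalanced data. The data comes from make_classification and is only an assumption for illustration; the numbers it prints say nothing about the credit card dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 2% positives); not the credit card dataset.
X_demo, y_demo = make_classification(
    n_samples=20_000, n_features=10, weights=[0.98], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
fraud_prob = model.predict_proba(X_te)[:, 1]  # predicted probability of class 1

# Lowering the threshold flags more samples as positive.
for threshold in (0.5, 0.1):
    y_hat = (fraud_prob > threshold).astype(int)
    print(
        threshold,
        "recall=%.3f" % recall_score(y_te, y_hat),
        "precision=%.3f" % precision_score(y_te, y_hat, zero_division=0),
    )
```

A lower threshold can never reduce recall, because every sample flagged at 0.5 is also flagged at 0.1; the usual price is lower precision, that is, more false alarms.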

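As we can see, the accuracy is 99.94%. Because fraudulent transactions are rare, accuracy alone can look high even when many frauds are missed, so also check the recall, the precision, and the confusion matrix printed by the code.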
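To see why a high accuracy needs care on imbalanced data, consider a "model" that labels every transaction as legitimate. The class counts below are hypothetical, chosen only to imitate a test set with very few frauds.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical test set: 71,000 legitimate transactions and 120 frauds.
y_true = np.concatenate([np.zeros(71_000, dtype=int), np.ones(120, dtype=int)])

# A "model" that labels every transaction as legitimate (class 0).
y_always_legit = np.zeros_like(y_true)

print("accuracy: {:.2%}".format(accuracy_score(y_true, y_always_legit)))  # 99.83%
print("recall:   {:.2%}".format(recall_score(y_true, y_always_legit)))    # 0.00%
```

This baseline reaches very high accuracy while catching no fraud at all, so recall and precision are the more informative metrics for this task.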
