After completing this learning module, students will be able to:
Describe a real-world case study of credit card transaction fraud
Explain why regression and its variations are useful for detecting anomalies
Apply logistic regression to analyze and detect fraudulent transactions
Financial Fraud
Financial fraud involves credit card transactions, insurance claims, tax return claims, and many other activities, and detecting and preventing it is not a simple task. Fraud detection is a classification problem: the goal is to predict a discrete class label for each data observation. Other applications of classification include spam detectors, recommender systems, and loan default prediction.
Classification with machine learning can be used to tackle fraud by testing, detecting, validating, and monitoring financial systems for fraudulent activity.
For credit card payment fraud detection, the classifier labels each transaction as legitimate or fraudulent based on transaction details such as amount, merchant, location, and time.
Regression analysis investigates and estimates the relationships between two or more variables of interest. Understanding these relationships makes it possible to predict one variable from the others.
Fraudsters are always finding new ways to attack. Relying exclusively on traditional, conventional methods to detect such fraud would not provide an effective solution. Machine learning can complement these methods by learning fraud patterns from data.
Logistic Regression
Logistic regression is a classical supervised learning classifier that is often used in data mining, disease diagnosis, and economic prediction. Its output is the predicted probability that an observation belongs to a class.
Types of Logistic Regression:
Binomial Logistic Regression:
The target variable has only two possible values, usually "0" or "1".
Multinomial:
The target variable has three or more unordered values, such as "Red", "Blue", or "Green".
Ordinal:
The target variable has three or more ordered values, such as "bad", "normal", "good", or "excellent".
In this lab, we will focus on binomial logistic regression.
Sigmoid function:
The sigmoid function, sigmoid(z) = 1 / (1 + e^(-z)), takes any real number and maps it into the range between 0 and 1. As the input approaches positive infinity, the output approaches 1; as the input approaches negative infinity, the output approaches 0. If the output is greater than the threshold (usually 0.5), we treat it as 1 and label it "yes"; if the output is less than the threshold, we treat it as 0 and label it "no".
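The behavior described above can be checked numerically. The following is a minimal sketch using NumPy; the function name `sigmoid` and the sample scores are chosen here only for illustration and are not part of the lab code.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array of numbers) into the open interval (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw scores, chosen only for illustration.
scores = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
probabilities = sigmoid(scores)

# Apply the usual 0.5 threshold: outputs above it become 1, the rest become 0.
threshold = 0.5
labels = np.where(probabilities > threshold, 1, 0)

print(probabilities)
print(labels)
```

Large negative scores give outputs near 0, large positive scores give outputs near 1, and a score of exactly 0 gives 0.5, which is not greater than the threshold and so is labeled 0 in this sketch.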
We will use a simple synthetic dataset, created for this exercise, to implement the logistic regression algorithm.
Copy and paste the following code and run it in Colab.
import numpy as np
import pandas as pd
X =[[34.62365962451697,78.0246928153624],
[30.28671076822607,43.89499752400101],
[35.84740876993872,72.90219802708364],
[60.18259938620976,86.30855209546826],
[79.0327360507101,75.3443764369103],
[45.08327747668339,56.3163717815305],
[61.10666453684766,96.51142588489624],
[75.02474556738889,46.55401354116538],
[76.09878670226257,87.42056971926803],
[84.43281996120035,43.53339331072109],
[95.86155507093572,38.22527805795094],
[75.01365838958247,30.60326323428011],
[82.30705337399482,76.48196330235604],
[69.36458875970939,97.71869196188608],
[39.53833914367223,76.03681085115882],
[53.9710521485623,89.20735013750205],
[69.07014406283025,52.74046973016765],
[67.94685547711617,46.67857410673128],
[70.66150955499435,92.92713789364831],
[76.97878372747498,47.57596364975532],
[67.37202754570876,42.83843832029179],
[89.67677575072079,65.79936592745237],
[50.534788289883,48.85581152764205],
[34.21206097786789,44.20952859866288],
[77.9240914545704,68.9723599933059],
[62.27101367004632,69.95445795447587],
[80.1901807509566,44.82162893218353],
[93.114388797442,38.80067033713209],
[61.83020602312595,50.25610789244621],
[38.78580379679423,64.99568095539578],
[61.379289447425,72.80788731317097],
[85.40451939411645,57.05198397627122],
[52.10797973193984,63.12762376881715],
[52.04540476831827,69.43286012045222]]
y = [0,0,0,1,1,0,1,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,0,0,1,1,1,0,0,0,1,1,0,1]
# Split the data into training (80%) and test (20%) sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
# Feature scaling: fit the scaler on the training set only, then apply it to the test set
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# Fit logistic regression to the training set
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score, roc_auc_score, precision_score
classifier = LogisticRegression(random_state=0, solver='lbfgs')
classifier.fit(X_train, y_train)
#print(X_train.shape)
# Predict the test set results: label a sample 1 when P(y = 1) exceeds the threshold
threshold = 0.5
y_pred = np.where(classifier.predict_proba(X_test)[:, 1] > threshold, 1, 0)
# Build the confusion matrix; accuracy is the diagonal (correct predictions) over the total
matrix = confusion_matrix(y_test, y_pred)
accuracy = np.trace(matrix) / float(np.sum(matrix))
print("Confusion Matrix")
print(matrix)
print("The accuracy is: {:.2%}".format(accuracy))
The result should be similar to the following picture
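Accuracy alone does not show which kind of mistake a classifier makes. As a minimal sketch, a binary confusion matrix can be unpacked into its four cells and summarized with precision and recall. The labels below are hypothetical and are not the lab's data; `confusion_matrix`, `precision_score`, and `recall_score` are from scikit-learn's `sklearn.metrics` module.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (1 = fraud, 0 = legitimate),
# used only to illustrate how the matrix is read.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Accuracy:  {:.2%}".format(accuracy))
print("Precision: {:.2%}".format(precision))
print("Recall:    {:.2%}".format(recall))
```

In a fraud setting, recall is the share of fraudulent transactions that were caught, and precision is the share of flagged transactions that were actually fraudulent.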
Imbalanced Datasets
An imbalanced dataset is one used in a classification problem where the classes are not equally represented. Imbalanced datasets are common in everyday applications such as spam detection, credit card fraud detection, and natural disaster detection. The majority class is the class that makes up the larger proportion of the examples.
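One consequence of class imbalance is that accuracy can look excellent even when the model never detects the rare class. The following is a minimal sketch with made-up labels (990 legitimate and 10 fraudulent transactions, an assumed ratio chosen only for illustration) and a naive "model" that always predicts the majority class.

```python
import numpy as np

# Hypothetical, highly imbalanced labels: 990 legitimate (0), 10 fraudulent (1).
y_true = np.array([0] * 990 + [1] * 10)

# A naive "model" that always predicts the majority class (0).
y_pred = np.zeros_like(y_true)

# Accuracy: fraction of predictions that match the true labels.
accuracy = np.mean(y_pred == y_true)

# Recall on the fraud class: fraction of true fraud cases that were flagged.
true_positives = np.sum((y_pred == 1) & (y_true == 1))
recall = true_positives / np.sum(y_true == 1)

print("Accuracy: {:.2%}".format(accuracy))
print("Recall on the fraud class: {:.2%}".format(recall))
```

The naive model is right 99% of the time yet catches none of the fraudulent transactions, which is why accuracy by itself is a weak measure on imbalanced data.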