Learning Objectives:
Financial Fraud
Financial fraud spans credit card transactions, insurance claims, tax return claims, and many other areas, and detecting and preventing it is not a simple task. Fraud detection is a classification problem: the goal is to predict a discrete class label for each data observation. Other applications of classification include spam detectors, recommender systems, and loan default prediction.
Classification analytics with machine learning can be used to tackle fraud by testing, detecting, validating, and monitoring financial systems for fraudulent activity.
For credit card payment fraud detection, a classifier labels each transaction as legitimate or fraudulent based on transaction details such as amount, merchant, location, and time.
Regression analysis examines and estimates the relationships between two or more variables of interest. Understanding these relationships helps identify how the variables relate to one another and supports making predictions.
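As a rough illustration of estimating a relationship between two variables and then using it to predict, the sketch below fits a straight line by ordinary least squares, which is one simple form of regression analysis. The function name fit_line and the data points are made up for this example and are not part of the lab.

```python
def fit_line(xs, ys):
    """Estimate slope and intercept of y = intercept + slope * x by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the shared 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that lies exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(xs, ys)
prediction = intercept + slope * 6.0  # predict y for a new x value

print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction at x=6: {prediction:.2f}")
```

The estimated slope describes how much y changes per unit change in x, and the fitted line can then be evaluated at new x values to make a prediction.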
Fraudsters are always finding new ways to attack. Relying exclusively on traditional methods to detect such fraud would not provide an effective solution. Machine learning can help because models learn fraud patterns from data rather than relying only on fixed rules.
Logistic Regression
Logistic regression is a classical supervised learning classifier that is often used in data mining, disease diagnosis, and economic prediction. Its output is the predicted probability that an observation belongs to a class.
Types of Logistic Regression:
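Binomial Logistic Regression: the target variable has two possible classes, such as fraudulent or legitimate.
Multinomial: the target variable has three or more classes with no natural order.
Ordinal: the target variable has three or more classes with a natural order, such as low, medium, and high.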
In this lab, we will focus on binomial logistic regression.
Sigmoid function:
The sigmoid function, sigmoid(z) = 1 / (1 + e^(-z)), takes any real number and maps it to a value between 0 and 1. As the input approaches positive infinity, the output approaches 1, and as the input approaches negative infinity, the output approaches 0. If the output is greater than a threshold (usually 0.5), we regard it as 1 and label it "yes"; if the output is less than the threshold, we regard it as 0 and label it "no".
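The following is a minimal sketch of the sigmoid function, the thresholding step, and a one-feature binomial logistic regression built on them. Several parts are assumptions made for illustration only: the function names, the toy transaction amounts and labels, fitting by batch gradient descent with the chosen learning rate and number of epochs, and assigning an output exactly equal to the threshold to class 0. The lab itself may fit the model differently, for example with a library.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number z to a value between 0 and 1."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def to_label(probability: float, threshold: float = 0.5) -> int:
    """Return 1 ("yes") above the threshold, otherwise 0 ("no")."""
    return 1 if probability > threshold else 0


def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit P(y=1 | x) = sigmoid(w * x + b) by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # predicted probability minus true label
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


# Made-up toy data: transaction amount (in hundreds) and label (1 = fraudulent).
amounts = [0.2, 0.5, 0.9, 1.1, 2.8, 2.2, 3.5, 4.2, 4.8, 5.5]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

w, b = fit_logistic(amounts, labels)

for amount in (0.5, 5.0):
    p = sigmoid(w * amount + b)
    print(f"amount={amount}: probability={p:.3f}, label={to_label(p)}")
```

The model produces a probability first and a class label second, so the threshold can be changed from 0.5 without refitting the weights.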
Imbalanced Datasets
Imbalanced datasets typically arise in classification problems where the class distribution is not equal. They are common in everyday applications such as spam detection, credit card fraud detection, and natural disaster detection. The majority class is the class that accounts for a large proportion of the examples.
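To make the idea of a class distribution concrete, the sketch below counts the labels in a made-up dataset and identifies the majority class. The 990-to-10 split is hypothetical and chosen only for illustration. It also shows one common reason imbalance matters: a model that always predicts the majority class scores high accuracy while detecting no fraud at all.

```python
from collections import Counter

# Hypothetical labels: 0 = legitimate, 1 = fraudulent (made-up proportions).
labels = [0] * 990 + [1] * 10

counts = Counter(labels)
total = len(labels)
majority_class, majority_count = counts.most_common(1)[0]

for cls, n in sorted(counts.items()):
    print(f"class {cls}: {n} examples ({n / total:.1%})")
print(f"majority class: {majority_class}")

# Baseline that ignores the input and always predicts the majority class.
predictions = [majority_class] * total
accuracy = sum(p == y for p, y in zip(predictions, labels)) / total

# Fraction of the fraudulent examples that the baseline actually flags.
fraud_caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = fraud_caught / counts[1]

print(f"baseline accuracy: {accuracy:.1%}, fraud detected: {recall:.1%}")
```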