Project Objective

The objective is to build a robust, efficient, scalable, and accurate machine learning model that can detect fraud in credit card transactions.

Audience

This project is intended to benefit credit card holders, credit card companies, governments, the financial sector, data scientists, machine learning engineers, software developers, and researchers and academics (especially those working in the fraud detection domain), among others.

Data

The credit card fraud dataset was downloaded from Kaggle, a website for data science projects, where it is available for download. The dataset is imbalanced: of a total of 284,807 transactions, 284,315 are normal and 492 are fraudulent.
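To make the degree of imbalance concrete, the short sketch below works out the fraud rate from the counts quoted above. The "always-normal" baseline is a hypothetical classifier introduced only for illustration; it shows why plain accuracy is a poor yardstick on data like this.

```python
# Class counts as reported above for the Kaggle credit card fraud dataset.
normal = 284_315
fraud = 492
total = normal + fraud

# Share of transactions that are fraudulent.
fraud_rate = fraud / total

# Accuracy of a hypothetical baseline that labels every transaction "normal".
baseline_accuracy = normal / total

print(f"total transactions: {total}")
print(f"fraud rate: {fraud_rate:.4%}")
print(f"always-normal accuracy: {baseline_accuracy:.4%}")
```

A model that never flags fraud would still be right on more than 99.8% of transactions, which is why the comparison in this project relies on AUC and F1-score instead.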

Models Explored

The machine learning algorithms explored in this project are listed below (three supervised learning models and three unsupervised learning models):

1. Logistic regression

2. Decision trees

3. Random Forest Classifier

4. Isolation Forest

5. Local Outlier Factor

6. One-Class Support Vector Machine Classifier
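As a rough illustration of how the six models above differ in use, the sketch below sets them up with scikit-learn. The choice of library, the synthetic data, and all hyperparameters are assumptions made for this example and are not taken from the project. The main point is that the supervised models are trained with labels, while the unsupervised detectors are fitted without labels and report +1 (inlier) or -1 (outlier), which then has to be mapped to a fraud label.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)


def make_data(n_normal, n_fraud):
    """Synthetic stand-in data (assumption): fraud points are shifted away from normal ones."""
    X = np.vstack([rng.normal(0, 1, (n_normal, 4)), rng.normal(4, 1, (n_fraud, 4))])
    y = np.array([0] * n_normal + [1] * n_fraud)  # 1 = fraud
    return X, y


X_train, y_train = make_data(500, 10)
X_test, y_test = make_data(200, 5)

supervised = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
unsupervised = {
    "Isolation forest": IsolationForest(random_state=0),
    # novelty=True lets LocalOutlierFactor score data it was not fitted on.
    "Local outlier factor": LocalOutlierFactor(novelty=True),
    "One-class SVM": OneClassSVM(gamma="scale"),
}

predictions = {}
for name, model in supervised.items():
    # Supervised models learn from the labels and predict 0/1 directly.
    predictions[name] = model.fit(X_train, y_train).predict(X_test)

for name, model in unsupervised.items():
    # Unsupervised detectors ignore the labels; predict() returns +1 or -1.
    raw = model.fit(X_train).predict(X_test)
    predictions[name] = (raw == -1).astype(int)  # map outlier -> 1 (fraud)

for name, pred in predictions.items():
    print(f"{name}: flagged {int(pred.sum())} of {len(pred)} test transactions")
```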

Abstract

Fraud is a critical issue in society today. Losses due to payment fraud are increasing as e-commerce keeps evolving, and organisations, governments, and individuals have experienced huge losses as a result. Merchant Savvy projects that global losses due to payment fraud will rise to about $40.62 billion in 2027. Among all types of payment fraud, credit card fraud results in higher losses. We therefore intend to leverage the potential of machine learning to address credit card fraud, with an approach that can be generalised to other fraud types. This paper compares the performance of logistic regression, decision trees, random forest classifier, isolation forest, local outlier factor, and one-class support vector machines based on their AUC and F1-score. In a further experiment, the SMOTE technique was applied to handle the imbalanced nature of the data, and the performance of the supervised models on the oversampled data was compared with their performance on the raw data. In the results, the random forest classifier outperformed the other models, with a high AUC score and a good F1-score on both the raw and the oversampled data. Oversampling the data did not change the results of the decision trees. One-class SVM performed better than isolation forest in terms of AUC score but had a very low F1-score compared with isolation forest. Local outlier factor had the poorest performance.
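Because AUC and F1-score measure different things, a model can do well on one and poorly on the other. The sketch below computes both from scratch on made-up scores (the labels, scores, and thresholds are illustrative assumptions, not project results). AUC depends only on how the scores rank fraud above normal transactions, whereas F1 depends on the threshold used to turn scores into decisions.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (fraud, normal) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_score(labels, scores, threshold):
    """F1 for the fraud class after thresholding the scores."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Made-up scores for illustration: 2 fraud cases among 10 transactions.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

auc = roc_auc(labels, scores)  # both fraud cases outrank every normal case
f1_loose = f1_score(labels, scores, threshold=0.25)  # also flags 5 normal cases
f1_tight = f1_score(labels, scores, threshold=0.75)  # flags only the fraud cases

print(f"AUC: {auc:.3f}")
print(f"F1 at threshold 0.25: {f1_loose:.3f}")
print(f"F1 at threshold 0.75: {f1_tight:.3f}")
```

The same scores give a perfect AUC in both cases, while F1 changes with the threshold. This is only a demonstration of how the two metrics behave, not a diagnosis of any particular model in this project.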

Conclusion and Future work

Credit card fraud is an important problem that calls for efficient solutions. An efficient solution would drastically reduce the losses incurred by governments, companies, and individuals. In this paper, we compared supervised and unsupervised models. Based on AUC, logistic regression outperformed all the other models, followed by random forest classifier, decision trees, one-class support vector machine, isolation forest, and local outlier factor.

Oversampling the data did not improve the performance of the models on unseen data. Isolation forest and one-class support vector machines were originally expected to outperform the others but, surprisingly, the supervised models outperformed them. This might be a result of limitations in the data; more informative data might improve the models generally. Obtaining such data is challenging because customer information is confidential.
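For readers unfamiliar with how SMOTE-style oversampling creates new minority samples, the core idea can be sketched in a few lines. This is a simplified illustration written for this summary, not the implementation used in the project; the function name, parameters, and sample points are hypothetical. In practice a library implementation (for example `SMOTE` in imbalanced-learn) would normally be used.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating between neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical 2-D fraud samples; real features would come from the dataset.
fraud_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(fraud_points, n_new=6)

for point in new_points:
    print(tuple(round(v, 3) for v in point))
```

Every synthetic point lies on a line segment between two existing fraud samples, so oversampling adds no information beyond what the original minority samples already contain.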

The random forest classifier performed better than the other models on both the raw data and the oversampled data.

We will try to obtain more informative data or use synthetic data that imitates real-life transactions. Further tuning of the models might also improve their performance.

We will also explore other unsupervised models on a better dataset.

Appreciation

Our profound gratitude goes to the Professor, TAs, and classmates.