Fitsum Desalegn

f72@umbc.edu

Credit card fraud detection by machine learning and ANN

Summary

Credit cards are among the most common targets of fraud, but they are not the only one. Fraud can occur with any type of credit product, such as personal loans, home loans, and retail credit. For this credit card fraud detection project, I use credit card transactions as the dataset. Arguably, banks and credit card companies should attempt to detect every fraudulent case.

Introduction

Protecting people's income from fraudulent activity is one of the most important tasks today. I hope this work helps detect credit card fraud for individuals, the banking sector, and online retailers. A model that detects whether a transaction is fraudulent can help save a vast amount of money and improve security. The project is also a chance for me to apply data mining analysis to a real-world credit card dataset.

The Goal

The main aim of this project is to work out how to apply machine learning and AI to financial datasets. So far, I have learned supervised and unsupervised machine learning algorithms and ANNs; now is the time to put them into practice on credit card fraud detection, using a selection of algorithms to find the one that gives the best accuracy before a transaction is approved.

GitHub Repository

Data Sources

The credit card fraud data for this project was collected from the Kaggle website in CSV format (https://www.kaggle.com/mlg-ulb/creditcardfraud). The aim is to detect a mere 492 fraudulent transactions out of 284,807 in total. The dataset is highly imbalanced: the positive class (fraud) accounts for only about 0.17% of all transactions. It contains only numerical input variables, which are the result of a PCA transformation.
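To make the imbalance concrete, the short calculation below uses only the two counts quoted above (492 fraudulent transactions out of 284,807). It also shows the accuracy that a trivial model would reach by labelling every transaction as normal, which is why accuracy alone says little on this dataset. The function name is illustrative and not part of the project code.

```python
def class_shares(n_fraud: int, n_total: int) -> tuple[float, float]:
    """Return (fraud share, accuracy of an 'always normal' baseline)."""
    fraud_share = n_fraud / n_total
    baseline_accuracy = (n_total - n_fraud) / n_total
    return fraud_share, baseline_accuracy


# Counts taken from the dataset description above.
fraud_share, baseline_accuracy = class_shares(492, 284_807)
print(f"fraud share: {fraud_share:.3%}")
print(f"always-normal baseline accuracy: {baseline_accuracy:.3%}")
```

A classifier therefore has to beat roughly 99.8% accuracy before its accuracy figure carries any information about fraud.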

Methodology

I plan to separate fraudulent and non-fraudulent transactions by learning a decision boundary in the feature space defined by the input transactions. Each transaction can be represented as a vector of its feature values. I have built binary classifiers using artificial neural networks, logistic regression, and random forests in the Python programming language.

DELIVERABLE II LITERATURE STUDY AND EXPLORATORY DATA ANALYSIS

LITERATURE STUDY

There are many studies on anomaly detection with IDA, LDA and PCA. Multiple models with good accuracy based on ANNs, logistic regression and random forests already exist for this problem.

http://www.questjournals.org/jrhss/papers/vol8-issue2/B08020411.pdf

https://ieeexplore.ieee.org/document/8123782

https://pdfs.semanticscholar.org/0419/c275f05841d87ab9a4c9767a4f997b61a50e.pdf

In one study, credit card fraud detection is treated as a typical uncertain domain, where potential fraud incidents must be detected in real time and tagged before the transaction is accepted or denied. The inclusion of uncertainty aspects impacts all levels of the architecture and logic of an event processing engine. This enables event-driven applications with uncertainty aspects from different domains to be implemented in a generic manner. The preliminary results are encouraging, showing potential benefits from incorporating uncertainty aspects into the domain of credit card fraud.

In another study, the researchers investigate machine learning methods such as Naïve Bayes, logistic regression, and random forest with boosting, and show that they are accurate in detecting fraudulent transactions and minimizing the number of false alerts. Supervised learning algorithms are novel in this literature in terms of the application domain. If these algorithms are applied in a bank's credit card fraud detection system, the probability that a transaction is fraudulent can be predicted soon after it takes place, and a series of anti-fraud strategies can be adopted to protect banks from great losses and reduce risk. Comparing the three methods, they found that a random forest classifier with a boosting technique performs better than the logistic regression and Naïve Bayes methods.

In a further study, the researchers implemented the fraud detection technique used by VISA and MasterCard. The study notes that neural networks are a recent technique used in many areas because of their powerful learning and prediction capabilities, and it applies this capability to credit card fraud detection. Because backpropagation is the most popular learning algorithm for training neural networks, the paper uses a backpropagation network (BPN) for training and then selects the parameters that matter most for making the network as accurate as possible.

Exploratory data analysis

Since nearly all predictors have been anonymized, I decided to focus my EDA on the two non-anonymized predictors: the time and the amount of the transaction. The distribution of the monetary value of the transactions is heavily skewed. Most transactions are relatively small, and only a tiny fraction come even close to the maximum amount.

Observing the data summary

Prepared Data: A process to gather context on the input data, understanding it before pre-processing and cleaning the dataset.
The two columns “amount” and “time” were not normalized. The remaining columns had already been transformed using principal component analysis (PCA).
Oversampling (Using SMOTE): There are only 492 fraudulent transactions, which makes the dataset highly imbalanced, so the minority class is oversampled.
Training and Testing Subset: Because the dataset is imbalanced, many classifiers are biased towards the majority class.
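The idea behind the SMOTE step listed above can be sketched in a few lines of NumPy: each synthetic fraud row is placed on the line segment between a real fraud row and one of its nearest fraud neighbours. This is a simplified illustration on hypothetical toy data, not the implementation used in the project; in practice the `SMOTE` class from the imbalanced-learn library would normally be used, and oversampling is usually applied to the training split only.

```python
import numpy as np


def smote_like_oversample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between a
    random minority row and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                 # starting rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                          # factor in [0, 1)
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])


# Hypothetical toy minority class: 20 rows with 3 features.
toy_rng = np.random.default_rng(42)
X_fraud = toy_rng.normal(size=(20, 3))
X_synthetic = smote_like_oversample(X_fraud, n_new=80)
print(X_synthetic.shape)
```

Because every synthetic row is an interpolation between two real fraud rows, the new points stay inside the region already occupied by the minority class.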

Observing features through line plots

Inspecting the anonymized features showed that the dataset consists of a total of 284,807 entries. The columns are named V1 to V28 and were anonymized because of the sensitive nature of the dataset.

It can be observed from the histogram of the amount that the majority of transacted amounts lie between 0 and 5,000; larger amounts are present, but their counts are very low.

The histogram of the class variable shows a very high imbalance in the target. Class 0 (normal transactions) has more than 250,000 values, compared with 492 for Class 1 (fraudulent transactions).

To create a balanced training dataset, I took all of the fraudulent transactions in the dataset and counted them. Then I randomly selected the same number of non-fraudulent transactions and concatenated the two. After shuffling this newly created dataset, I output the class distribution once more to visualize the difference.
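The balancing steps described above can be sketched with pandas as follows. The toy DataFrame is a hypothetical stand-in for the Kaggle CSV, the label column is assumed to be named `Class` (1 = fraud, 0 = normal), and the function name is illustrative.

```python
import numpy as np
import pandas as pd


def balanced_undersample(df, label_col="Class", seed=0):
    """Keep every fraud row, draw an equal number of normal rows, shuffle."""
    fraud = df[df[label_col] == 1]
    normal = df[df[label_col] == 0].sample(n=len(fraud), random_state=seed)
    balanced = pd.concat([fraud, normal])
    # frac=1 returns all rows in random order.
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)


# Hypothetical stand-in for the real data: 1,000 rows, 10 of them fraud.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Amount": rng.exponential(scale=80.0, size=1000),
    "Class": [1] * 10 + [0] * 990,
})
balanced = balanced_undersample(toy)
print(balanced["Class"].value_counts())
```

The trade-off of this approach is that most normal transactions are discarded, so the balanced set is small compared with the full data.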

Outcomes

● Exploratory data analysis was performed with multiple plots on the relation of input features and their distribution.

● Data preprocessing was performed with Pandas

● A subset of the actual data of 250000 entries was created by taking a window of entries across the fraudulent transactions. A window size of 64 was chosen and can be changed for further studies.

● The data contains only numerical input variables, with no missing values.

Future work

● Logistic regression, random forests and an ANN will be implemented on the credit card fraud detection dataset to find the model that gives the best accuracy before the transaction is approved.

Deliverable III Execution and Interpretation

The project consists of a comparative study of logistic regression, random forests and ANNs to understand the performance of each on credit card fraud detection. For this project, I implemented logistic regression, random forests, and convolutional neural networks. Logistic regression and random forests are implemented using the sklearn library, and the ANNs are implemented with the PyTorch library. A convolutional neural network (CNN) was chosen because CNNs are a special family of ANNs and because the model scales well with the amount of data fed into the system. CNNs are usually applied to 2-dimensional image data for feature extraction and prediction in deep learning, but a 1-dimensional variant can be used to capture trends in time series and financial data, as in the problem at hand.

A logistic regression model was created and trained on the dataset. The following implementation of logistic regression was used, as it has an easy interface for binary logistic regression:

Sklearn logistic regression

L2 regularization (the scikit-learn default) was applied when fitting the model to find the optimal logistic regression coefficients.
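A minimal sketch of fitting scikit-learn's `LogisticRegression` for a binary fraud label is shown below. It runs on synthetic data from `make_classification` as a stand-in for the real transactions, and the sample size, class ratio, split, and `max_iter` value are assumptions chosen for illustration rather than the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fraud data (assumption): 30 features, ~2% positives.
X, y = make_classification(
    n_samples=5000, n_features=30, n_informative=10,
    weights=[0.98, 0.02], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The default settings apply an L2 penalty; max_iter is raised to help convergence.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Probability that each held-out transaction belongs to the positive class.
fraud_probability = log_reg.predict_proba(X_test)[:, 1]
print("coefficients per feature:", log_reg.coef_.shape)
print("test accuracy:", round(log_reg.score(X_test, y_test), 4))
```

The fitted `coef_` array holds one weight per input feature, which is what makes the model easy to interpret.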

A random forest model was created and trained on the dataset, with a total of 10 decision trees in the forest.

The following implementation of random forests was used, as it has an easy interface:

● Sklearn random forests
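As a rough sketch, a forest of 10 trees can be built with scikit-learn's `RandomForestClassifier` as shown below, and a single tree can be inspected through the fitted `estimators_` list. The synthetic data and the `max_depth=3` limit are assumptions made only to keep the printed tree short; they are not settings reported for the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Synthetic stand-in for the fraud data (assumption): 30 features, ~5% positives.
X, y = make_classification(
    n_samples=2000, n_features=30, n_informative=10,
    weights=[0.95, 0.05], random_state=0,
)

# n_estimators=10 matches the 10 decision trees mentioned in the text.
forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

# Each fitted tree is available individually; print the top of the first one.
first_tree = forest.estimators_[0]
print(export_text(first_tree, max_depth=2))
```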

The following diagram shows one of the decision trees generated by the random forest algorithm

Artificial neural networks

A robust convolutional neural network (CNN) was implemented with the help of PyTorch to solve the problem.

It can be seen that there is a total of 29 layers with approximately 3500 trainable parameters that make up the neural network. A graphical representation of the network is given below.
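For readers unfamiliar with 1-dimensional CNNs, the sketch below shows how such a network can treat the feature vector of one transaction as a one-channel sequence. It is a deliberately small, hypothetical architecture (the class name, layer sizes, and the input width of 30 features are all assumptions) and is not the 29-layer network described above.

```python
import torch
from torch import nn


class TinyFraudCNN(nn.Module):
    """Small 1-D CNN over the feature vector of a single transaction."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),  # collapse the feature axis to length 1
        )
        self.classifier = nn.Linear(16, 1)  # one logit: fraud vs. normal

    def forward(self, x):
        x = x.unsqueeze(1)                     # (batch, features) -> (batch, 1, features)
        x = self.features(x).squeeze(-1)       # (batch, 16)
        return self.classifier(x).squeeze(-1)  # (batch,)


model = TinyFraudCNN()
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
logits = model(torch.randn(4, 30))  # batch of 4 hypothetical transactions
print(n_params, tuple(logits.shape))
```

A network of this kind is typically trained with `nn.BCEWithLogitsLoss`, which applies the sigmoid to the single output logit internally.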

Performance Of the model

Logistic Regression

The classification report for logistic regression shows good accuracy, and the ROC curve also indicates good performance.
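A classification report and an ROC score of this kind can be produced with scikit-learn's metrics module, as in the sketch below. It uses synthetic data as a stand-in for the real transactions, so the printed numbers are not the project's results; the class names passed to the report are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fraud data (assumption): ~2% positives.
X, y = make_classification(
    n_samples=5000, n_features=30, n_informative=10,
    weights=[0.98, 0.02], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Hard labels feed the report; scores (probabilities) feed the ROC AUC.
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]

print(classification_report(
    y_test, y_pred, target_names=["normal", "fraud"], zero_division=0
))
auc = roc_auc_score(y_test, y_score)
print("ROC AUC:", round(auc, 3))
```

On imbalanced data, the per-class recall in the report is usually more informative than the overall accuracy line.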

Random Forest

The random forest behaves much like logistic regression: even though the accuracy of the model is high (~100%), the recall score for fraudulent cases is low (~68%). It performs better with more training data, and its speed during testing and application is acceptable.

Artificial neural networks

The neural network finally reaches an accuracy of 95% on the dataset with a very high recall score of 90%; however, it took an hour to get the result. In future work, I intend to improve the accuracy and processing time of real-time financial fraud detection by combining machine learning-based methods with deep artificial neural networks.

The logs from the last iteration of training are given below:

The training loss across the different epochs is plotted in the figure above.

The test accuracies for the different k-fold splits across the epochs are plotted in the figure.
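The report does not say how the k-fold splits were drawn. The sketch below assumes stratified folds, which keep the share of fraud cases roughly equal in every fold and are a common choice for imbalanced data. The toy labels, the number of folds, and the random features are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 20 fraud cases among 1,000 transactions.
y = np.array([1] * 20 + [0] * 980)
X = np.random.default_rng(0).normal(size=(1000, 30))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count the fraud cases that land in the test part of each fold.
fraud_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print(fraud_per_fold)
```

With an unstratified split, a fold could end up with very few fraud cases, which makes its test accuracy and recall unstable.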

Model Output Result Execution

I found the outcome of the project very interesting, because almost every output leads to a concrete result. All three models show very high accuracy, but the CNN takes much longer to run before it produces its final accuracy.

For logistic regression, even though the accuracy of the model is high (~99%), the recall score for fraudulent cases is quite low (~62%). This is a direct effect of the heavy class imbalance present in the dataset, which manifests as an inability to learn the minority class.
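A small worked example shows how high accuracy and low recall can coexist. The confusion-matrix counts below are hypothetical: they assume an evaluation set with the same class counts as the full dataset (284,807 transactions, 492 fraudulent), a model that catches 305 of the frauds (about 62%), and no false alarms.

```python
def accuracy_and_recall(tp: int, fn: int, fp: int, tn: int) -> tuple[float, float]:
    """Accuracy over all cases and recall over the positive (fraud) cases."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn)
    return accuracy, recall


# Hypothetical counts: 305 frauds caught, 187 missed, no false alarms.
tp, fn = 305, 187
fp, tn = 0, 284_807 - 492
accuracy, recall = accuracy_and_recall(tp, fn, fp, tn)
print(f"accuracy: {accuracy:.4%}")
print(f"recall:   {recall:.2%}")
```

Even with 187 frauds missed, accuracy stays above 99.9%, because the missed cases are a tiny share of all transactions.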

What Could Be Done Differently

The following improvements can be made to the existing project:

✓ Different loss functions for logistic regression can be implemented

✓ The number of decision trees can be increased to see the performance of random forests.

✓ The CNN architecture and learning strategy can be experimented with to provide further insights.

✓ A random up/down sampling of the complete dataset can be done to handle class imbalance before training.

✓ Logistic regression provides a very simple, fast, and highly interpretable model for predicting credit card fraud detection.

✓ Random forests provide a better model with medium interpretability and reasonable metrics (accuracy, precision, and recall) at a marginal computation resource.

✓ Artificial neural networks give a more complicated model, on par with random forests, that can be fine-tuned extensively but has very low interpretability. They give the best recall scores, but at a computational cost.

Conclusion

Logistic regression provides a very simple, fast, and highly interpretable model for credit card fraud detection. Even though its accuracy is high (~99%), its recall score for fraudulent cases is low (~62%). This is a direct effect of the heavy class imbalance in the dataset, which manifests as an inability to learn the minority class. Random forests provide a better model, with medium interpretability and reasonable metrics (accuracy, precision, and recall) at a marginal computational cost. The random forest behaves much like logistic regression: its accuracy is very high (~100%), but its recall score for fraudulent cases is low (approx. 68%), again because of the class imbalance.

Artificial neural networks give a more complicated model, on par with random forests, that can be fine-tuned extensively but has very low interpretability. They give the best recall scores, but at a computational cost: the network finally reaches an accuracy of 95% on the dataset with a very high recall score of 97%; however, it took an hour to get the result.

In future work, I aim to improve the accuracy and processing time of real-time financial fraud detection by combining machine learning-based methods with deep artificial neural networks.

References

1, Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow, Third Edition – Includes TensorFlow 2, GANs, and Reinforcement Learning, Sebastian Raschka & Vahid Mirjalili

2, https://www.geeksforgeeks.org/ml-credit-card-fraud-detection/

3, Credit Card Fraud Detection Based on Transaction Behavior, by John Richard D. Kho and Larry A. Vea, Proc. of the 2017 IEEE Region 10 Conference (TENCON), Malaysia, November 5-8, 2017

4, https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

5, https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

6, https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

7, https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

8, https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html

9, https://www.datacamp.com/courses/fraud-detection-in-python

10, https://towardsdatascience.com/how-dbscan-works-and-why-should-i-use-it-443b4a191c80

11, https://scikit-learn.org/stable/modules/generated/sklearn.cluster.

12, https://towardsdatascience.com/pytorch-layer-dimensions-what-sizes-should-they-be-and-why-4265a41e01fd

13, Machine Learning Group – ULB, Credit Card Fraud Detection (2018), Kaggle, https://www.kaggle.com/mlg-ulb/creditcardfraud
