This study investigated how accurately fraudulent transactions can be identified when the Synthetic Minority Over-sampling Technique (SMOTE) is used in conjunction with three common machine learning predictive analysis methods. The original dataset consists of 284,807 credit card transactions made by European cardholders over two days in 2013. After SMOTE was used to balance this unbalanced dataset, the Random Forest, Extreme Gradient Boosting, and Logistic Regression methods were applied to the newly balanced dataset with varying hyperparameters. With the best hyperparameters determined for each method, the highest accuracy achieved was 97.47% for Random Forest, 94.67% for Extreme Gradient Boosting, and 98.16% for Logistic Regression. This study showed that SMOTE is effective in balancing unbalanced datasets (see Table 2) and that all three methods, when used with a SMOTE-balanced dataset, are highly accurate, with Logistic Regression being the most accurate of the three. Further research is needed to determine which attributes of a transaction contribute to a good prediction.
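The balancing step summarized above can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between an existing minority sample and one of its nearest minority-class neighbours. This is an illustration under stated assumptions, not the implementation used in the study; the function name `smote`, its parameters, and the toy data are all hypothetical.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority-class neighbours.

    Hypothetical helper for illustration; not the study's code.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (the point itself excluded)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy data (assumed for illustration, not the study's dataset):
# 20 legitimate rows versus 4 fraudulent rows, two features each.
legit = [(float(i), float(i % 5)) for i in range(20)]
fraud = [(1.0, 9.0), (2.0, 8.5), (1.5, 9.5), (2.5, 9.0)]

# Oversample the fraud class until both classes have the same number of rows.
balanced_fraud = fraud + smote(fraud, n_synthetic=len(legit) - len(fraud), k=3)
print(len(legit), len(balanced_fraud))
```

Because every synthetic point is an interpolation between two real minority samples, the new rows stay inside the region already occupied by the minority class rather than duplicating existing rows exactly.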
John Mayer is a current senior, graduating with majors in both Accounting and Business Technology Management. John is from Cleveland, Ohio, and is very interested in the forensic field as it relates to accounting and analytics. Upon graduation in May, John will begin preparing for the CPA exam and will start with Ernst & Young in September as an associate in Forensic and Integrity Services.
Dr. Abhimanyu Gupta had a profound influence on this project: he taught me the methods necessary to pursue this research outside of class and helped me build a working model to test my theory. His constant support and guidance were invaluable to the success of my independent study!