This project uses a 25-column dataset of credit card default payments in Taiwan to predict default. The problem statement centers on default payments and compares the predictive accuracy of the probability of default across several data mining techniques.
The goal of the project is to evaluate the predictive accuracy of candidate machine learning algorithms for detecting defaults on clients' credit cards. The dataset will be cleaned and subjected to exploratory analysis to determine correlations between features and the relationships between features and the default decision (Yes or No). The dataset will be split into training and test sets; the former will be used to train logistic regression, neural network, and K-means models, which will then be evaluated on the test set to select the final model.
Table of Contents
PHASE I: Project Pitch, Literature Review, and EDA
PHASE II: EDA & Model Construction
PHASE III: Execution and Interpretation
Dataset
The dataset was collected from the Department of Information Management, Chung Hua University, Taiwan, and the Department of Civil Engineering, Tamkang University, Taiwan. The source URL for the data is:
https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
The dataset contains thirty thousand rows and twenty-five columns. The twenty-five columns are: Id, Limit_Bal, Sex, Education, Marriage, Age, Pay_0, Pay_2, Pay_3, Pay_4, Pay_5, Pay_6, Bill_Amt1, Bill_Amt2, Bill_Amt3, Bill_Amt4, Bill_Amt5, Bill_Amt6, Pay_Amt1, Pay_Amt2, Pay_Amt3, Pay_Amt4, Pay_Amt5, Pay_Amt6, Default Payment Next Month
Methodology
I planned to follow the process summarized below to accomplish this goal:
· Data Cleaning
· Exploratory Data Analysis
· Feature extraction and engineering
· Data Visualization
· Model Training
· Model Evaluation and Testing
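As an illustration, the cleaning and feature/target preparation steps above can be sketched as follows. This is a minimal sketch on a small synthetic stand-in for the real 30,000-row dataset, not the actual file; the column names follow the report, and pandas/NumPy are assumed to be available.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 30,000-row UCI dataset (illustrative columns only).
rng = np.random.default_rng(42)
n = 1_000
df = pd.DataFrame({
    "LIMIT_BAL": rng.integers(10_000, 500_000, n),
    "AGE": rng.integers(21, 70, n),
    "PAY_0": rng.integers(-2, 9, n),
    "BILL_AMT1": rng.integers(0, 300_000, n),
    "PAY_AMT1": rng.integers(0, 50_000, n),
    "DEFAULT": rng.binomial(1, 0.22, n),   # ~22% positive class, as in the report
})

# Data cleaning: drop exact duplicate rows and confirm there are no missing values.
df = df.drop_duplicates()
assert df.isna().sum().sum() == 0

# Feature/target split used throughout the later modeling steps.
X, y = df.drop(columns="DEFAULT"), df["DEFAULT"]
print(X.shape, y.mean().round(2))
```

The same cleaning checks (duplicates, missing values, dtypes) apply unchanged when the real file is loaded with pandas.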
Literature Review
Several studies were reviewed; of these, Yeh and Lien's is the most pertinent to this project. It focuses on detecting financial distress from various credit datasets, where bankruptcy prediction and credit scoring were the essential markers of financial-distress forecasting. A variety of machine learning approaches have been used to detect fraud and predict payment defaults; some of the more common methods are the K-nearest neighbor classifier, Logistic Regression, Discriminant Analysis, the Naïve Bayesian Classifier, Artificial Neural Networks, and Classification Trees.
The two major research questions addressed in this literature are:
1. Is there any difference in classification accuracy among the six data mining techniques?
2. Can the estimated probability of default produced by the data mining techniques represent the true probability of default?
The analyses indicated that the methodology can anticipate default accounts ahead of time, which is cost-effective for a financial institution, and the authors validated it against numerous datasets.
Key Takeaway
The six major classification techniques in data mining (K-nearest neighbor, Logistic Regression, Discriminant Analysis, Naïve Bayesian, Neural Networks, and Classification Trees) and their application to credit scoring were carefully considered in the paper.
The paper also compares the classification performance and the predictive accuracy of the probability of default among the six techniques.
Artificial Neural Networks (ANNs) were found to classify more accurately than the other five methods, so they should be preferred over the other techniques for scoring clients.
Preliminary Exploratory Data Analysis (EDA)
The preliminary exploratory data analysis gave me a clear picture of the dataset, including its size and the data type of each column, and confirmed that the file contains no missing data. I plan to review correlations among the variables and to experiment with visualizations of the full dataset and of subsets. I will use these analyses to refine the machine learning models, one of which will ultimately be selected.
Exploratory Data Analysis (EDA)
The purpose of exploratory data analysis is to perform pattern discovery on the data using summary statistics and graphical representations. Here, I investigated the data by checking for class imbalance, performing correlation analysis, visualizing the features to understand the relationships between them, and preparing the data for model construction. I also plotted a correlation matrix to highlight the feature pairs with the highest absolute correlation.
Observations
All columns are of type int64.
There is no missing data in the entire dataset.
My analysis shows that we are dealing with imbalanced classes.
I discovered outliers in some of the features, so I removed the most extreme 1% of the data before building the algorithms.
Fig1
Frequency Plot for Default Payment
About 22% of the credit card clients default on their next monthly payment, while the remaining 78% meet their monthly credit card payment obligation.
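The proportions behind Fig 1 can be computed and plotted as follows; note that `y` is an illustrative series reproducing the 78/22 split reported above, not the real column.

```python
import matplotlib
matplotlib.use("Agg")            # headless backend so the script runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the DEFAULT_PAYMENT_NEXT_MONTH column (0 = pays, 1 = defaults);
# the 78/22 split mirrors the proportions observed in the report.
y = pd.Series([0] * 7800 + [1] * 2200)

shares = y.value_counts(normalize=True).sort_index()
print(shares)                    # 0 -> 0.78, 1 -> 0.22

# Frequency plot of the target classes (Fig 1).
shares.plot(kind="bar", rot=0, title="Default Payment Next Month")
plt.ylabel("proportion of clients")
plt.savefig("fig1_default_frequency.png")
```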
Fig2
Correlations among variables
Based on the correlation matrix, there is a strong positive correlation between any two monthly payment statuses: in all such cases, the correlation coefficients range from 0.47 to 0.82. For instance, the payment status in September tends to increase with that of May (correlation coefficient, R = 0.51); similarly, the payment status in April correlates positively with that of August (R = 0.58).
In a similar vein, there is an even stronger positive correlation between any two monthly bill statement amounts, with correlation coefficients ranging from 0.80 to 0.95. For instance, the bill statement amount in September tends to increase strongly with that of July (R = 0.89). Likewise, the bill statement amount in June is strongly and positively correlated with that of May (R = 0.94).
Among all the variables, only the payment status variables (PAY_0, PAY_2, PAY_3, PAY_4, PAY_5, and PAY_6) correlate with the target variable (default payment). However, each of the payment status variables has only a very weak positive correlation with default payment, with R values ranging from 0.19 to 0.32.
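A correlation matrix like the one behind Fig 2 can be produced as follows. The PAY_* columns here are synthetic stand-ins constructed to be positively correlated, so the coefficients will not match the real values quoted above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for three payment-status columns, built from a shared
# latent factor so that they are positively correlated with one another.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
df = pd.DataFrame({
    "PAY_0": base + rng.normal(scale=1.0, size=500),
    "PAY_2": base + rng.normal(scale=1.0, size=500),
    "PAY_3": base + rng.normal(scale=1.0, size=500),
})

# Pairwise Pearson correlations (the basis of the heatmap in Fig 2).
corr = df.corr()
print(corr.round(2))

# Pairs with the highest absolute correlation, excluding the diagonal.
pairs = corr.abs().where(~np.eye(len(corr), dtype=bool)).stack()
print(pairs.sort_values(ascending=False).head(3))
```

On the real data, passing `corr` to a heatmap function (e.g. seaborn's `heatmap`) reproduces the matrix visualization.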
Fig3
Distribution Plot for PAY_AMT6
The distribution plot for the amount of previous payment in April 2005 shows that most payment amounts range between $0 and $30,000. Higher amounts are present, but at a much lower count.
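A distribution plot like Fig 3 can be sketched as below; the PAY_AMT6 values are a synthetic right-skewed stand-in (log-normal draws), not the real April 2005 payments.

```python
import matplotlib
matplotlib.use("Agg")            # headless backend
import matplotlib.pyplot as plt
import numpy as np

# Right-skewed stand-in for PAY_AMT6: most payments small, a long upper tail.
rng = np.random.default_rng(1)
pay_amt6 = rng.lognormal(mean=8.5, sigma=1.2, size=5000)

# Share of payments in the 0-30,000 range highlighted in the text.
share_low = (pay_amt6 <= 30_000).mean()
print(f"{share_low:.0%} of payments are at or below 30,000")

# Histogram reproducing the shape of the distribution plot (Fig 3).
plt.hist(pay_amt6, bins=100)
plt.xlabel("PAY_AMT6 (previous payment, April 2005)")
plt.ylabel("count")
plt.savefig("fig3_pay_amt6_distribution.png")
```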
Model Construction Methodology
· Normalize the dataset to reduce the effect of outliers
· Split the dataset into training and test sets
· Train Logistic Regression, Random Forest, Decision Tree, KNeighbors, and GaussianNB classifiers
· Test each model
· Check accuracy
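The steps above can be sketched with scikit-learn. The data here is a synthetic, imbalanced stand-in (about 78% non-default, mirroring the real class balance) rather than the actual credit card features; GaussianNB is scikit-learn's Gaussian naive Bayes classifier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in for the 23 credit card features.
X, y = make_classification(n_samples=2000, n_features=23, weights=[0.78],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Standardize features; the scaler is fit on the training set only to avoid leakage.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# The five classifiers compared in this study.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=0),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "KNeighbors": KNeighborsClassifier(),
    "GaussianNB": GaussianNB(),
}
results = {}
for name, model in models.items():
    results[name] = model.fit(X_train_s, y_train).score(X_test_s, y_test)
    print(f"{name}: test accuracy = {results[name]:.3f}")
```

Fitting the same dictionary of models on the raw (non-standardized) features gives the companion results reported below.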
Model trained with Non-standardized Features
Model trained with standardized Features
Precision-recall curves for DecisionTree
Precision-recall curves for Logistic Regression
Precision-recall curves for KNeighbors
Precision-recall curves for GaussianNB
Precision-recall curves for RandomForest
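A precision-recall curve of the kind shown above can be generated as follows, again on synthetic stand-in data rather than the real credit card features. The curve is built from predicted probabilities, not hard 0/1 predictions.

```python
import matplotlib
matplotlib.use("Agg")            # headless backend
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in data (~78% non-default class).
X, y = make_classification(n_samples=2000, n_features=23, weights=[0.78],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Probabilities of the positive (default) class drive the curve.
proba = clf.predict_proba(X_te)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_te, proba)
pr_auc = auc(recall, precision)
print(f"PR-AUC = {pr_auc:.3f}")

plt.plot(recall, precision)
plt.xlabel("recall")
plt.ylabel("precision")
plt.title("Precision-recall curve (RandomForest)")
plt.savefig("pr_curve_random_forest.png")
```

Substituting any of the other four fitted classifiers for `clf` yields the remaining curves.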
Final Presentation
What could be done better?
The training model in this project could be improved by tuning the hyperparameters of the algorithms, such as the L1/L2 regularization terms and learning rates, across a parameter grid using GridSearchCV.
The development of learning curves for each of the five models would help identify the presence of high bias, high variance, or both, offering further insight into how to improve the learning algorithms.
The use of more high-quality data and additional feature engineering.
The tuning of each classifier's decision threshold (a default value of 0.5 was employed in this study).
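The GridSearchCV tuning and decision-threshold inspection mentioned above could be sketched as below, using logistic regression on synthetic data; the grid values are illustrative assumptions, not the ones from this study.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced stand-in data (~78% majority class).
X, y = make_classification(n_samples=1000, n_features=23, weights=[0.78],
                           random_state=0)

# Illustrative grid over the L1/L2 penalty and regularization strength C.
param_grid = {
    "penalty": ["l1", "l2"],
    "C": [0.01, 0.1, 1.0, 10.0],
}
search = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),  # liblinear supports l1
    param_grid,
    scoring="roc_auc",   # the metric reported in the conclusion
    cv=5,
)
search.fit(X, y)
print("best params:", search.best_params_)
print(f"best CV roc_auc: {search.best_score_:.3f}")

# Moving the decision threshold away from the default 0.5 trades precision
# against recall; here we just inspect the predicted default rate at each cut.
proba = search.predict_proba(X)[:, 1]
for t in (0.3, 0.5, 0.7):
    print(f"threshold {t}: predicted default rate = {(proba >= t).mean():.2f}")
```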
Conclusion
Based on the roc_auc_score and f1_macro values, the Random Forest classifier was the best-performing of the five models studied, for both standardized and non-standardized features.
With a roc_auc_score of 0.76 at a decision threshold of 0.5, the trained Random Forest model is good, in terms of predicted probability, at distinguishing whether a consumer will default on his or her monthly credit card payment.
The micro-averaged f1_score (82%) for the chosen model (the Random Forest classifier) shows relatively good predictability, with room to improve the learning algorithms further.
References
1. Baesens, B., Setiono, R., Mues, C., & Vanthienen, J. (2003). Using neural network rule extraction and decision tables for credit-risk evaluation. Management Science, 49(3), 312–329.
2. Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J., & Vanthienen, J. (2003). Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society, 54(6), 627–635.
3. Berry, M., & Linoff, G. (2000). Mastering data mining: The art and science of customer relationship management. New York: John Wiley & Sons, Inc.
4. Yeh, I.-C., & Lien, C.-H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473–2480.