Summary
Banks play a crucial role in market economies. They decide who can get finance and on what terms, and they can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. Credit scoring algorithms, which estimate the probability of default, are the method banks use to determine whether or not a loan should be granted. The goal of this project is to improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.
1 Introduction
Entities apply for loans for various reasons. Individuals may apply for loans to buy a house or finance a large purchase, while organizations might take a loan to grow their businesses. Banks play a crucial role in providing these entities with funds so that markets and society can function properly. The problem is that not every loan application can be approved. To make a decision on a loan application, the bank looks at the applicant's credit history. A number of parameters determine whether an entity is eligible for a loan, and these parameters can be used to predict whether granting that entity a loan is a suitable risk.
Predicting whether a borrower will default on a loan is of vital importance to banks, as default prediction accuracy has a great impact on their profitability. Previous efforts in this domain have applied machine learning to different sets of borrower attributes.
The rest of this blog is organized as follows. Background and related work are covered in Section 2. Data collection is covered in Section 3, and the methodology of the machine learning analysis is outlined in Section 4. This is followed by data preparation and exploration in Sections 5 and 6, with the detailed analysis and final results presented in Sections 7, 8 and 9.
2 Related Work
From a machine learning perspective, loan default prediction (or credit scoring) can be viewed as a binary classification problem. Previous approaches focus on prediction using ensemble methods or fuzzy systems. Neural networks, successfully applied in various fields, have also found application in the default prediction problem. More recently, there have been many machine learning oriented approaches to solving this problem, a significant proportion of which are based on Random Forests and combinations of other ensemble methods.
The work presented here relates to loan default prediction. It involves using a set of parameters for each individual to determine if that individual is likely to default on a loan in the future. An effective way of solving this problem will help banks maximize their profits by reducing the number of defaulters.
Previous work on applying a Fuzzy Simplex Genetic Algorithm generated decision rules for predicting loan defaults in a typical credit institution. This was performed by Oluwarotimi Odeh et al. [1] in 2011. Their empirical results show that repayment capacity and owner's equity have a significant impact on credit default status. The problem with this approach is that various other parameters were not considered when predicting the default status. Machine learning algorithms open avenues to a more comprehensive approach to predicting defaults.
Kyung-Shik Shin et al. [2] investigate an approach that applies support vector machines to the bankruptcy prediction problem. Their work mainly focuses on overcoming the limitations of back-propagation neural networks, which generally perform well in pattern recognition tasks. One of the disadvantages of this approach is the complexity involved in building the classifier; it is a computationally intensive task. When we try to use this approach in a real-world scenario with hundreds of thousands of instances that are mostly numerical in nature, the algorithm struggles to efficiently capture the interactions between features and often gives low accuracy and Kappa statistics (as shown later in this blog).
Junjie Liang [3] devised an approach of applying Random Forests to this problem. The problem with this approach is that the data set was modified to balance the number of “good” and “bad” cases. Loan default data sets are highly skewed: there are many cases in one class and few in the other. This imbalance causes many standard learning algorithms to perform poorly on the minority class. Duplicating each instance of the minority class 16 times to balance the distribution is an inappropriate technique, as it forces the algorithm to overfit to the minority class, and the resulting classifier will not work effectively in a real-world scenario.
To date, there exists no specialized algorithm that copes with both the class imbalance and the large size of the data in loan default prediction. This project tries to address this by using a Random Forests approach.
3 Data Collection
The data set used for this project was obtained from the competition titled “Give Me Some Credit” on kaggle.com. The file, stored in CSV format, contains various parameters such as monthly income, number of dependents, age, and number of open credit lines and loans. Each parameter is described briefly below.
a) Instance Number: A unique identifier for each instance.
b) SeriousDlqin2yrs: The binary target variable that our algorithm predicts. It indicates whether a person experienced 90 days past due delinquency or worse.
c) RevolvingUtilizationOfUnsecuredLines: Total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of credit limits. Expressed as a percentage.
d) Age: The age of the borrower in years. It is of integer type; this column did not contain any missing data.
e) NumberOfTime30-59DaysPastDueNotWorse: The number of times the borrower has been 30-59 days past due, but no worse, in the last 2 years.
f) DebtRatio: Expressed as a percentage. It is obtained by dividing the sum of monthly debt payments, alimony and living costs by monthly gross income.
g) MonthlyIncome: The monthly income of the individual.
h) NumberOfOpenCreditLinesAndLoans: The number of open loans (such as car loans or house loans) and lines of credit (e.g., credit cards).
i) NumberOfTimes90DaysLate: The number of times the individual was 90 days or more late in paying their bills.
j) NumberRealEstateLoansOrLines: The number of mortgage and real estate loans, including home equity lines of credit, the individual has taken.
k) NumberOfTime60-89DaysPastDueNotWorse: The number of times the borrower has been 60-89 days past due, but no worse, in the last 2 years.
l) NumberOfDependents: The number of dependents in the borrower's family, excluding the borrower.
4 Procedure Outline
The objective of this project is to train a classifier that can predict whether an individual will experience financial distress in the next two years, given the set of attributes listed above. This section outlines that process. Data preparation and feature selection are outlined in Section 5, which details the process of creating and selecting features and discusses the division of the data into development, cross-validation and holdout sets. In Section 6, exploratory data analysis is performed on the development data set. Section 7 presents a baseline performance using Random Forests with default settings and ten-fold cross-validation on the cross-validation data set. Based on error analysis results, feature space redesign is performed in Section 8, which includes a comparison of baseline and optimized performance. Finally, the optimized model is trained on both the cross-validation and development sets and used to classify the instances in the holdout data set. The results of this are presented in Section 9.
5 Data Preparation
I divided the data into three sets: the development set, used for data exploration; the cross-validation set; and the holdout set, to be used after optimization.
The data set obtained from kaggle.com had 150,000 unique instances. I used 15% (22,500 instances) of the data for the development set and another 15% (22,500 instances) for the holdout set. The remaining 70% (105,000 instances) formed the cross-validation set. The splits were made at these points to ensure that enough data was available to build an effective classifier that does not overfit to the existing data and also works well on unseen data. Another reason for using a cross-validation set rather than a single train-test pair is the skewed nature of the data.
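For readers who want to reproduce a similar split with the Weka Java API, here is a minimal sketch. The file name cs-training.csv and the random seed are illustrative, not necessarily the exact values I used.

```java
import java.util.Random;

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SplitData {
    public static void main(String[] args) throws Exception {
        // Load the full Kaggle data set (file name is illustrative).
        Instances data = new DataSource("cs-training.csv").getDataSet();

        // Shuffle with a fixed seed so the split is reproducible.
        data.randomize(new Random(42));

        int n = data.numInstances();          // 150,000 in the original data
        int devSize = (int) (0.15 * n);       // 15% development set
        int holdSize = (int) (0.15 * n);      // 15% holdout set
        int cvSize = n - devSize - holdSize;  // remaining 70% cross-validation set

        // Instances(source, first, toCopy) copies a contiguous block of rows.
        Instances dev = new Instances(data, 0, devSize);
        Instances holdout = new Instances(data, devSize, holdSize);
        Instances crossVal = new Instances(data, devSize + holdSize, cvSize);

        System.out.println("dev=" + dev.numInstances()
                + " holdout=" + holdout.numInstances()
                + " cv=" + crossVal.numInstances());
    }
}
```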
The next step was to prepare the data for use in Weka. The class column in the data set was populated with ones and zeros. As Weka cannot use a numeric class for classification tasks, I converted it to nominal values. The ID column does not help the algorithm classify instances, so it was removed. As the data was taken from real-world sources, it is expected to contain errors, and I tried to identify and fix them. For example, some instances had 0 as the value for Age; such entries were replaced with the median age.
Some of the quantitative values within the data set were actually coded values with qualitative meanings. For example, under the column NumberOfTime30-59DaysPastDueNotWorse, a value of 96 represents “Others”, while a value of 98 represents “Refused to say”. These values need to be replaced so that their large quantitative values do not skew the entire data set.
Some entries in the data set were listed as ‘NA’. For features where it made sense to impute the data using the median value (e.g., MonthlyIncome), this technique was used. For other columns, I had to replace ‘NA’ with ‘?’ so that Weka would accept it as a missing value.
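These cleaning steps can also be scripted against the Weka API. The sketch below assumes the ‘NA’ strings have already been replaced with ‘?’ so Weka reads them as missing, and that the ID column is the first attribute; the 96/98 recoding is only indicated in a comment since the exact replacement is a design choice.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericToNominal;
import weka.filters.unsupervised.attribute.Remove;

public class CleanData {
    public static void main(String[] args) throws Exception {
        // Assumes 'NA' has already been replaced with '?' in the CSV.
        Instances data = new DataSource("cs-training.csv").getDataSet();

        // 1. Drop the ID column (assumed to be the first attribute).
        Remove dropId = new Remove();
        dropId.setAttributeIndices("1");
        dropId.setInputFormat(data);
        data = Filter.useFilter(data, dropId);

        // 2. Convert the 0/1 class column (now the first attribute) to nominal
        //    so Weka treats this as a classification problem.
        NumericToNominal toNominal = new NumericToNominal();
        toNominal.setAttributeIndices("1");
        toNominal.setInputFormat(data);
        data = Filter.useFilter(data, toNominal);
        data.setClassIndex(0);

        // 3. Replace implausible ages (0) with the median age and
        //    median-impute missing MonthlyIncome values.
        int ageIdx = data.attribute("age").index();
        int incomeIdx = data.attribute("MonthlyIncome").index();
        double medianAge = median(data, ageIdx);
        double medianIncome = median(data, incomeIdx);
        for (int i = 0; i < data.numInstances(); i++) {
            if (data.instance(i).value(ageIdx) == 0) {
                data.instance(i).setValue(ageIdx, medianAge);
            }
            if (data.instance(i).isMissing(incomeIdx)) {
                data.instance(i).setValue(incomeIdx, medianIncome);
            }
            // Coded values such as 96/98 in the past-due columns would be
            // recoded or set to missing in this same loop.
        }
    }

    // Median of the non-missing values of one numeric attribute.
    static double median(Instances data, int attIdx) {
        java.util.ArrayList<Double> vals = new java.util.ArrayList<>();
        for (int i = 0; i < data.numInstances(); i++) {
            if (!data.instance(i).isMissing(attIdx)) {
                vals.add(data.instance(i).value(attIdx));
            }
        }
        java.util.Collections.sort(vals);
        return vals.get(vals.size() / 2);
    }
}
```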
6 Data Exploration
I performed exploratory data analysis using the development set. This involved manually examining instances and seeing how the class values might be predicted. In addition, feature selection was an iterative process, closely tied to the exploratory data analysis.
To gain an initial understanding of what sort of performance I might expect, I ran several machine learning algorithms in Weka to predict the class values, using ten-fold cross-validation on the development data set. This provided a sanity check on the changes I made to the data set and helped confirm my choice of Random Forests as my machine learning algorithm. After going through a couple of research papers that used this technique for similar problems, I had hypothesized that Random Forests would be the best choice for my task. The results of the different algorithms are presented in the performance table below; the best results were obtained using Random Forests.
The next step in my exploratory data analysis was to look at the distribution of values for all attributes.
Table 1: Basic statistics of attributes before MonthlyIncome was imputed.
Below is a table with the performance data of different algorithms I tried out on the development set.
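The same comparison can be run programmatically rather than through the Explorer GUI. The sketch below is illustrative: the classifier list is an assumption about what such a table might contain, and dev.arff stands in for the saved development set.

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.rules.OneR;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances dev = new DataSource("dev.arff").getDataSet();
        dev.setClassIndex(0);  // SeriousDlqin2yrs assumed to be the first attribute

        Classifier[] candidates = {
            new OneR(), new NaiveBayes(), new J48(), new RandomForest()
        };
        for (Classifier c : candidates) {
            // Ten-fold cross-validation on the development set only.
            Evaluation eval = new Evaluation(dev);
            eval.crossValidateModel(c, dev, 10, new Random(1));
            System.out.printf("%-15s accuracy=%.3f kappa=%.3f%n",
                    c.getClass().getSimpleName(),
                    eval.pctCorrect() / 100.0, eval.kappa());
        }
    }
}
```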
I chose to proceed with Random Forests. To understand the data better, I started off by examining the distribution of age in the development data.
Figure 1: Distribution of age
As expected, age is approximately normally distributed, following a bell curve.
The figures below show how MonthlyIncome and RevolvingUtilizationOfUnsecuredLines were distributed in the data set after discretization using the MDL technique.
No single bin shows a sharp distinction between the two classes, but these bins may still help the algorithms make better decisions.
7 Baseline Performance
I performed a baseline analysis using Weka's default settings for Random Forests. The models discussed in this section were built using ten-fold cross-validation on the cross-validation data set.
The following is the confusion matrix for the model built with the default Random Forests settings.
The accuracy and Kappa values obtained were 0.93 and 0.22. We would ideally like to minimize the number of instances in the bottom left corner of the confusion matrix, i.e. the false negatives: wrongly predicting that an individual won't experience financial distress incurs significant losses for financial institutions.
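As an aside, the Kappa statistic reported throughout is Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and p_e is the accuracy expected by chance given the class proportions. Because roughly 93% of the instances belong to the majority ("no distress") class, a classifier that always predicts that class already reaches about 0.93 accuracy but a Kappa of 0, which is why Kappa is the more informative measure for this skewed data set.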
8 Feature Space Re-design
The next step was to see how to improve on this baseline. I trained a new model with the feature table extracted from the cross-validation data and evaluated it using the development set as a supplied test set. I then explored the results to see how the algorithm tries to make sense of the attributes, and to figure out whether any change in the feature space representation could improve the results. The attribute MonthlyIncome was the most confusing feature, as it had the highest “Horizontal Absolute Difference”. As this attribute had a long range of continuous values, I decided to discretize it using Fayyad & Irani's Minimum Description Length (MDL) method. I then extracted a new feature table and trained a new model using the discretized cross-validation set.
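A sketch of this step with the Weka API is shown below. The supervised Discretize filter applies Fayyad & Irani's MDL criterion by default; the file names and the attribute lookup are illustrative. The bins are learned on the cross-validation set, and batch filtering applies the same bins to the development (test) set.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;

public class DiscretizeMonthlyIncome {
    public static void main(String[] args) throws Exception {
        Instances cv = new DataSource("crossval.arff").getDataSet();
        Instances dev = new DataSource("dev.arff").getDataSet();
        cv.setClassIndex(0);
        dev.setClassIndex(0);

        // Supervised discretization (Fayyad & Irani MDL) of MonthlyIncome only.
        Discretize mdl = new Discretize();
        int incomeIdx = cv.attribute("MonthlyIncome").index() + 1; // 1-based for the filter
        mdl.setAttributeIndices(String.valueOf(incomeIdx));
        mdl.setInputFormat(cv);                         // bins learned on the cross-validation set
        Instances cvDisc = Filter.useFilter(cv, mdl);
        Instances devDisc = Filter.useFilter(dev, mdl); // same bins applied to the dev (test) set

        RandomForest rf = new RandomForest();           // default settings
        rf.buildClassifier(cvDisc);

        // Evaluate with the development set supplied as a test set.
        Evaluation eval = new Evaluation(cvDisc);
        eval.evaluateModel(rf, devDisc);
        System.out.println(eval.toMatrixString());
        System.out.printf("accuracy=%.3f kappa=%.3f%n",
                eval.pctCorrect() / 100.0, eval.kappa());
    }
}
```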
With this approach, I obtained an accuracy of 0.932 and a Kappa of 0.24. I then performed a t-test against the baseline model. The p-value of 0.027 indicates that this is a statistically significant improvement over the baseline. The updated confusion matrix is now:
This small change of discretizing a single attribute prevents over 100 bad loans from being approved.
I then tried applying the same MDL discretization approach to the other attributes. Individually, almost all of them gave a small increase in Kappa, with the accuracy remaining the same. However, when all of them were combined, the outcome was rather disappointing: the Kappa worsened.
9 Final Result
To generate a final result, the cross-validation and development sets were combined into a single training set. The holdout set, which had so far been left unused, served as the test set. Using the default settings for Random Forests, I was able to correctly classify 93.6% of the instances, with a Kappa value of 0.247. Using OneR, 93.1% of the instances were correctly classified, with a Kappa value of 0.07. A t-test between these two results using the Weka Experimenter shows a significant improvement in both the Kappa value and the percentage of correctly classified instances.
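Programmatically, this final evaluation looks roughly like the sketch below. The file names are illustrative, and the two training sets are assumed to share the same header after the preparation described earlier.

```java
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FinalEvaluation {
    public static void main(String[] args) throws Exception {
        Instances cv = new DataSource("crossval.arff").getDataSet();
        Instances dev = new DataSource("dev.arff").getDataSet();
        Instances holdout = new DataSource("holdout.arff").getDataSet();

        // Merge the cross-validation and development sets into one training set.
        Instances train = new Instances(cv);
        for (int i = 0; i < dev.numInstances(); i++) {
            train.add(dev.instance(i));
        }
        train.setClassIndex(0);
        holdout.setClassIndex(0);

        for (Classifier c : new Classifier[] { new RandomForest(), new OneR() }) {
            c.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(c, holdout);   // the previously unused holdout set
            System.out.printf("%-15s accuracy=%.3f kappa=%.3f%n",
                    c.getClass().getSimpleName(),
                    eval.pctCorrect() / 100.0, eval.kappa());
        }
    }
}
```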
My next step was to optimize Random Forests, i.e. to find the parameter values that would result in the best performance of the algorithm on my data set. The parameter I chose to optimize was 'I', the number-of-trees parameter. The default setting for 'I' in Weka was 10. I changed this to 20 initially to check for improvements, and it improved the Kappa by a whole percentage point. I was curious to find the maximum to which I could usefully increase 'I'. CVParameterSelection in Weka can be used with the cross-validation set to find the optimal setting for 'I'. Unfortunately, as the number of trees increased, the time taken to build the model also increased significantly, and my efforts to parallelize this approach failed. Tuning the 'I' parameter is nevertheless likely to increase the Kappa values significantly.
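The parameter search itself can be set up as in the sketch below; the search range of 10 to 50 trees in 5 steps is an illustrative choice, not a range I actually completed.

```java
import weka.classifiers.meta.CVParameterSelection;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class TuneNumTrees {
    public static void main(String[] args) throws Exception {
        Instances cv = new DataSource("crossval.arff").getDataSet();
        cv.setClassIndex(0);

        // Search the number-of-trees parameter 'I' from 10 to 50 in 5 steps,
        // using internal cross-validation on the cross-validation set.
        CVParameterSelection search = new CVParameterSelection();
        search.setClassifier(new RandomForest());
        search.addCVParameter("I 10 50 5");
        search.buildClassifier(cv);

        System.out.println("Best options: "
                + Utils.joinOptions(search.getBestClassifierOptions()));
    }
}
```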
References
[1] Odeh O, Koduru P, Featherstone A, Das S, Welch SM. A multi-objective approach for the prediction of loan defaults. Expert Systems with Applications, 2011.
[2] Shin KS, Lee TS, Kim HJ. An application of support vector machines in bankruptcy prediction model. Expert Systems with Applications, 2005.
[3] Liang J. Predicting borrowers’ chance of defaulting on credit loans. Stanford University, 2011.