Conclusion

CS 109A | Lending Club | Group # 26 | Fall 2018

Summary

Lending Club is a peer-to-peer lending network that connects borrowers with investors and facilitates payments between them. Investors review loan applications and decide where to invest based on the risk analysis provided by Lending Club's bucket/grade strategy or on their own strategies. Our project aims to define one such strategy that is superior to Lending Club's.

We set out to design an investment strategy that predicts whether a given loan will end up "Charged Off" or "Fully Paid". To achieve this, we divided the task into three phases: data description, EDA and modeling. Following an extensive literature review, as outlined, we selected a subset of relevant features. Then, using the machine learning algorithms learned in CS109A, we built multiple models, including logistic regression, AdaBoost, random forests, a decision tree classifier, kNN, a multi-layer perceptron and a support vector machine, among others. To evaluate the results, we chose metrics appropriate for a classification problem: precision, recall, F1 score, area under the ROC curve, log-loss and cross-validation accuracy. The Random Forest Classifier was the best performer on a data set stratified by "loan_status", with an area under the ROC curve of 93.63%.

We then wrote a function, "simulate_strategy" (see the results section), that fits a predictive model on a subset of the data chosen according to the investor's criteria (conservative/speculative and term). Its output shows the accuracy of our predictions, an estimate of the return on investment and the data set of loans selected by our model. We also displayed an ROI comparison of our strategy against the default strategy offered by Lending Club.

At first, all models were tested on a 10% sample of the data. Later, we tested our final model (random forest classifier) and our simplest model (logistic regression) on AWS with the full data set. There, we tested for correlations between loan features and census data (see the area marked with a black-bordered rectangle in the plot below) and checked for statistical parity in our model. The correlation is very low, almost non-existent, in our model. We found no discrimination with respect to census data overall, but a few features, namely "Native_pct", "Asian_pct" and "poverty_level", showed little statistical parity. With more detailed census data and greater computational power to run the full data set, this analysis can be continued as future work. On a similar note, we checked whether Lending Club is the site of possible discrimination or unfair lending practices. The correlation is also very low, almost non-existent, for the Lending Club model. This led us to conclude that Lending Club is fair in its investment strategy.
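As a rough illustration of the two checks described above, the sketch below computes a Pearson correlation between one census feature and one loan feature, and a statistical parity difference between two groups. The toy data, the function names and the binary group split are hypothetical and are not taken from our notebooks.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def statistical_parity_difference(predictions, groups):
    """P(pred = 1 | group = 1) - P(pred = 1 | group = 0); 0 means parity."""
    rate = {}
    for g in (0, 1):
        members = [p for p, grp in zip(predictions, groups) if grp == g]
        rate[g] = sum(members) / len(members)
    return rate[1] - rate[0]

# Hypothetical toy data: one census feature and one loan feature per loan.
poverty_level = [0.10, 0.25, 0.15, 0.30, 0.20, 0.05]
int_rate = [11.0, 12.5, 13.0, 11.5, 12.0, 12.2]

# Hypothetical predictions (1 = predicted "Fully Paid") and an assumed
# binary grouping of loans by zip code (1 = high-poverty zip code).
predicted_paid = [1, 0, 1, 1, 0, 1, 1, 0]
high_poverty_zip = [0, 0, 0, 0, 1, 1, 1, 1]

r = pearson(poverty_level, int_rate)
spd = statistical_parity_difference(predicted_paid, high_poverty_zip)
print(f"correlation = {r:.3f}, statistical parity difference = {spd:.2f}")
```

A correlation near 0 and a parity difference near 0 would both point toward the kind of result reported above; how large a difference counts as unfair is a judgment call that this sketch does not settle.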

Our model vs Lending Club model

Our model uses only the loans predicted to be fully paid to calculate ROI, and it provides a continuous model for assessing loan risk. The Lending Club model, in contrast, uses all loans (all five levels of the loan_status variable) to calculate ROI and provides a grade/sub-grade model for assessing loan risk. We achieved an ROI of 15.68, which is slightly lower than the Lending Club ROI of 16.48. Both models show almost negligible correlation between loan and census features.
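The difference between the two ROI calculations can be sketched as follows. This is a minimal example that assumes ROI is defined as (payments received - amount funded) / amount funded, expressed in percent. That definition, the field names and the toy loans are assumptions made for illustration, so the numbers do not reproduce the figures reported above.

```python
def roi_percent(loans):
    """Return on investment, in percent, over a list of loan dicts.

    Assumed definition: 100 * (total received - total funded) / total funded.
    """
    funded = sum(loan["funded_amnt"] for loan in loans)
    received = sum(loan["total_pymnt"] for loan in loans)
    return 100.0 * (received - funded) / funded

# Hypothetical loans; "predicted_paid" stands in for a classifier's output.
loans = [
    {"funded_amnt": 1000.0, "total_pymnt": 1150.0, "predicted_paid": True},
    {"funded_amnt": 2000.0, "total_pymnt": 2300.0, "predicted_paid": True},
    {"funded_amnt": 1000.0, "total_pymnt": 400.0, "predicted_paid": False},
]

# ROI over every loan, regardless of predicted outcome.
roi_all = roi_percent(loans)

# ROI over only the loans the model predicts will be fully paid.
selected = [loan for loan in loans if loan["predicted_paid"]]
roi_selected = roi_percent(selected)

print(f"all loans: {roi_all:.2f}%, selected loans: {roi_selected:.2f}%")
```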

Lending Club Model

Correlation between census and loan features

Our Model

Correlation between census and loan features

Conclusion

Our base model is a random forest, which is also our final model. To decide which model is best, we used the following metrics:

Cross-validation accuracy score: an effective way to assess a model and its generalization power on data held out from training. The closer the score is to 100%, the better the model, though 100% on the training set can indicate overfitting.
F1 score: the harmonic mean of precision and recall. It is used as a statistical measure of performance, reaching its best value at 1 and its worst at 0.
Area under ROC curve: measures how well the classifier separates the loans being tested into "Charged Off" and "Fully Paid". An area of 1 represents a perfect test; an area of 0.5 represents a worthless test.
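To make the last two definitions concrete, here is a small sketch that computes the F1 score and the area under the ROC curve by hand. It assumes labels are encoded as 1 for "Fully Paid" and 0 for "Charged Off"; the encoding, the function names and the toy scores are illustrative only, and the AUC is computed through its pairwise-ranking interpretation rather than with any particular library.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = positive class).

    Assumes at least one predicted positive and one actual positive.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities of "Fully Paid".
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.7, 0.3, 0.55]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} auc={auc:.3f}")
```

The pairwise loop is quadratic in the number of loans, so it is only suitable for small examples like this one.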

The Random Forest Classifier is the best performer, with a CV accuracy of 85.76%, an area under the ROC curve of 93.63% and an F1 score of 82.28%.

Logistic regression, a modeling technique borrowed from statistics, is the first and simplest algorithm one could start with, and we use its performance as a benchmark for more complex models. We see a 16% increase in CV accuracy score and a 0.19 increase in area under the ROC curve from the simplest model to our final model.

We conclude that our continuous model of investment strategy is better than the grade/bucket strategy of Lending Club, and that our ROI is greater than the average ROI.

Comparison between Logistic Regression (Simplest) model and Random Forests (Complex) model on a full data set

Risk Management

Any investment strategy has certain associated risks. Therefore, we took a few measures to minimize our exposure to some of these risks. We were initially concerned about how our model would hold up over multiple years, since we built our models on the 2007-2015 data set provided by Lending Club. Is this snapshot in time indicative of ever-changing market conditions?

Vik Chawla, a lead research associate at Echelon Capital Management, explained that the primary concern with changing market conditions is that this industry has not been around for a full market cycle. We have limited data about the sustainability of the model in the event of a major market downturn. As a result, we focus primarily on 36-month loans rather than 60-month loans. Shorter loans are less likely to overlap with a market downturn, which minimizes our exposure to this uncertainty as much as possible.

Future work

Due to the brief time frame of this project, several areas remain for further exploration and expansion. We propose the following areas of future work:

Acquisition of larger predictor set for rejected loan applications

We trained our model using a somewhat limited set of 9 features from the rejected loan data set. Many predictors available in the accepted loans data set (including, for instance, home ownership status, number of credit inquiries, FICO score, and total number of credit lines) were not available for rejected loan applications. It would be worthwhile to request from Lending Club a rejected loan applications data set containing all of these predictors.

More robust validation of hyperparameters with greater computational power

We trained and tested our models on a random 10% sample of the full data set. Due to a lack of computational power, we ran only our final model on the full data set and could not run all of our models. It would be valuable to repeat our entire analysis on the full data set using AWS or something more powerful.
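One way to draw a 10% sample while preserving the ratio of "Charged Off" to "Fully Paid" loans is sketched below. It assumes the sample is stratified by "loan_status"; the helper name, the seed and the synthetic rows are hypothetical, and this is not necessarily how the sample in our notebooks was drawn.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample `fraction` of the rows from each class of `key`, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[key]].append(row)
    sample = []
    for members in by_class.values():
        # Keep at least one row per class so rare classes are not dropped.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Synthetic data: 200 "Charged Off" and 800 "Fully Paid" loans.
rows = [
    {"id": i, "loan_status": "Charged Off" if i % 5 == 0 else "Fully Paid"}
    for i in range(1000)
]

subset = stratified_sample(rows, "loan_status", 0.10)
charged_off = sum(1 for row in subset if row["loan_status"] == "Charged Off")
print(len(subset), charged_off)
```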

More appropriate census data matching loan applications

The census data we used is an average for each zip code. For instance, if a loan has zip code 123456 and the census says that 39% of residents in 123456 are male, then we can only say that there is a 0.39 chance that the requester was male. We do not know for sure whether the loan applicant was male. The same holds for the other census features used to assess fairness in our project. When we plotted the ratio of rejected applications to accepted loans on a map of the US, North Dakota, Tennessee, Idaho, Nevada and Maine had considerably higher reject ratios than the remaining states. It would be interesting to investigate this discrepancy in light of more detailed census data.
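The per-state reject ratio mentioned above can be computed along these lines. This is only a sketch: the function name is hypothetical and the counts are made-up numbers chosen to show the calculation, not values from the Lending Club data.

```python
def reject_ratio_by_state(rejected_counts, accepted_counts):
    """Ratio of rejected applications to accepted loans for each state.

    States with no accepted loans are skipped to avoid division by zero.
    """
    ratios = {}
    for state, accepted in accepted_counts.items():
        if accepted > 0:
            ratios[state] = rejected_counts.get(state, 0) / accepted
    return ratios

# Made-up counts for three states.
accepted_counts = {"ND": 100, "CA": 5000, "TN": 400}
rejected_counts = {"ND": 1500, "CA": 40000, "TN": 5200}

ratios = reject_ratio_by_state(rejected_counts, accepted_counts)

# States ordered from highest to lowest reject ratio.
ranked = sorted(ratios, key=ratios.get, reverse=True)
print(ranked, ratios)
```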
