Anna Soloveva - LendingClub Loan Data Analytics

LendingClub Loan Data Analytics

The Goal of the Project & Business Understanding

This project applies machine learning techniques to LendingClub loan data. LendingClub is a peer-to-peer lending platform that has been bringing borrowers and lenders together since 2007. It offers various loan products to interested parties through its proprietary technology platform. The platform automates the main aspects of the borrowing process, such as data and application processing, decision generation, loan funding, and compliance with regulations. Borrowers use a fast, easy-to-navigate online app to apply for a loan, make payments, and monitor their borrowings.

Along with loan products, LendingClub offers investment products. It gives investors access to a consumer credit asset class through whole loan sales, securitization, CLUB Certificates, and notes. The scope of this project is to build machine learning models that predict the interest rate and the grade of a loan.

Data Understanding

LendingClub makes loan data publicly available to investors on its website. For this project, the data covers the first through third quarters of 2019. The original data has 664,031 observations and 150 variables. However, because of time constraints, only 10,000 loan applications were used.

Some of the variables in the data set are Loan Amount, Funded Amount, Term, Interest Rate, Installment, Grade, Sub Grade, and Employment Status. Figure 1 shows the average interest rate by loan grade. The average interest rate increases from 7 to 30 percent as the grade decreases from A to G. Figure 2 shows the average loan amount by loan grade. Grade B has the highest average loan amount, followed by grade A.

Figure 3 shows the average revolving utilization by loan grade. The higher the revolving utilization percentage, the lower the loan grade. Figure 4 shows the relationship between average income and grade: the greater the income, the higher the grade. For example, grades at the beginning of the alphabet are associated with a higher average income than grades C and D.
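The per-grade averages behind Figures 1 to 4 amount to a group-by-mean over the loan records. The following is a minimal sketch in standard-library Python, not the code used in the project. The function name, the field names (grade, int_rate), and the sample records are assumptions made up for illustration.

```python
from collections import defaultdict


def mean_by_grade(rows, value_key, grade_key="grade"):
    """Return {grade: mean of value_key}, skipping rows with missing fields."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        grade = row.get(grade_key)
        value = row.get(value_key)
        if grade is None or value is None:
            continue  # ignore incomplete records
        totals[grade] += value
        counts[grade] += 1
    # Sort the grades so the result reads A, B, C, ...
    return {grade: totals[grade] / counts[grade] for grade in sorted(totals)}


# Made-up records for illustration only; these are not LendingClub data.
sample_loans = [
    {"grade": "B", "int_rate": 0.11},
    {"grade": "A", "int_rate": 0.07},
    {"grade": "A", "int_rate": 0.09},
]

if __name__ == "__main__":
    print(mean_by_grade(sample_loans, "int_rate"))
```

The same function could be called with a different value_key (for example a loan amount or income field) to produce the averages for the other figures.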

Data Preparation

The first step of the data preparation phase was to append the loan application data for the three quarters of 2019. Next, the data was filtered to keep the first 10,000 observations. Note that this filtering method might not be the best way to sample the data; however, it does not interfere with the goal of the project. Afterwards, the original data set was removed to free up space in the R environment. In the next step, the regression variables were converted to the appropriate data types. For instance, the interest rate and revolving credit utilization variables were converted from percentages into numeric values. Finally, for KNN and Neural Networks, we removed observations with “NA” values from both the training and the test data sets.
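The preparation steps described above can be sketched as follows. This is an illustrative outline in standard-library Python, not the project's R code. The field names int_rate and revol_util, the helper names, and the choice to express percentages as fractions (13.56% becomes 0.1356) are all assumptions.

```python
def percent_to_float(text):
    """Convert a string such as ' 13.56%' to 0.1356; return None if missing."""
    if text is None:
        return None
    cleaned = str(text).strip()
    if cleaned in ("", "NA"):
        return None
    return float(cleaned.rstrip("%")) / 100.0


def prepare(quarters, limit=10_000, percent_fields=("int_rate", "revol_util")):
    """Append the quarterly data sets, keep the first rows, convert percentages."""
    combined = [row for quarter in quarters for row in quarter]  # append quarters
    subset = combined[:limit]  # first `limit` observations, not a random sample
    prepared = []
    for row in subset:
        converted = dict(row)  # copy so the input rows are left untouched
        for field in percent_fields:
            converted[field] = percent_to_float(row.get(field))
        prepared.append(converted)
    return prepared


def drop_missing(rows, fields):
    """Keep only the rows in which every listed field has a value."""
    return [row for row in rows if all(row.get(f) is not None for f in fields)]
```

Here drop_missing plays the role of removing the “NA” observations before fitting KNN or a neural network; it would be applied to the training and the test data sets separately.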

Data Modeling & Assessment

Three types of machine learning techniques (Linear Regression, K-Nearest Neighbors, and Artificial Neural Networks) were built with different configurations to predict the interest rate and grade of a loan. Linear Regression and ANN were used to predict the interest rate of a loan application, while KNN was used to classify the application into a grade.
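To illustrate how KNN assigns a grade, the sketch below classifies a query point by majority vote among its k nearest training points under Euclidean distance. It is a generic textbook version in standard-library Python, not the configuration used in the project. The function name and the toy feature vectors are assumptions.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    if k < 1 or k > len(train_features):
        raise ValueError("k must be between 1 and the number of training points")
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(features, query), label)
        for features, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy, made-up feature vectors (already on comparable scales) and grades.
toy_features = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.85, 0.95)]
toy_grades = ["A", "A", "D", "D", "D"]

if __name__ == "__main__":
    print(knn_predict(toy_features, toy_grades, (0.15, 0.15), k=3))
```

Because the vote depends on distances, explanatory variables measured on very different scales (for example income and utilization) are usually rescaled before KNN is applied.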

Linear Regression and ANN were assessed by comparing their testing mean squared error (MSE). Linear Regression Model 3, with an MSE of 0.0015, can be compared with ANN Model 3, with an MSE of 0.0025, because they use similar explanatory variables to predict the interest rate. The Linear Regression model has the lower MSE, which means it predicts better in this case. Because the KNN models have a different target, grade, they cannot be compared with the Linear Regression or ANN models. Among the KNN models with different configurations, Model 4, with a K value of ten and four explanatory variables, has the highest accuracy, which implies a greater ability to classify a loan application into the appropriate grade.
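The two assessment metrics mentioned above can be written down in a few lines. This is a minimal sketch in standard-library Python with hypothetical function names; it shows the standard definitions of testing MSE and classification accuracy, not the project's own evaluation code.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def accuracy(actual, predicted):
    """Share of predictions that match the actual class label."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)
```

MSE applies to the numeric interest rate predictions from Linear Regression and ANN, where lower is better. Accuracy applies to the grade labels predicted by KNN, where higher is better. This difference in targets and metrics is why the two groups of models are not compared directly.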

Authors: Anna Soloveva, Hayley Perry, Volodymyr Medin

References

About Us: Save with LendingClub. (n.d.). Retrieved from https://www.lendingclub.com/company/about-us
Harrison, O. (2019, July 14). Machine Learning Basics with the K-Nearest Neighbors Algorithm. Retrieved from https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761
Python. (2019, September 19). neuralnet: Train and Test Neural Networks Using R. Retrieved from https://datascienceplus.com/neuralnet-train-and-test-neural-networks-using-r/
john@hranalytics101.com. (2020, April 18). Tutorial: How to Assess Model Accuracy. Retrieved from https://www.hranalytics101.com/how-to-assess-model-accuracy-the-basics/
