Problem Statement
CS 109A | Group # 26 | Fall 2018
Peer-to-peer lending, often abbreviated P2P lending, is a method of debt financing that enables individuals to borrow and lend money without official intermediary institutions such as banks. The modern P2P lending industry began in the United States in 2006 with the birth of Prosper, followed by Lending Club and other lending platforms thereafter. Lending Club is a peer-to-peer lending network that connects borrowers to investors and facilitates the payment of loans ranging from $1,000 to $40,000 for a standard period of 36 months. Given the turbulence in the market, many investors have looked towards P2P lending as an alternative investment instrument to achieve returns. The way Lending Club works is simple: an individual or business seeking a loan submits an application to Lending Club containing numerous predictive characteristics. Lending Club then uses a proprietary algorithm to approve loans and place the approved loans into "buckets" called "grade" and "sub-grade". Investors review the loan applications and make decisions based on the risk analysis provided by Lending Club or on their own strategies. Our project aims to define one such strategy that is superior to that of Lending Club.
Despite being an “Equal Housing Lender”, Lending Club is the site of possible discrimination or unfair lending practices. Studies show that the federal Fair Housing Act has failed in the past to stop discrimination with regard to race, color, religion, national origin, sex, handicap, or familial status when approving loans. Our project intends to explore, through data-driven predictive analysis, whether Lending Club's strategy facilitates this kind of discrimination. At the same time, we would like our investment strategy to constrain our loan status prediction model for fairness and interpretability.
Our main problem statement is to
“predict whether a loan will be "Fully Paid" or "Charged Off", compute the ROI, and constrain that model for fairness and interpretability without substantial losses to the efficacy of the investment strategy”
Thus, we aim to deliver a data-driven predictive model that powers an investment strategy.
The discrete grade and sub-grade model that Lending Club uses intrigued our group. While we understood the need for interest rates to vary with the risk of loan default, the lines drawn between buckets seemed quite arbitrary. Clearly, not every loan in a bucket (take sub-grade "A2", for example) has exactly the same risk, which is why assigning them all the same rate of interest doesn't quite make sense.
Moreover, the very best loans in sub-grade "A2" and the very worst loans in sub-grade "A1" are likely quite similar in risk; however, Lending Club has placed them into different buckets. This led our group to question why, if two loans are similar, one can have an interest rate of 5.32% while the other has a 6.49% rate for nearly the same risk. We believed that if we could create a continuous model for assessing loan risk, we could identify the safest loans in each bucket. Thus, we could get the same interest rate while decreasing our default risk, increasing returns over the bucket approach provided by Lending Club.
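To make the "safest loans in each bucket" idea concrete, here is a minimal sketch in Python. It is not the project's actual code: it assumes some model has already produced a predicted default probability for each loan, and the field names `sub_grade` and `p_default`, the helper `safest_per_bucket`, the `keep_fraction` parameter, and the sample loans are all hypothetical.

```python
from collections import defaultdict


def safest_per_bucket(loans, keep_fraction=0.5):
    """Keep the lowest-risk share of loans inside each sub-grade.

    `loans` is a list of dicts with the hypothetical keys "sub_grade" and
    "p_default" (a model's predicted probability of charge-off).
    """
    # Group loans by sub-grade, since loans in one bucket share a rate.
    buckets = defaultdict(list)
    for loan in loans:
        buckets[loan["sub_grade"]].append(loan)

    selected = []
    for sub_grade in sorted(buckets):
        # Rank loans within the bucket from safest to riskiest.
        ranked = sorted(buckets[sub_grade], key=lambda loan: loan["p_default"])
        # Always keep at least one loan per bucket.
        n_keep = max(1, int(len(ranked) * keep_fraction))
        selected.extend(ranked[:n_keep])
    return selected


# Made-up loans for illustration only: the interest rate is the same within
# a sub-grade, but the predicted risk differs.
loans = [
    {"id": 1, "sub_grade": "A1", "p_default": 0.02},
    {"id": 2, "sub_grade": "A1", "p_default": 0.05},
    {"id": 3, "sub_grade": "A2", "p_default": 0.03},
    {"id": 4, "sub_grade": "A2", "p_default": 0.08},
]
picked = safest_per_bucket(loans, keep_fraction=0.5)
picked_ids = [loan["id"] for loan in picked]
```

With these made-up numbers, loan 3 in "A2" has a lower predicted risk than loan 2 in "A1", which is the kind of overlap between adjacent buckets that a continuous risk score can expose.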
This strategy required three primary phases: feature selection, building a predictive model, and generating an ROI for loan selection. Fortunately, Lending Club publishes all of its historical loan data, along with 100+ characteristics of the party receiving the loan and the loan status. The first step of the strategy, feature selection, involved identifying which of the provided characteristics were most indicative of whether a loan would be paid in full or charged off. The next step, creating the model, required taking those features and using multiple machine learning techniques to predict a loan's risk of default. The last step is ROI generation. All sub-grades of loans that have a loan status of either "Fully Paid" (True) or "Charged Off" (False) are selected. 15.68% of the loans in the dataset are classified as False (Charged Off). The goal is to select loans such that the failure rate is less than this threshold. We may reject many good loans in the process; however, we do not want to select bad ones.
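The selection goal can be illustrated with a short Python sketch. This is an assumption-laden example rather than the project's implementation: the keys `p_default` and `fully_paid`, the helpers `charge_off_rate` and `select_below_cutoff`, the 0.20 cutoff, and the sample loans are all hypothetical. Only the 15.68% baseline comes from the text.

```python
# Share of charged-off loans in the dataset, as reported in the text.
BASELINE_CHARGE_OFF_RATE = 0.1568


def charge_off_rate(loans):
    """Fraction of loans whose status is False (charged off)."""
    if not loans:
        return 0.0
    charged_off = sum(1 for loan in loans if not loan["fully_paid"])
    return charged_off / len(loans)


def select_below_cutoff(loans, cutoff):
    """Keep loans whose predicted default probability is under `cutoff`."""
    return [loan for loan in loans if loan["p_default"] < cutoff]


# Made-up historical loans: a hypothetical model score plus the true outcome.
history = [
    {"p_default": 0.05, "fully_paid": True},
    {"p_default": 0.10, "fully_paid": True},
    {"p_default": 0.12, "fully_paid": True},
    {"p_default": 0.30, "fully_paid": False},
    {"p_default": 0.40, "fully_paid": True},
    {"p_default": 0.60, "fully_paid": False},
]

selected = select_below_cutoff(history, cutoff=0.20)
selected_rate = charge_off_rate(selected)
# The strategy succeeds if the selected loans fail less often than baseline.
meets_goal = selected_rate < BASELINE_CHARGE_OFF_RATE
```

Note that the loan scored 0.40 was in fact fully paid but is still rejected. This is the trade-off described above: a conservative cutoff turns away some good loans in order to avoid the bad ones.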