Exploratory Data analysis

Description of Data

LendingClub has publicly shared extensive data on all loans issued from 2007-2018. This data includes information on the status of each loan, for example if the loan is fully paid or charged off. The data does not include information on the demographic details of the individuals issued loans, apart from a 3-digit zip code for each applicant. By combining this 3-digit zip code information with the US 2016 Census data, we determined that we can infer someone's race, sex, and age by their 3-digit zip code.

The raw data also includes many (approximately 120) variables related to the financial status of the applicant and the type of loan issued. We used these variables to build our investment strategies. We completed data cleaning to make the data suitable for model construction.

Methods

Comparison of profits from fully paid vs charged off loans

In order to build an investment strategy, we first had to establish that there was variation in the profit made from the loans. We determined that loans were primarily fully paid, charged off, or current. The future profitability of the current loans is uncertain, so we focused on the fully paid and charged off loans. A charged off loan is one which the consumer has become delinquent on the debt and the investor is highly unlikely to ever receive repayment. A profitable investment strategy would distinguish between applications that will go on to be fully paid off rather than charged off.




On the left we plotted a distribution of the profitability of loans marked as fully paid vs charged off. All fully paid loans are profitable, as expected. A minority of charged off loans still led to a profit because they were mostly paid off when the consumer became delinquent. A majority of the charged off loans were not profitable, with a large spread of money lost.


Determining the demography of zip code regions

In order to verify the fairness of an investment strategy, we need to be able to infer someone's membership in a protected class (race, sex, or age) from their geographic location. We calculated the proportion of citizens in each 3-Digit Zip Code who were elderly (>60 years old), male and female, and who belonged to each major racial group.

Below is a series of plots illustrating how much racial demographic variables vary by 3-Digit Zip Code. In particular, it illustrates that White, Black, and Hispanic populations vary substantially by Zip Code.



Below is a series of plots illustrating how much female populations and elderly populations vary by 3-Digit Zip Code. We can see that sex has small variation from one area to the next, while the elderly populations has substantial variation.


After completing this portion of the EDA, we concluded that going forward with our goal of evaluating fairness by race, age, and sex is reasonable. We can make inferences of these identities from the 3-Digit Zip Codes provided in the loan data.

Relationship between financial variables and profitability of loans

We will attempt to build a model that can take into account an applicant's financial variables and predict the profitability of a loan given to that applicant. Before building this model, we first looked at financial variables that we predicted would be related to profitability. It seemed intuitive that loan applicants with higher income and long-time employment would be more likely to pay back their loans in the future, while applicants with high debt-to-income ratios would be less likely to pay back their loans. Below, we plotted the relationship between profitability and these variables to test our intuitions. We found that our intuition was correct for income; applicants with higher income also tended to create more profit. In contrast, the relationship between employment length and profitability was more tenuous, as was the relationship between debt-to-income ratio and profitability. This EDA indicates that our intuitions are limited with regard to the true relationship between the financial variables and profitability. Building models that weight variables according to their relationship to profitability should provide an effective strategy.

Cleaning the data

To clean the data we wrote a series of functions that each performed a different cleaning operation. Below is a list of each function and the operation it performed.

Function: calculate_profit

We calculated the total amount of profit made on each loan by taking into account the total amount of payments received from the consumer and the size of the loan. This is the variable that we will build models to predict.

Function: drop_missing_cols

For each accepted loan, there were 151 variables giving information on the loan itself and on the financial situation of the applicant. We completed a series of cleaning steps in order to prepare this data for our model building. First, some variables did not even have entries for over half of the observations. If we had more information on the inner-workings of how LendingClub collected this information, then we would have a better idea of why these variables are missing. Perhaps they purposefully don't collect every variable in every applicant, or perhaps applicants have a lot of flexibility in which questions they answer. Because we can't infer the cause of the missing variables, we simply decided to remove the ones that were missing entries for the majority of applicants. This reduced the number of variables to 88.


Below is a plot of the proportion of missing values per variable, illustrating the fact that some variables were nearly missing for every observation. Any variables that were above the 50% missing line were removed.


Function: remove_future_cols

Many of the variables depend on information collected after the loan was accepted. Because our goal is to build an investment strategy that could be used at the time of application, we needed to remove these variables as well. Below is a table with each of these variables and their descriptions. This table illustrates that this step drastically reduced the number of variables included in the model.

Function: clean_objects

Many variables were originally stored as objects, so we needed to assess how to reformat them. The plot below shows all variables with less than 20 unique values. We wrote the clean_objects function to appropriately convert each of these object variables. We re-formatted variables with a small number of categorical values using one-hot encoding. Other variables, like employment length, were formatted as continuous variables. This function also removed the zip code information for each loan and output this variable in a separate variable to be used for the demography EDA.

Function: clean_low_unique_values

As can be seen in the plot below, some variables all had the same value and did not have any variance. This function removed these variables.

Function: drop_text_col

This function removed variables with unstructured text responses.

Function: add_state_centroids

Each loan application came with data on the state of residence. We decided to convert this categorical variable into two continuous variables: the latitude and longitude of the state. To do so, we accessed a Kaggle dataset containing state centroids.

Function: mean_imputation

To handle the remaining missing values, we used mean imputation.

Function: add_expected_profit

We added a new variable called expected_profit, which took into account the loan amount, term length, and interest rate, to estimate the amount of profit one could expect from a given loan.

All these functions were combined into a single data cleaning function seen below. More information on each function can be found in the source code.

def clean_data(df, state_centroid):
    df = calculate_profit(df)
    df = drop_missing_cols(df)
    df = remove_future_cols(df)
    df = clean_low_unique_values(df)
    df = drop_text_col(df)
    df, zip_codes = clean_objects(df)
    df = add_state_centroids(df, state_centroid)
    df = mean_imputation(df)
    df = add_expected_profit(df)
    return df, zip_codes