Boston HMDA Machine Learning

Machine Learning for training a Mortgage Approval Algorithm

Varian (2014) revisits the classic mortgage lending discrimination dataset developed by Munnell, Tootell, Browne, and McEneaney (1996) of the Boston Federal Reserve and applies machine learning techniques: conditional inference tree estimation and random forests. The Boston HMDA dataset was originally developed by Munnell et al. (1996) to determine the extent to which racial discrimination featured in loan approval decision making. Here we continue in that vein, but we also observe the rich possibilities for training machine learning algorithms to automate the mortgage approval process. As a host of disruptors, such as Revolut, N26, Robinhood, and Stripe, enter the financial arena, one might wonder to what degree mortgage financing could evolve for consumers. Below, we follow Varian (2014) using mainly R code and then repurpose the HMDA dataset to train and test algorithms that mimic the mortgage approval process.

R Code for Big Data: New Tricks for Econometrics

R code in Rstudio Big Data New Tricks for Econometrics

HMDA Machine Learning with ctree, Logit Modelling and RandomForest

Varian (2014) uses the following snippets of R Script and compares the relative performance of each of the following approaches: (1) cTree (2) Logit (3) RandomForests

Decision trees often suffer from variable selection bias and overfitting. One of the more recent algorithms developed to mitigate this problem is the Conditional Inference Tree (CTREE) of Hothorn, Hornik, and Zeileis (2006), which embeds a form of self-pruning. The CTREE algorithm is considered unbiased because it selects predictors through a "... global null hypothesis of independence between any of the m covariates and the response" (Hothorn et al., 2006, p. 2), then uses statistical hypothesis tests and their p-values to choose the best predictor for each split of the data and, in this way, build the tree. According to the authors: "If the global hypothesis can be rejected, we measure the association between Y and each of the covariates Xj, j = 1, . . . , m, by test statistics or P-values indicating the deviation from the partial hypotheses." (Hothorn et al., 2006, p. 3).
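The p-value-based variable selection at the heart of CTREE can be sketched in base R without the party package: test the association between the response and each covariate, then split on the covariate with the smallest p-value. The simulated data and variable names below are illustrative assumptions, not the HMDA data or the party implementation itself.

```r
# Sketch of CTREE-style variable selection (not the party implementation):
# measure association between a binary response and each covariate with a
# hypothesis test, then split on the covariate with the smallest p-value.
set.seed(1)
n  <- 500
x1 <- rnorm(n)                                # informative covariate
x2 <- rnorm(n)                                # pure-noise covariate
y  <- factor(rbinom(n, 1, plogis(1.5 * x1)))  # response driven by x1 only

# p-value of association between y and each covariate (two-sample t-test)
pvals <- c(
  x1 = t.test(x1 ~ y)$p.value,
  x2 = t.test(x2 ~ y)$p.value
)
best <- names(which.min(pvals))  # covariate chosen for the first split
best                             # "x1": the informative covariate wins
```

The party package uses permutation tests with multiplicity adjustment rather than plain t-tests, but the selection logic is the same: split only when the global independence hypothesis is rejected, on the most significant covariate.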

Logistic Regression provides a more traditional framework for solving classification problems: it models the probability of a binary outcome as a logistic function of a linear combination of the predictors.
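The mechanics can be shown in a few lines of base R: a linear predictor is passed through the logistic link to give a probability, and a cutoff turns that probability into a decision. The coefficients and scores below are hypothetical, chosen only to illustrate the mapping.

```r
# Minimal sketch of the logistic link: a linear score b0 + b1 * x is
# mapped into a probability in (0, 1), and a 0.5 cutoff converts that
# probability into a deny/approve decision. Coefficients are hypothetical.
b0 <- -1; b1 <- 2                      # hypothetical coefficients
x  <- c(-2, 0, 2)                      # hypothetical applicant scores
p  <- 1 / (1 + exp(-(b0 + b1 * x)))   # equivalent to plogis(b0 + b1 * x)
decision <- ifelse(p > 0.5, "deny", "approve")
round(p, 3)    # 0.007 0.269 0.953
decision       # "approve" "approve" "deny"
```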

Random Forests generate many classification trees. To classify a new object from an input vector, we run that vector through each tree in the forest. Each tree casts a vote for a class, and the majority vote wins (for regression rather than classification, the average of the trees' predictions is used instead). Random forests also introduce additional randomness when growing the trees: rather than searching for the most important feature when splitting a node, they search for the best feature within a random subset of features, which yields greater diversity among the trees. When using the HMDA Boston data, the ctree misclassifies 228 of the 2,380 observations, producing an error rate of 9.6 percent. In comparison, a straight logit model does somewhat better, misclassifying 225, for an error rate of 9.5 percent. The random forest method misclassified 223 of the 2,380 cases. Overall, the random forest approach produced a marginally better performance relative to the ctree.
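The error rates quoted above are simply the misclassified counts divided by the 2,380 observations; the arithmetic can be checked directly in base R.

```r
# Error rates from the misclassification counts reported above:
# misclassified cases divided by the 2,380 observations, in percent.
n <- 2380
misclassified <- c(ctree = 228, logit = 225, random_forest = 223)
error_rate <- round(100 * misclassified / n, 1)
error_rate   # ctree 9.6, logit 9.5, random_forest 9.4 (percent)
```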

###################################################

# R code for "Big Data: New Tricks for Econometrics"

# Journal of Economic Perspectives 28(2), 3-28

# http://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.28.2.3

# Hal R. Varian

###################################################


# load libraries and data

library(party)

library(Ecdat)

data(Hdma)

# fix annoying spelling error

names(Hdma)[11] <- "condo"


# for reproducibility

set.seed(1234)


####################################

# all complete cases, all predictors

####################################


all <- Hdma[complete.cases(Hdma),]

all.fit <- ctree(deny ~ .,data=all)

plot(all.fit)


all.pred <- predict(all.fit)

all.conf <- table(all$deny,all.pred)

all.conf

all.pred

all$deny


all.error <- all.conf[2,1]+all.conf[1,2]

all.error



#######################################

# no black predictor

#######################################


noblack <- all[,-12]

noblack.fit <- ctree(deny ~ .,data=noblack)

noblack.pred <- predict(noblack.fit)



# compare these predictions to the "all predictor" predictions

all.equal(all.pred,noblack.pred)



####################################

# remove predictors one-by-one and check error count

####################################


for (t in 1:12) {
  drop1 <- all[,-t]
  drop1.fit <- ctree(deny ~ ., data=drop1)
  drop1.pred <- predict(drop1.fit)
  drop1.conf <- table(drop1$deny, drop1.pred)
  error <- drop1.conf[2,1] + drop1.conf[1,2]
  print(c(names(all)[t], format(error - all.error, digits=4)))
}


#######################################

# compare to logit

#######################################

logit.fit <- glm(deny ~ .,data=all,family="binomial")

logit.temp <- predict(logit.fit,type="response")

logit.pred <- logit.temp > .5

logit.conf <- table(all$deny,logit.pred)

logit.conf

logit.pred


logit.error <- logit.conf[1,2]+logit.conf[2,1]

logit.error


summary(logit.fit)


#######################################

# compare to random forest

######################################

library(randomForest)

# randomForest package version 4.5-36

# Type rfNews() to see new features/changes/bug fixes.


set.seed(1234)

rf.fit <- randomForest(deny ~ .,data=all,importance=TRUE)

rf.pred <- predict(rf.fit,type="class")

rf.conf <- table(all$deny,rf.pred)

rf.conf

rf.pred


error <- rf.conf[1,2]+rf.conf[2,1]

error


imp <- importance(rf.fit)

rev(sort(imp[,3]))

imp

rf.fit

# importance plot

varImpPlot(rf.fit)

Confusion Matrices

A confusion matrix permits visualization of the performance of a machine learning classification algorithm. Each row of the tabulated matrix compares the model's binary predictions against the true outcomes, so correct predictions fall on the diagonal and misclassifications off it. This allows model predictions to be checked directly against actual outcomes.
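A toy example in base R, mirroring the table(actual, predicted) calls in the script above; the actual and predicted vectors here are made up purely for illustration.

```r
# Toy confusion matrix: diagonal entries are correct classifications and
# the sum of the off-diagonal entries is the misclassification count.
actual    <- factor(c("no", "no", "no", "yes", "yes", "no", "yes", "no"))
predicted <- factor(c("no", "yes", "no", "yes", "no", "no", "yes", "no"))
conf <- table(actual, predicted)
conf
errors <- conf[1,2] + conf[2,1]   # off-diagonal cells
errors                            # 2 misclassified cases out of 8
```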

R code in Google Colab

In the Google Colab below, we set out the R code for Big Data: New Tricks for Econometrics and replicate the output from RStudio. Google Colab provides a useful means of running R notebooks and sharing them with collaborators. The ctree graph output, however, was difficult to interpret and seems better visualized in a dedicated R environment.

Building an Algorithm to predict Mortgage Approval based on historical lending decisions

Munnell, Tootell, Browne, and McEneaney (1996) at the Boston Fed examined mortgage lending in Boston to determine if race played a significant role in determining who was approved for a mortgage. The primary econometric technique they relied upon was logistic regression, with race included as one of the predictors or independent variables. The coefficient on race showed a statistically significant negative impact on the probability of getting a mortgage for minority applicants. This finding prompted considerable subsequent debate and discussion. Here we apply machine learning techniques of the type suggested by Varian (2014). The data consist of 2,380 observations on 12 predictors, one of which is race.

We extend the analysis to consider how to train algorithms to automate the lending or mortgage approval process and then test the algorithms against out-of-sample data. We use the sklearn library and import a number of models, including Logistic Regression, SVMs, K-Nearest Neighbours, Decision Trees and Random Forest classifiers. We then use historical lending patterns to shape eligibility and predict mortgage approval. The algorithms do nothing more than attempt to replicate the historical loan patterns of lending officers. The lending algorithms created are therefore not state of the art, but they do reflect historical norms - flawed or not. These benchmarks could nevertheless be used to determine how patterns in lending change.
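The train-then-test workflow described above can be sketched in base R (the sklearn pipeline follows the same pattern). The simulated predictors, their names, and the 70/30 split fraction are illustrative assumptions, not the HMDA data itself.

```r
# Sketch of the train/test workflow: fit a logit "denial" model on a
# training split and measure its error on held-out data. The simulated
# data and 70/30 split are illustrative assumptions.
set.seed(1234)
n <- 1000
dat <- data.frame(
  dir = runif(n, 0, 1),                  # hypothetical debt-to-income ratio
  ccs = sample(1:6, n, replace = TRUE)   # hypothetical credit score category
)
p <- plogis(-4 + 5 * dat$dir + 0.3 * dat$ccs)  # true denial probability
dat$deny <- factor(rbinom(n, 1, p), labels = c("no", "yes"))

train_idx <- sample(n, size = 0.7 * n)   # 70/30 train/test split
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

fit  <- glm(deny ~ ., data = train, family = binomial)
pred <- ifelse(predict(fit, newdata = test, type = "response") > 0.5,
               "yes", "no")
mean(pred != test$deny)                  # out-of-sample error rate
```

Out-of-sample error is the honest benchmark here: measuring error on the same observations used for fitting, as the Varian replication above does, tends to understate how the algorithm would perform on new applicants.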