Mortgage finance is integral to funding small and large enterprises, with banks typically seeking collateral to back loans. Ever wonder how banks vet and approve mortgage applications? The question is not dissimilar to asking what factors conspire to make a sale, or to convert browsing activity on a website into a sale. Some data would clearly be useful here in determining the process. Rather than asking a bank directly what procedures and criteria it adheres to when granting mortgages, it may be more revealing to let a machine learning algorithm tease out the process. This approach could be useful to regulators if they wish to determine whether small businesses are discriminated against. (This line of thought is a bit explosive, so I might park this question here for the moment.) At a very minimum, we would require data to predict binary outcomes: Success/Failure, Survive/Perish, Mortgage Approved/Denied. As it happens, Hal Varian (2014) https://pubs.aeaweb.org/doi/pdf/10.1257/jep.28.2.3 provides an interesting case study of the Boston HMDA data, replete with predictors for mortgage origination. This dataset is now very old but nevertheless useful for sharpening our teeth. Please follow the link https://www.openicpsr.org/openicpsr/project/113925/version/V1/view to ml-data and then select the HMDA folder. Entrepreneurs may also have proprietary binary data, organised in a relational database, that can be explored using the same R machine learning packages. The purpose of setting out the machine learning example here is to demonstrate the relative ease of engaging with this type of technology. To understand a little of the timing of mortgage repayments and some other preliminaries, you might check out the following three video links (otherwise skip them if you are already familiar with mortgage math and amortization):
Below is a quick introduction to basic mortgage math: how do you estimate the monthly repayment on a mortgage? I also demonstrate how to estimate an amortization schedule in Excel,
or even in Google Sheets (if that is your preferred poison):
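The same arithmetic can also be sketched directly in R. This is a minimal illustration of the standard level-payment formula, not code from Varian's paper; the loan amount, rate, and term are made-up example values:

```r
# Monthly repayment on a level-payment mortgage:
# M = P * r / (1 - (1 + r)^-n), with r the monthly rate and n the number of payments.
monthly_payment <- function(principal, annual_rate, years) {
  r <- annual_rate / 12
  n <- years * 12
  principal * r / (1 - (1 + r)^(-n))
}
m <- monthly_payment(200000, 0.04, 30)  # e.g. a 200,000 loan at 4% over 30 years
round(m, 2)  # ~954.83

# Build the amortization schedule: each month the interest is charged on the
# outstanding balance, and the remainder of the payment reduces the principal.
amortize <- function(principal, annual_rate, years) {
  r <- annual_rate / 12
  n <- years * 12
  m <- monthly_payment(principal, annual_rate, years)
  interest <- paid <- balance <- numeric(n)
  b <- principal
  for (i in 1:n) {
    interest[i] <- b * r
    paid[i] <- m - interest[i]
    b <- b - paid[i]
    balance[i] <- b
  }
  data.frame(month = 1:n, interest, principal = paid, balance)
}
head(amortize(200000, 0.04, 30), 3)  # first three rows of the schedule
```

Note how early payments are mostly interest; the balance column should run down to (numerically) zero in the final month.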
Now, with this in mind, you might consider how to use the data provided in Hal Varian's 2014 paper. Hal applies a machine learning tree-based estimator akin to the ctree developed to predict survivorship on the Titanic. The Boston HMDA dataset consists of 2,380 observations on 12 predictors, one of which is race. (This is a relatively small dataset, not unlike the scale of transactions that might be recorded by a micro-entrepreneur.) The predictors incorporated into the analysis include:
dir: debt payments to total income ratio
hir: housing expenses to income ratio
lvr: ratio of size of loan to assessed value of property
ccs: consumer credit score from 1 to 6 (a low value being a good score)
mcs: mortgage credit score from 1 to 4 (a low value being a good score)
pbcr: public bad credit record?
dmi: denied mortgage insurance?
self: self employed?
single: is the applicant single?
uria: 1989 Massachusetts unemployment rate in the applicant's industry
condominium: is the unit a condominium? (was called comdominiom in version 0.2-9 and earlier versions of the package)
black: is the applicant black?
deny: mortgage application denied?
The video playlist and Figure 5 from Hal Varian’s paper show how to generate a conditional inference tree estimated using the R package party. As can be observed from Figure 5, the most important variable is dmi = “denied mortgage insurance”, which appears to be a strong predictor of denial. The race variable, in contrast, shows up far down the tree and seems relatively unimportant. The black bars signify the fraction of each group that was denied a mortgage. Hal concedes that it is feasible that racial discrimination was embedded elsewhere in the mortgage process, or that some of the included predictors are highly correlated with race.
The segment of R code presented below was obtained from Hal Varian's 2014 paper; it was used to generate the graphs that feature in the video above.
###################################################
# R code for "Big Data: New Tricks for Econometrics"
# Journal of Economic Perspectives 28(2), 3-28
# http://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.28.2.3
# Hal R. Varian
###################################################
# Please follow link: https://www.openicpsr.org/openicpsr/project/113925/version/V1/view
library(Ecdat)
library(party)
data(Hdma)
# fix annoying spelling error
names(Hdma)[11] <- "condo"
# dir: debt payments to total income ratio;
# hir: housing expenses to income ratio;
# lvr: ratio of size of loan to assessed value of property;
# ccs: consumer credit score;
# mcs: mortgage credit score;
# pbcr: public bad credit record;
# dmi: denied mortgage insurance;
# self: self employed;
# single: applicant is single;
# uria: 1989 Massachusetts unemployment rate applicant's industry;
# condo: condominium (renamed from comdominiom above);
# black: race of applicant black;
# deny: mortgage application denied;
################################
# all
################################
all <- Hdma[complete.cases(Hdma),]
all.fit <- ctree(deny ~ .,data=all)
# Figure 5 in paper
#pdf("all.pdf",height=8,width=16)
plot(all.fit)
graphics.off()
#pdf("all.pdf")
# small version of plot in case it is needed
small.dat <- with(all,data.frame(deny,dmi,black))
small.fit <- ctree(deny ~ .,data=small.dat)
plot(small.fit)
graphics.off()
Machine learning and models of qualitative choice can be used to try to develop some alternative perspectives; see the video just below. Both the machine learning and qualitative choice models produce marginal results that support the view that race cannot be excluded as a factor affecting the likelihood of mortgage origination.
We obtain the following fitted equation when we run the logit model:
P( Deny ) = F( -4.13 + 5.37 * dir + 1.27 * Black )
The full output from the logit estimation:
Call:
glm(formula = deny ~ dir + black, family = binomial(link = "logit"),
data = all)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3709 -0.4732 -0.4219 -0.3556 2.8038
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.1256 0.2684 -15.370 < 2e-16 ***
dir 5.3704 0.7283 7.374 1.66e-13 ***
blackyes 1.2728 0.1462 8.706 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1744.2 on 2379 degrees of freedom
Residual deviance: 1591.4 on 2377 degrees of freedom
AIC: 1597.4
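To see what these estimates imply, we can plug the fitted coefficients into the logistic CDF by hand (plogis in base R). The debt-to-income ratio of 0.30 used below is just an illustrative value, not one taken from the paper:

```r
# Fitted model: P(deny) = F(-4.1256 + 5.3704*dir + 1.2728*black),
# where F is the logistic CDF, i.e. plogis() in base R.
p_black <- plogis(-4.1256 + 5.3704 * 0.30 + 1.2728)  # black applicant, dir = 0.30
p_white <- plogis(-4.1256 + 5.3704 * 0.30)           # white applicant, dir = 0.30
round(c(black = p_black, white = p_white), 3)
# black ~ 0.224, white ~ 0.075: a gap of roughly 15 percentage points
```

Because the logit model is nonlinear in the index, this gap itself varies with dir; the predict() call further below computes the same kind of comparison directly from the fitted object.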
Note that the coefficient for black is statistically significant in the logit model. See the R code just below to replicate the logit estimation.
Below, I combine Hal Varian's R code from "Big Data: New Tricks for Econometrics" with some R code from "Introduction to Econometrics with R". The latter is used to demonstrate how the logit model can be used for predictions. The logit model seems to present stronger evidence that race was a determining factor.
###################################################
# R code for "Big Data: New Tricks for Econometrics"
# Journal of Economic Perspectives 28(2), 3-28
# http://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.28.2.3
# Hal R. Varian
###################################################
library(Ecdat)
library(party)
data(Hdma)
# fix annoying spelling error
names(Hdma)[11] <- "condo"
# dir: debt payments to total income ratio;
# hir: housing expenses to income ratio;
# lvr: ratio of size of loan to assessed value of property;
# ccs: consumer credit score;
# mcs: mortgage credit score;
# pbcr: public bad credit record;
# dmi: denied mortgage insurance;
# self: self employed;
# single: applicant is single;
# uria: 1989 Massachusetts unemployment rate applicant's industry;
# condo: condominium (renamed from comdominiom above);
# black: race of applicant black;
# deny: mortgage application denied;
# inspect the data
head(Hdma)
summary(Hdma)
# Mean P/I ratio
mean(Hdma$dir)
# housing expense-to-income ratio
mean(Hdma$hir)
# loan-to-value ratio
mean(Hdma$lvr)
# consumer credit score
mean(as.numeric(Hdma$ccs))
# mortgage credit score
mean(as.numeric(Hdma$mcs))
# public bad credit record
mean(as.numeric(Hdma$pbcr))
# denied mortgage dmi
prop.table(table(Hdma$dmi))
# self-employed
prop.table(table(Hdma$self))
# single
prop.table(table(Hdma$single))
# high school diploma (hschool is not a column in the Ecdat version of Hdma)
# prop.table(table(Hdma$hschool))
# unemployment rate in applicant's industry (uria)
mean(Hdma$uria)
# condominium
prop.table(table(Hdma$condo))
# black
prop.table(table(Hdma$black))
# deny
prop.table(table(Hdma$deny))
################################
# all
################################
all <- Hdma[complete.cases(Hdma),]
all.fit <- ctree(deny ~ .,data=all)
# public bad credit record
mean(as.numeric(all$pbcr))
# Figure 5 in paper
#pdf("all.pdf",height=8,width=16)
plot(all.fit)
graphics.off()
#pdf("all.pdf")
# small version of plot in case it is needed
small.dat <- with(all,data.frame(deny,dmi,black))
small.fit <- ctree(deny ~ .,data=small.dat)
plot(small.fit)
graphics.off()
##############################
# From Book
# Introduction to Econometrics with R
# Based on Stock and Watson Book
# https://www.econometrics-with-r.org/1-introduction.html
#help("StockWatson2007")
# load `AER` package and attach the Hdma data
#library(AER)
# Data and Examples from Stock and Watson (2007)
#convert 'deny' to numeric
all$deny <- as.numeric(all$deny) - 1
# estimate a simple linear probabilty model
denymod1 <- lm(deny ~ dir, data = all)
denymod1
# plot the data
plot(x = all$dir,
y = all$deny,
main = "Scatterplot Mortgage Application Denial and the Monthly Debt-to-Income Ratio",
xlab = "dir",
ylab = "Deny",
pch = 20,
ylim = c(-0.4, 1.4),
cex.main = 0.8)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")
# add the estimated regression line
abline(denymod1,
lwd = 1.8,
col = "steelblue")
# print robust coefficient summary
# coeftest(denymod1, vcov. = vcovHC, type = "HC1")
# rename the variable 'black' for consistency
#colnames(all)[colnames(all) == "black"] <- "black"
# estimate the model
denymod2 <- lm(deny ~ dir + black, data = all)
# coeftest(denymod2, vcov. = vcovHC)
summary(denymod2)
denylogit <- glm(deny ~ dir,
family = binomial(link = "logit"),
data = all)
#coeftest(denylogit, vcov. = vcovHC, type = "HC1")
denylogit
#plot data
plot(x = all$dir,
y = all$deny,
main = "Logit Model of the Probability of Denial, Given Debt/Income Ratio",
xlab = "Debt/Income ratio",
ylab = "Deny",
pch = 20,
ylim = c(-0.4, 1.4),
cex.main = 0.9)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")
# add estimated regression line of Probit and Logit models
x <- seq(0, 3, 0.01)
y_logit <- predict(denylogit, list(dir = x), type = "response")
lines(x, y_logit, lwd = 1.5, col = "black", lty = 2)
# add a legend
legend("topleft",
horiz = TRUE,
legend = c( "Logit"),
col = c("black"),
lty = c( 2))
#estimate a Logit regression with multiple regressors
denylogit2 <- glm(deny ~ dir + black,
family = binomial(link = "logit"),
data = all)
#coeftest(denylogit2, vcov. = vcovHC, type = "HC1")
summary(denylogit2)
# 1. compute predictions for Debt/Income ratio = 0.3
predictions <- predict(denylogit2,
newdata = data.frame("black" = c("no", "yes"),
"dir" = c(0.3, 0.3)),
type = "response")
predictions
# 2. Compute difference in probabilities
diff(predictions)
For a more comprehensive introduction to qualitative choice models as applied to the Boston HMDA dataset, please follow the link to chapter 11 of the online text: Introduction to Econometrics with R.
Bostic (1996), see link to paper, re-examines claims that non-economic discrimination persists in mortgage loan origination decisions. Bostic (1996) found that racial differences in outcomes do exist, as minorities fare worse regarding debt-to-income requirements (dir). (This is consistent with the logit analysis above.) He also claimed that minorities do better on loan-to-value (lvr) requirements. Bostic (1996) states that "significant racial differentials exist only for 'marginal' applicants and are not present for those with higher incomes or those with no credit problems. Thus, the claim that non-economic discrimination is a general phenomenon is refuted. Further, I can say little regarding the existence of discrimination among 'marginal' applicants. To conclude that such discrimination exists, one must prove that the observed differences are not due to economic factors." To tease this out a bit, we might initially address loan-to-value and its relevance for advancing loans. The loan-to-value ratio (lvr) is a financial term used by banks to express the ratio of a loan to the value of the asset acquired: here, the ratio of the size of the loan to the assessed value of the property. Loan-to-value is one of the major risk factors that mortgage originators assess when vetting qualifying mortgagors. The risk of financial distress is central to lending decisions, and the probability of a lender incurring a loss increases as the amount of equity decreases. Therefore, as the lvr of a loan increases, the qualification guidelines for certain mortgage programs become much stricter. Banks may as a result impose hard lvr rules. This could be another way to discriminate across groups, particularly against those who have a priori experienced higher levels of financial distress.
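As a concrete illustration of the ratio (with made-up figures, not values drawn from the HMDA data):

```r
# Loan-to-value: size of the loan relative to the assessed property value.
ltv <- function(loan, assessed_value) loan / assessed_value
ltv(180000, 200000)  # 0.9: only 10% equity, a thin cushion for the lender
ltv(120000, 200000)  # 0.6: a much larger equity buffer against losses
```

A hard lvr rule would simply cap this number (say, refuse loans with lvr above some threshold), which is why differences in lvr across cohorts are worth inspecting.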
The hypothesis we might then address is the extent to which lvr guidelines can be contorted to impose a form of loan-to-value redlining. In the video below, I use machine learning and ctrees to investigate whether evidence of credit rationing can be observed that could possibly disadvantage any particular cohort. The Boston HMDA dataset is used as before.
########################################
# lvr redlining evidence for and against
# the Boston HMDA dataset
library(Ecdat)
library(party)
data(Hdma)
# fix annoying spelling error (rename before subsetting, so "all" inherits the fix)
names(Hdma)[11] <- "condo"
all <- Hdma[complete.cases(Hdma),]
#######################################
# predicting lvr without filtering
#
# Building a ctree to model lvr
all.fitlvr <- ctree(lvr ~ .,data=all)
plot(all.fitlvr)
graphics.off()
# predicting lvr filtering out deny mortgage application cohort
library(dplyr)
str(all)
allaccept <- filter(all, deny == "no")
all.fitacceptlvr <- ctree(lvr ~ .,data=allaccept)
all.fitacceptlvr
plot(all.fitacceptlvr)
graphics.off()
nd2 <- filter(all, deny == "no", dir <= 0.217)
nd5 <- filter(all, deny == "no", dir > 0.217, mcs <= 1, black == "yes" )
nd7 <- filter(all, deny == "no", dir > 0.217, mcs <= 1, black == "no", dir < 0.32 )
nd8 <- filter(all, deny == "no", dir > 0.217, mcs <= 1, black == "no", dir >= 0.32 )
nd10 <- filter(all, deny == "no", dir > 0.217, mcs > 1, black == "yes" )
nd12 <- filter(all, deny == "no", dir > 0.217, mcs > 1, black == "no", mcs <= 3 )
nd13 <- filter(all, deny == "no", dir > 0.217, mcs > 1, black == "no", mcs > 3 )
nd2
nd2$lvr
nd5
nd5$lvr
nd7$lvr
nd8$lvr
nd10$lvr
nd12$lvr
nd13$lvr
summary(nd2$lvr)
summary(nd5$lvr)
summary(nd7$lvr)
summary(nd8$lvr)
summary(nd10$lvr)
summary(nd12$lvr)
summary(nd13$lvr)
boxplot(nd2$lvr,nd5$lvr,nd7$lvr,nd8$lvr,nd10$lvr, nd12$lvr, nd13$lvr)
The R code above can be used to produce the ctree just below. We excluded the deny == "yes" cases so that we are left only with the deny == "no" (approved) mortgages. We used the dplyr package to do the filtering and then ran the ctree to see if lvr differs noticeably across cohorts. A lower loan-to-value denotes a more conservative or prudential lending practice. This, however, is not immediately obvious from the graph below.
The terminal-node outputs are not glaringly suggestive that banks imposed lower loan-to-value terms on African-American borrowers. Race is identified as a factor, but not in the way we might have expected. The ctree machine learning graph was generated using the party package in RStudio. The evidence that banks applied a hard credit rationing policy or a hard lvr rule based on race is not clear. This might not be too counter-intuitive, given that banks are known to want to make a profit on the backs of everybody. They may in fact quite naturally want to maximize lending activity, notwithstanding the historical anecdotage suggestive of segregationist/discriminatory behavior. We should start by taking a look at the median values, comparing node 5 to nodes 7 and 8. The median lvr value for node 5 is actually higher than in the other two, at least at first glance. Likewise, node 10 also appears to have a higher median lvr relative to nodes 12 and 13. This is not consistent with hard credit rationing being applied to the African-American demographic. The results obtained here are not unlike Bostic (1996): "These results are consistent with the proposition that lenders may use different 'rules-of-thumb' in considering loan applications across races. The preceding analysis suggests that these differences have a very particular quality. Minorities are not penalized along the loan-to-value dimension, as rejection probabilities for minority applicants do not vary over a wide range of loan-to-value ratios. On the other hand, minority applicants face significantly more stringent debt-to-income requirements. Further, the influence of race changes over ranges of these variables. The divergence in outcomes based on race decreases as an applicant’s debt burden decreases and as the loan-to-value ratio increases."
In the video below, we inspect the predicted values, or terminal nodes, more closely by filtering the data using dplyr according to the values, weights, or criteria shaping each branch or split established in the ctree. This allows us to move beyond merely eyeballing the ctree diagram presented above, and we can then produce summary statistics for the relevant subsets. For the branches that split along black yes/no lines, the mean value of lvr for the black "yes" nodes is clearly higher than in the analogous black "no" leaves. This would suggest that discrimination can come in many forms, and in ways that have not always been considered in the literature. Banks are also motivated by a desire to make profits and may not have incentives to discriminate against particular cohorts by applying hard credit rationing. In fact, banks may in some instances be disposed to lend quite liberally, and this, while not overtly redlining/segregationist, could be predatory.