Mortgage finance is integral to funding small and large enterprises, where banks seek collateral to back loans. Ever wonder how banks vet and approve mortgage applications? This is not dissimilar to asking what factors conspire to make a sale, or to convert web-browsing activity into a sale. Of course, some data would be useful here in determining the process. Rather than asking the bank directly what procedures and criteria it adheres to when granting mortgages, it may be more revealing to allow a machine learning algorithm to tease out this process. This approach could be useful to regulators if they wish to determine whether small businesses are discriminated against. (This line of thought is a bit explosive, so I might park the question here for the moment.)

At a minimum, we require data to predict binary outcomes: Success/Failure, Survive/Perish, Mortgage Approved/Denied. As it happens, Hal Varian (2014) https://pubs.aeaweb.org/doi/pdf/10.1257/jep.28.2.3 provides an interesting case study of the Boston HMDA data, replete with predictors for mortgage origination. The dataset is now quite old, but nevertheless useful for cutting our teeth. Please follow the link https://www.openicpsr.org/openicpsr/project/113925/version/V1/view to ml-data and then select the HMDA folder. Entrepreneurs may also have proprietary binary data, organised in a relational database, that can be explored using the same R machine learning packages. The purpose of setting out the machine learning example here is to demonstrate the relative ease of engaging with this type of technology. To understand the timing of mortgage repayments and some other preliminaries, you might check out the following three video links (or skip ahead if you are already familiar with mortgage math and amortization):
Below is a quick introduction to basic mortgage math: how do we estimate the monthly repayment on a mortgage? I also demonstrate how to build an amortization schedule in Excel, and introduce VBA, a key tool for automating spreadsheets.
Both the Windows and (currently) the Mac versions of Excel support programming through Visual Basic for Applications (VBA). User-defined VBA code permits spreadsheet manipulation that is otherwise cumbersome or not feasible with standard spreadsheet techniques. Small snippets of code can be introduced directly, replete with debugging and code-module organization. Entrepreneurs on tight budgets can implement numerical methods in VBA, as well as automating tasks such as formatting or data organization. Customization, dashboard visualisations and numeric functions can all be crafted within this environment, putting the small entrepreneur in control. Below we set out an amortization schedule for paying down a mortgage or loan.
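For readers who prefer to stay in R rather than Excel/VBA, the same arithmetic can be sketched with the standard level-payment annuity formula (a minimal illustration; the function names here are my own, not from any package):

```r
# Monthly repayment on a level-payment mortgage (standard annuity formula):
# payment = principal * r / (1 - (1 + r)^-n),
# where r is the monthly rate and n the number of monthly payments.
monthly_payment <- function(principal, annual_rate, years) {
  r <- annual_rate / 12
  n <- years * 12
  principal * r / (1 - (1 + r)^-n)
}

# Amortization schedule: each payment splits into interest (monthly rate
# times the outstanding balance) and principal (the remainder), driving
# the balance to zero at the final payment.
amortization_schedule <- function(principal, annual_rate, years) {
  r <- annual_rate / 12
  n <- years * 12
  pay <- monthly_payment(principal, annual_rate, years)
  balance <- principal
  rows <- vector("list", n)
  for (i in seq_len(n)) {
    interest <- balance * r
    repaid <- pay - interest
    balance <- balance - repaid
    rows[[i]] <- data.frame(month = i, payment = pay, interest = interest,
                            principal = repaid, balance = balance)
  }
  do.call(rbind, rows)
}

# Example: a 300,000 loan at 4% over 25 years
sched <- amortization_schedule(300000, 0.04, 25)
head(sched)
tail(sched, 1)  # balance is (numerically) zero after the final payment
```

This is exactly the schedule the Excel/VBA demonstration builds, just expressed as an R data frame.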
or even in google sheets (if that is your preferred poison):
Now with this in mind, you might consider how to use the data provided in Hal Varian's 2014 paper. Hal applies machine learning tree-based estimators akin to the ctree developed to predict survivorship on the Titanic. The Boston HMDA dataset consists of 2,380 observations of 12 predictors, one of which was race. (This is a relatively small dataset, not unlike the scale of transactions that might be recorded by a micro-entrepreneur.) The predictors incorporated into the analysis include:
dir: debt payments to total income ratio
hir: housing expenses to income ratio
lvr: ratio of size of loan to assessed value of property
ccs: consumer credit score from 1 to 6 (a low value being a good score)
mcs: mortgage credit score from 1 to 4 (a low value being a good score)
pbcr: public bad credit record ?
dmi: denied mortgage insurance ?
self: self employed ?
single: is the applicant single ?
uria: 1989 Massachusetts unemployment rate in the applicant's industry
condominium: is unit a condominium ? (was called comdominiom in version 0.2-9 and earlier versions of the package)
black: is the applicant black ?
deny: mortgage application denied ?
The video playlist and Figure 5 from Hal Varian's paper show how to generate a conditional inference tree estimated using the R package party. As can be observed from Figure 5, the most important variable is dmi = "denied mortgage insurance", which appears to be a strong indicator. The race variable, in contrast, shows up far down the tree and seems relatively less important. The black bars signify the fraction of each group that was denied a mortgage. Hal concedes that racial discrimination may still have been embedded elsewhere in the mortgage process, or that some of the included predictors are highly correlated with race.
The segment of R code presented below was obtained from Hal Varian's 2014 paper. The analysis can be made simple and intuitive by running the code in the RStudio Desktop app, or even in RStudio Cloud. The graphs that feature in the video above were generated with this R code.
###################################################
# R code for "Big Data: New Tricks for Econometrics"
# Journal of Economic Perspectives 28(2), 3-28
# http://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.28.2.3
# Hal R. Varian
###################################################
# Please follow link: https://www.openicpsr.org/openicpsr/project/113925/version/V1/view
library(Ecdat)
library(party)
data(Hdma)
# fix annoying spelling error
names(Hdma)[11] <- "condo"
# dir: debt payments to total income ratio;
# hir: housing expenses to income ratio;
# lvr: ratio of size of loan to assessed value of property;
# ccs: consumer credit score;
# mcs: mortgage credit score;
# pbcr: public bad credit record;
# dmi: denied mortgage insurance;
# self: self employed;
# single: applicant is single;
# uria: 1989 Massachusetts unemployment rate applicant's industry;
# condo: condominium (renamed from comdominiom above);
# black: race of applicant black;
# deny: mortgage application denied;
################################
# all
################################
all <- Hdma[complete.cases(Hdma),]
all.fit <- ctree(deny ~ .,data=all)
# Figure 5 in paper
#pdf("all.pdf",height=8,width=16)
plot(all.fit)
graphics.off()
#pdf("all.pdf")
# small version of plot in case it is needed
small.dat <- with(all,data.frame(deny,dmi,black))
small.fit <- ctree(deny ~ .,data=small.dat)
plot(small.fit)
graphics.off()
Machine learning and models of qualitative choice can be used to develop some alternative perspectives; see the video just below. Both the machine learning and the qualitative choice approaches produce results supporting the view that race cannot be excluded as a factor affecting the likelihood of mortgage origination.
Running the logit model yields the fitted equation

P( Deny ) = F( -4.13 + 5.37 * dir + 1.27 * Black )

where F is the logistic CDF. The full output from the logit estimation:
Call:
glm(formula = deny ~ dir + black, family = binomial(link = "logit"),
data = all)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3709 -0.4732 -0.4219 -0.3556 2.8038
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.1256 0.2684 -15.370 < 2e-16 ***
dir 5.3704 0.7283 7.374 1.66e-13 ***
blackyes 1.2728 0.1462 8.706 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1744.2 on 2379 degrees of freedom
Residual deviance: 1591.4 on 2377 degrees of freedom
AIC: 1597.4
Note that the coefficient for black is statistically significant in the logit model. See the R code just below to replicate the logit estimation.
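As a quick sanity check, the fitted equation can be evaluated directly with plogis(), the logistic CDF in base R. Using the rounded coefficients reported above (so the numbers are approximate):

```r
# Predicted denial probability at dir = 0.3, from the rounded logit
# coefficients reported above; plogis() is the logistic CDF F.
p_white <- plogis(-4.13 + 5.37 * 0.3)         # black = 0
p_black <- plogis(-4.13 + 5.37 * 0.3 + 1.27)  # black = 1
round(c(white = p_white, black = p_black), 3)  # about 0.075 vs 0.223
round(p_black - p_white, 3)                    # difference of roughly 0.148
```

These back-of-the-envelope numbers mirror the prediction step at the end of the full script below, which uses predict() on the estimated model rather than rounded coefficients.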
Below, I combine Hal Varian's R code from "Big Data: New Tricks for Econometrics" with some R code from "Introduction to Econometrics with R". The latter demonstrates how the logit model can be used for predictions. The logit model appears to present stronger evidence that race was a determining factor.
###################################################
# R code for "Big Data: New Tricks for Econometrics"
# Journal of Economic Perspectives 28(2), 3-28
# http://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.28.2.3
# Hal R. Varian
###################################################
library(Ecdat)
library(party)
data(Hdma)
# fix annoying spelling error
names(Hdma)[11] <- "condo"
# dir: debt payments to total income ratio;
# hir: housing expenses to income ratio;
# lvr: ratio of size of loan to assessed value of property;
# ccs: consumer credit score;
# mcs: mortgage credit score;
# pbcr: public bad credit record;
# dmi: denied mortgage insurance;
# self: self employed;
# single: applicant is single;
# uria: 1989 Massachusetts unemployment rate applicant's industry;
# condo: condominium (renamed from comdominiom above);
# black: race of applicant black;
# deny: mortgage application denied;
# inspect the data
head(Hdma)
summary(Hdma)
# Mean P/I ratio
mean(Hdma$dir)
# housing expense-to-income ratio
mean(Hdma$hir)
# loan-to-value ratio
mean(Hdma$lvr)
# consumer credit score
mean(as.numeric(Hdma$ccs))
# mortgage credit score
mean(as.numeric(Hdma$mcs))
# public bad credit record
mean(as.numeric(Hdma$pbcr))
# denied mortgage insurance (dmi)
prop.table(table(Hdma$dmi))
# self-employed
prop.table(table(Hdma$self))
# single
prop.table(table(Hdma$single))
# high school diploma (hschool is not a column in the Ecdat Hdma data,
# so this line is commented out)
#prop.table(table(Hdma$hschool))
# unemployment rate in applicant's industry (uria)
mean(Hdma$uria)
# condominium
prop.table(table(Hdma$condo))
# black
prop.table(table(Hdma$black))
# deny
prop.table(table(Hdma$deny))
################################
# all
################################
all <- Hdma[complete.cases(Hdma),]
all.fit <- ctree(deny ~ .,data=all)
# public bad credit record
mean(as.numeric(all$pbcr))
# Figure 5 in paper
#pdf("all.pdf",height=8,width=16)
plot(all.fit)
graphics.off()
#pdf("all.pdf")
# small version of plot in case it is needed
small.dat <- with(all,data.frame(deny,dmi,black))
small.fit <- ctree(deny ~ .,data=small.dat)
plot(small.fit)
graphics.off()
##############################
# From Book
# Introduction to Econometrics with R
# Based on Stock and Watson Book
# https://www.econometrics-with-r.org/1-introduction.html
#help("StockWatson2007")
# load `AER` package and attach the Hdma data
#library(AER)
# Data and Examples from Stock and Watson (2007)
#convert 'deny' to numeric
all$deny <- as.numeric(all$deny) - 1
# estimate a simple linear probabilty model
denymod1 <- lm(deny ~ dir, data = all)
denymod1
# plot the data
plot(x = all$dir,
y = all$deny,
main = "Scatterplot Mortgage Application Denial and the Monthly Debt-to-Income Ratio",
xlab = "dir",
ylab = "Deny",
pch = 20,
ylim = c(-0.4, 1.4),
cex.main = 0.8)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")
# add the estimated regression line
abline(denymod1,
lwd = 1.8,
col = "steelblue")
# print robust coefficient summary
# coeftest(denymod1, vcov. = vcovHC, type = "HC1")
# estimate the model
denymod2 <- lm(deny ~ dir + black, data = all)
# coeftest(denymod2, vcov. = vcovHC)
summary(denymod2)
denylogit <- glm(deny ~ dir,
family = binomial(link = "logit"),
data = all)
#coeftest(denylogit, vcov. = vcovHC, type = "HC1")
denylogit
#plot data
plot(x = all$dir,
y = all$deny,
main = "Logit Model of the Probability of Denial, Given Debt/Income Ratio",
xlab = "Debt/Income ratio",
ylab = "Deny",
pch = 20,
ylim = c(-0.4, 1.4),
cex.main = 0.9)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")
# add estimated regression line of Probit and Logit models
x <- seq(0, 3, 0.01)
y_logit <- predict(denylogit, list(dir = x), type = "response")
lines(x, y_logit, lwd = 1.5, col = "black", lty = 2)
# add a legend
legend("topleft",
horiz = TRUE,
legend = c( "Logit"),
col = c("black"),
lty = c( 2))
#estimate a Logit regression with multiple regressors
denylogit2 <- glm(deny ~ dir + black,
family = binomial(link = "logit"),
data = all)
#coeftest(denylogit2, vcov. = vcovHC, type = "HC1")
summary(denylogit2)
# 1. compute predictions for Debt/Income ratio = 0.3
predictions <- predict(denylogit2,
newdata = data.frame("black" = c("no", "yes"),
"dir" = c(0.3, 0.3)),
type = "response")
predictions
# 2. Compute difference in probabilities
diff(predictions)
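Another common way to read the logit coefficients (a standard interpretation, not taken from the paper itself) is to exponentiate them into odds ratios. Using the reported estimate of 1.2728 for blackyes:

```r
# Exponentiating a logit coefficient gives an odds ratio: holding dir
# fixed, the odds of denial for a black applicant are exp(1.2728) times
# the odds for a non-black applicant.
odds_ratio <- exp(1.2728)
round(odds_ratio, 2)  # roughly 3.6
```

That is, conditional on the debt-to-income ratio, the model puts the odds of denial for black applicants at several times those of otherwise similar applicants, which is the same message as the predicted-probability comparison above.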
For a more comprehensive introduction to qualitative choice models as applied to the Boston HMDA dataset, please follow the link to Chapter 11 of the online text Introduction to Econometrics with R.