RStudio Cloud is a hosted version of RStudio that makes it easy to carry out data analysis on the go, even for the tablet warriors among you. Cloud-based computing means storing and accessing data and programs over the Internet rather than on your computer's hard drive, which allows for greater mobility and easier collaboration. If you are just running R scripts, RStudio Cloud is a natural choice for setting up and executing data projects. It replicates the desktop experience seamlessly, even on basic tablets, and has the virtue of saving projects to the cloud. It also permits real-time collaboration among colleagues working on shared projects. To demonstrate some of the practical attributes of RStudio, we implement some of the machine learning models provided by Varian (2014) in the form of R scripts made available with his Journal of Economic Perspectives paper. We apply the tidyverse suite to perform exploratory data analysis and then apply the machine learning libraries which Varian (2014) presents in an interesting and thought-provoking primer.
The "tidyverse" suite assembles some of the most versatile R packages such as ggplot2 for visualization and dplyr for data transformation. The packages work in harmony to clean, process, pivot, explore and present data. Below we perform some exploratory data analysis to get a feel for important attributes and properties in the Munnell, Tootell, Browne, and McEneaney (1996) dataset. To get background to the Varian (2014) HMDA example please follow link. An excellent introduction to many of the more novel tools and “tricks” from machine learning is provided by Varian (2014).
Varian (2014) revisits the classic mortgage lending discrimination study by Munnell, Tootell, Browne, and McEneaney (1996) of the Boston Federal Reserve to show how it could be redone using machine learning techniques: conditional inference trees and random forests. Varian (2014) observes that fitting tree models that omitted race as an explanatory variable to the 1990 mortgage origination data, alongside a tree model that included race, produced mixed results and somewhat conflicting evidence. He notes that the race variable (black) shows up far down the conditional inference tree and seems to be relatively unimportant. To gauge the importance of the race variable, Varian (2014) excludes it from the prediction and compares the performance with the model that includes race. When this is done, it turns out that the accuracy of the tree-based model does not change at all: exactly the same cases are misclassified in the Varian (2014) paper. Of course Hal concedes that it is perfectly possible that there was racial discrimination elsewhere in the mortgage process, or that some of the variables included are highly correlated with race. But he maintains that the tree model produced by standard procedures that omits race fits the observed data just as well as a model that includes race. For some insight into the urban demographics, see below the distribution of population in the Boston Metropolitan Area based on "Whiteness". Please follow the link to an earlier 1992 draft of Munnell, Tootell, Browne and McEneaney, which was initially heavily challenged in the literature.
According to Ladd (1998), the Home Mortgage Disclosure Act (HMDA) was enacted to monitor minority and low-income access to the mortgage market. The data collected in 1990 for this purpose show that minorities were more than twice as likely to be denied a mortgage as whites. Yet variables correlated with both race and creditworthiness were omitted from these data, making any conclusion about race's role in mortgage lending impossible. The Federal Reserve Bank of Boston collected additional variables important to the mortgage lending decision and found that race continued to play an important, though significantly diminished, role in the decision to grant a mortgage. To supplement the HMDA data, Munnell, Tootell, Browne, and McEneaney (1996) at the Boston Fed sought the cooperation of lenders throughout the Boston metropolitan area. They examined 1990 loan applications from minorities in the Boston area, plus a random sample of applications from whites. For each application, the researchers asked lenders to provide an additional set of 38 pieces of information. The study was originally circulated in 1992, then revised in response to some of the early criticisms, and published in the March 1996 issue of the American Economic Review (Munnell, Tootell, Browne, and McEneaney (1996)).
The R script below can be loaded directly into RStudio Cloud, and results can be obtained in the web browser. Don't forget to save your work as results are obtained.
####################################################
# HMDA Boston tidyverse
# Exploratory Data Analysis
# Dataset Described in
# http://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.28.2.3
# Hal R. Varian
###################################################
library(Ecdat)
library(tidyverse)
library(party)
?Hmda
data(Hdma)
# fix annoying spelling error
names(Hdma)[11] <- "condo"
# dir: debt payments to total income ratio;
# hir: housing expenses to income ratio;
# lvr: ratio of size of loan to assessed value of property;
# ccs: consumer credit score;
# mcs: mortgage credit score;
# pbcr: public bad credit record;
# dmi: denied mortgage insurance;
# self: self employed;
# single: applicant is single;
# uria: 1989 Massachusetts unemployment rate in the applicant's industry;
# condo: property is a condominium (renamed from "condominiom" above);
# black: race of applicant black;
# deny: mortgage application denied;
# inspect the data
head(Hdma)
summary(Hdma)
str(Hdma)
view(Hdma)
# Counts of denied and approved applications (deny = no implies approved)
ggplot(Hdma, aes(x = deny)) +
theme_bw() +
geom_bar() +
labs(y = "Mortgage Deny Count",
title = "Mortgage denial, no implies approved")
# Numbers with different ccs (the lower the better)
ggplot(Hdma, aes(x = ccs)) +
theme_bw() +
geom_bar() +
labs(y = "score",
title = "consumer credit score")
# Numbers with different mcs (the lower the better)
ggplot(Hdma, aes(x = mcs)) +
theme_bw() +
geom_bar() +
labs(y = "score",
title = "mortgage credit score")
# Facet the deny counts by ccs
ggplot(Hdma, aes(x = deny)) +
theme_bw() +
facet_wrap(~ ccs) +
geom_bar() +
labs(y = "Mortgage Deny Count",
title = "Mortgage denial for varying ccs")
################################
# all
################################
# exclude incomplete entries
all <- Hdma[complete.cases(Hdma),]
# Facet the deny counts by ccs (complete cases only)
ggplot(all, aes(x = deny)) +
theme_bw() +
facet_wrap(~ ccs) +
geom_bar() +
labs(y = "Mortgage Deny Count",
title = "Mortgage denial for varying ccs")
# Deny relative to ccs and pbcr
ggplot(all, aes(x = deny, fill = pbcr)) +
theme_bw() +
facet_wrap(~ ccs) +
geom_bar() +
labs(y = "Mortgage Deny Count",
title = "Mortgage denial for varying ccs and pbcr")
# Deny relative to ccs and dmi
ggplot(all, aes(x = deny, fill = dmi)) +
theme_bw() +
facet_wrap(~ ccs) +
geom_bar() +
labs(y = "Mortgage Deny Count",
title = "Mortgage denial for varying ccs and dmi")
# breakdown of employed and self employed
ggplot(all, aes(x = self)) +
theme_bw() +
geom_bar() +
labs(y = "Self Employed",
title = "Self Employed")
# examining mortgage approval in relation to employed and self employed status
ggplot(all, aes(x = deny, fill = self)) +
theme_bw() +
facet_wrap(~ ccs) +
geom_bar() +
labs(y = "Mortgage Deny Count",
title = "Mortgage denial for varying ccs and self-employed")
# Deny relative to mcs and dmi
ggplot(all, aes(x = deny, fill = dmi)) +
theme_bw() +
facet_wrap(~ mcs) +
geom_bar() +
labs(y = "Mortgage Deny Count",
title = "Mortgage denial for varying mcs and dmi")
# setting out a histogram for lvr
ggplot(all, aes(x = lvr)) +
theme_bw() +
geom_histogram(binwidth = 0.1) +
labs(y = "number of mortgage application in lvr band",
x = "lvr (binwidth = 0.05)",
title = "lvr Distribtion")
# exploring lvr and likely effects on mortgage approval
ggplot(all, aes(x = lvr, fill = deny)) +
theme_bw() +
geom_histogram(binwidth = 0.1) +
labs(y = "number of mortgage application in lvr band",
x = "lvr (binwidth = 0.1)",
title = "lvr Distribtion")
############################################################
logit.fitlvr <- glm(deny ~ lvr,data=all,family="binomial")
summary(logit.fitlvr)
logit.fit <- glm(deny ~ .,data=all,family="binomial")
summary(logit.fit)
###########################################################
# exploring the relationship between dir and hir
ggplot(all, aes(x = dir, y = hir)) +
geom_point()
# exploring the relationship between dir and hir
ggplot(data = all) +
geom_point(mapping = aes(x = dir, y = hir, color = ccs))
ggplot(data = all) +
geom_point(mapping = aes(x = dir, y = hir, color = deny))
all %>%
filter(dir < 1, hir < 1) %>%
ggplot() +
geom_point(mapping = aes(x = dir, y = hir, color = ccs))
all %>%
filter(dir < 1, hir < 1) %>%
ggplot() +
geom_point(mapping = aes(x = dir, y = hir, color = deny))
g <- ggplot(all, aes(x=dir, y=hir)) + geom_point() + geom_smooth(method="lm") # set se=FALSE to turn off confidence bands
plot(g)
# exploring the relationship between dir and hir
ggplot(data = all) +
geom_point(mapping = aes(x = dir, y = hir, color = ccs)) +
facet_wrap(~ ccs, nrow = 2)
# exploring the relationship between dir and hir
ggplot(data = all) +
geom_point(mapping = aes(x = dir, y = hir, color = deny)) +
facet_wrap(~ ccs, nrow = 2)
# exploring the relationship between dir and hir
ggplot(data = all) +
geom_point(mapping = aes(x = dir, y = hir, color = deny)) +
facet_wrap(~ ccs, nrow = 2) +
geom_smooth(mapping = aes(x = dir, y = hir))
# exploring the relationship between dir and hir for dir < 1
all %>%
filter(dir < 1) %>%
ggplot() +
geom_point(mapping = aes(x = dir, y = hir, color = deny)) +
facet_wrap(~ ccs, nrow = 2) #+
# geom_smooth(mapping = aes(x = dir, y = hir))
cor(all$dir,all$hir)
all %>%
filter(dir < 1) %>%
ggplot() +
geom_point(mapping = aes(x = lvr, y = dir, color = deny)) +
facet_wrap(~ ccs, nrow = 2) +
geom_smooth(mapping = aes(x = lvr, y = dir))
cor(all$dir,all$lvr)
ggplot(data = all) +
geom_boxplot(mapping = aes(x = deny, y = dir))
ggplot(data = all) +
geom_boxplot(mapping = aes(x = deny, y = dir)) +
facet_wrap(~ ccs, nrow = 2)
all %>%
filter(dir < 2) %>%
ggplot() +
geom_boxplot(mapping = aes(x = deny, y = dir)) +
facet_wrap(~ ccs, nrow = 2)
all %>%
filter(lvr < 2) %>%
ggplot() +
geom_boxplot(mapping = aes(x = deny, y = lvr)) +
facet_wrap(~ ccs, nrow = 2)
all %>%
filter(lvr < 2) %>%
ggplot() +
geom_boxplot(mapping = aes(x = deny, y = lvr, color = self)) +
facet_wrap(~ ccs, nrow = 2)
# lvr boxplot for African American applicants relative to the rest of the population
all %>%
filter(lvr < 2) %>%
ggplot() +
geom_boxplot(mapping = aes(x = deny, y = lvr, color = black)) +
facet_wrap(~ ccs, nrow = 2)
# The following pivot tables provide another tool for aggregating and summarising relationships in the data
pivot1 <- all %>%
group_by(deny) %>%
summarize(Medianlvr = median(lvr, na.rm=TRUE),
count = n()) %>%
arrange(deny)
View(pivot1)
pivot2 <- all %>%
group_by(deny, dmi) %>%
summarize(Medianlvr = median(lvr, na.rm=TRUE),
count = n()) %>%
arrange(deny, dmi)
View(pivot2)
all.fit <- ctree(deny ~ .,data=all)
# Figure 5 in paper
#pdf("all.pdf",height=8,width=16)
plot(all.fit)
graphics.off()
pivot3 <- all %>%
group_by( deny, ccs, mcs) %>%
summarize(meandir = mean(dir, na.rm=TRUE),
count = n()) %>%
arrange(deny, ccs, mcs)
View(pivot3)
ggplot2 provides strong visualisation functionality and comes as part of the tidyverse suite. Below, we explore the interrelatedness of variables. Relatively strong patterns emerge through visualisation, which can later be interpreted more systematically using machine learning techniques. Exploratory Data Analysis is precisely that: while we may have a priori expectations, you should also allow the tidyverse graphs some free flow to reveal underlying patterns you may not have anticipated.
Box plots graphically depict numerical data through their quartiles. Box plots, in some formats, may additionally have lines extending from the boxes (whiskers) signalling variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers are typically plotted as individual points. Importantly, box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions about the underlying statistical distribution. Here we compare the loan-to-value ratios for self-employed and African American applicants. Some interesting trends emerge that may challenge or complicate the overt charge of racial discrimination.
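To make the statistics behind these boxplots explicit, the short sketch below (our own addition, assuming the all data frame of complete cases created earlier) reports the five-number summary of lvr by race, which is essentially the information a boxplot displays (ggplot2's whiskers additionally stop at 1.5 times the interquartile range).
# Five-number summary (min, lower hinge, median, upper hinge, max) of lvr by race
# Illustrative addition: assumes all <- Hdma[complete.cases(Hdma),] as above
all %>%
filter(lvr < 2) %>%
group_by(black) %>%
summarize(min = fivenum(lvr)[1],
lower = fivenum(lvr)[2],
median = fivenum(lvr)[3],
upper = fivenum(lvr)[4],
max = fivenum(lvr)[5],
count = n())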
A higher loan-to-value ratio creates risk for financial institutions. To understand this, it is worth looking at the Merton (1974) model of credit risk: reducing the equity stake of the borrower tends to increase the value of the guarantee implicitly provided by the lender. It is quite clear from the tidyverse analysis that African American mortgage applicants who were successful in being awarded mortgages had higher median lvrs. This is not consistent with negative discrimination in any conventional sense that we might easily understand, and it is not easily reconciled with the central thrust of Munnell et al (1996). Bostic (1996) points out that the typically higher lvr for Black applicants is consistent with the proposition that lenders may use different “rules-of-thumb” in considering loan applications across races. Differences in treatment have a very particular quality: "Minorities are not penalized along the loan-to-value dimension, as rejection probabilities for minority applicants do not vary over a wide range of loan-to-value ratios. On the other hand, minority applicants face significantly more stringent debt-to-income requirements. Further, the influence of race changes over ranges of these variables. The divergence in outcomes based on race decreases as an applicant’s debt burden decreases and as the loan-to-value ratio increases."
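To see why a higher loan-to-value ratio raises lender risk, the sketch below prices the borrower's implicit default option in the spirit of Merton (1974), treating the property as the underlying asset and the loan balance as the strike of a put option. All parameter values (property value, volatility, interest rate, horizon) are illustrative assumptions of ours, not estimates from the HMDA data.
# Illustrative Merton-style default put: the lender is effectively short a put
# on the property, with strike equal to the loan balance (assumed parameters)
merton_put <- function(V, L, sigma = 0.15, r = 0.05, tau = 5) {
d1 <- (log(V / L) + (r + 0.5 * sigma^2) * tau) / (sigma * sqrt(tau))
d2 <- d1 - sigma * sqrt(tau)
L * exp(-r * tau) * pnorm(-d2) - V * pnorm(-d1) # Black-Scholes put value
}
# default option value for lvr = 0.60 versus lvr = 0.95 on a property worth 100
merton_put(V = 100, L = 60) # low lvr: the default option is worth very little
merton_put(V = 100, L = 95) # high lvr: the default option, and hence lender risk, is far larger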
We extend the boxplot analysis to incorporate pivot tables. Excel is generally the default tool for setting up pivot tables, but tidyverse R offers more flexibility and sophistication, and we implement more insightful tidyverse pivot tables in the video below to examine patterns in the loan-to-value ratios.
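As a static counterpart to the video, a further pivot in the same style as pivot1 and pivot2 above groups by race and outcome; the object pivot4 below is our own illustrative addition rather than part of Varian's script.
# Median lvr and application counts by race and outcome (illustrative addition)
pivot4 <- all %>%
group_by(black, deny) %>%
summarize(Medianlvr = median(lvr, na.rm=TRUE),
count = n()) %>%
arrange(black, deny)
View(pivot4)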
Varian (2014) points out that the tree model produced by standard procedures that omits race fits the observed data just as well as a model that includes race. This differs qualitatively from the findings of Munnell et al (1996).
Hal uses the following snippets of R script to compare the relative performance of three approaches:
(1) CTree
(2) Logit
(3) RandomForests
Random forests generate many classification trees. To classify a new object from an input vector, we run that vector through each tree in the forest. Each tree votes for a classification, and the majority vote wins (for regression problems the average of the trees' predictions is used instead). Random forests also introduce additional randomness when growing the trees: rather than searching for the most important feature when splitting a node, they search for the best feature within a random subset of features, which yields greater diversity across trees. On the HMDA Boston data, the ctree misclassifies 228 of the 2,380 observations, an error rate of 9.6 percent. In comparison, a straight logit model does somewhat better, misclassifying 225, an error rate of 9.5 percent. The random forest misclassifies 223 of the 2,380 cases. Overall, the random forest produces a marginally better performance than the ctree.
###################################################
# R code for "Big Data: New Tricks for Econometrics
# Journal of Economic Perspectives 28(2), 3-28
# http://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.28.2.3
# Hal R. Varian
###################################################
# load libraries and data
library(party)
library(Ecdat)
data(Hdma)
# fix annoying spelling error
names(Hdma)[11] <- "condo"
# for reproducibility
set.seed(1234)
####################################
# all complete cases, all predictors
####################################
all <- Hdma[complete.cases(Hdma),]
all.fit <- ctree(deny ~ .,data=all)
all.pred <- predict(all.fit)
all.conf <- table(all$deny,all.pred)
all.conf
all.pred
all$deny
all.error <- all.conf[2,1]+all.conf[1,2]
all.error
#######################################
# no black predictor
#######################################
noblack <- all[,-12]
noblack.fit <- ctree(deny ~ .,data=noblack)
noblack.pred <- predict(noblack.fit)
# compare these predictions to the "all predictor" predictions
all.equal(all.pred,noblack.pred)
####################################
# remove predictors one-by-one and check error count
####################################
for (t in 1:12) {
drop1 <- all[,-t]
drop1.fit <- ctree(deny ~ .,data=drop1)
drop1.pred <- predict(drop1.fit)
drop1.conf <- table(drop1$deny,drop1.pred)
error <- (drop1.conf[2,1]+drop1.conf[1,2])
print(c(names(all)[t],format((error-all.error),digits=4)))
}
# You should get this output
# [1] "dir" "2"
# [1] "hir" "0"
# [1] "lvr" "6"
# [1] "ccs" "8"
# [1] "mcs" "0"
# [1] "pbcr" "12"
# [1] "dmi" "36"
# [1] "self" "0"
# [1] "single" "0"
# [1] "uria" "0"
# [1] "condo" "0"
# [1] "black" "0"
#######################################
# compare to logit
#######################################
logit.fit <- glm(deny ~ .,data=all,family="binomial")
logit.temp <- predict(logit.fit,type="response")
logit.pred <- logit.temp > .5
logit.conf <- table(all$deny,logit.pred)
logit.conf
logit.pred
logit.error <- logit.conf[1,2]+logit.conf[2,1]
logit.error
summary(logit.fit)
#######################################
# compare to random forest
######################################
library(randomForest)
# package start-up message: randomForest 4.5-36
# Type rfNews() to see new features/changes/bug fixes.
set.seed(1234)
rf.fit <- randomForest(deny ~ .,data=all,importance=T)
rf.pred <- predict(rf.fit,type="class")
rf.conf <- table(all$deny,rf.pred)
rf.conf
rf.pred
error <- rf.conf[1,2]+rf.conf[2,1]
error
imp <- importance(rf.fit)
rev(sort(imp[,3]))
imp
rf.fit
# importance plot
varImpPlot(rf.fit)
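To recover the error rates quoted in the discussion above (roughly 9.6 percent for the ctree and 9.5 percent for the logit, with the random forest slightly lower), the misclassification counts can be divided by the number of complete cases. This small addition is ours rather than part of Varian's original script, and assumes the objects all.error, logit.error and rf.conf created above.
# convert misclassification counts into error rates (our addition)
n <- nrow(all)
c(ctree = all.error / n,
logit = logit.error / n,
rf = (rf.conf[1,2] + rf.conf[2,1]) / n)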
The Munnell et al (1996) paper found that, before adjusting for any of the control variables, the loan denial rate was 10 percent for whites and 28 percent for minorities. This large differential is greatly reduced when personal and property characteristics are controlled for, since those characteristics tend to be disproportionately unfavorable to minorities. Once these variables are taken into account, either through ordinary least squares regressions or logit models, the gap shrinks from 18 percentage points to about 8 percentage points. Ladd (1998) provides a useful summary of the Munnell et al (1996) results. In the analysis below we develop in a little more depth the logistic regression approach, which was introduced by Varian (2014) and produced a relatively smaller error.
For a more comprehensive treatment of logistic regression and probit qualitative choice models, please follow the link to chapter 11 of the online text Introduction to Econometrics with R.
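Concretely, the logit model estimated in the script below specifies the denial probability as a logistic transformation of a linear index in the regressors; for the two-regressor specification denylogit2 this reads

\[
\Pr(\text{deny} = 1 \mid \text{dir}, \text{black}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \,\text{dir} + \beta_2\, \text{black})}}
\]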
##############################
# From Book
# Introduction to Econometrics with R
# Based on Stock and Watson Book
# https://www.econometrics-with-r.org/1-introduction.html
# help("StockWatson2007")
# load `AER` package and attach the Hdma data
#library(AER)
# Data and Examples from Stock and Watson (2007)
#From Hal Varian
library(Ecdat)
library(party)
data(Hdma)
# fix annoying spelling error
names(Hdma)[11] <- "condo"
# use the complete cases, as in the earlier script
all <- Hdma[complete.cases(Hdma),]
# convert 'deny' to numeric (0 = approved, 1 = denied)
all$deny <- as.numeric(all$deny) - 1
# estimate a simple linear probability model
denymod1 <- lm(deny ~ dir, data = all)
denymod1
# plot the data
plot(x = all$dir,
y = all$deny,
main = "Scatterplot Mortgage Application Denial and the Monthly Debt-to-Income Ratio",
xlab = "dir",
ylab = "Deny",
pch = 20,
ylim = c(-0.4, 1.4),
cex.main = 0.8)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")
# add the estimated regression line
abline(denymod1,
lwd = 1.8,
col = "steelblue")
# print robust coefficient summary
# coeftest(denymod1, vcov. = vcovHC, type = "HC1")
# the race variable is already named 'black' in the Ecdat data, so no renaming is needed
# estimate the model
denymod2 <- lm(deny ~ dir + black, data = all)
# coeftest(denymod2, vcov. = vcovHC)
summary(denymod2)
denylogit <- glm(deny ~ dir,
family = binomial(link = "logit"),
data = all)
#coeftest(denylogit, vcov. = vcovHC, type = "HC1")
denylogit
#plot data
plot(x = all$dir,
y = all$deny,
main = "Logit Model of the Probability of Denial, Given Debt/Income Ratio",
xlab = "Debt/Income ratio",
ylab = "Deny",
pch = 20,
ylim = c(-0.4, 1.4),
cex.main = 0.9)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")
# add the estimated regression line of the Logit model
x <- seq(0, 3, 0.01)
y_logit <- predict(denylogit, list(dir = x), type = "response")
lines(x, y_logit, lwd = 1.5, col = "black", lty = 2)
# add a legend
legend("topleft",
horiz = TRUE,
legend = c( "Logit"),
col = c("black"),
lty = c( 2))
#estimate a Logit regression with multiple regressors
denylogit2 <- glm(deny ~ dir + black,
family = binomial(link = "logit"),
data = all)
#coeftest(denylogit2, vcov. = vcovHC, type = "HC1")
summary(denylogit2)
# 1. compute predictions for Debt/Income ratio = 0.3
predictions <- predict(denylogit2,
newdata = data.frame("black" = c("no", "yes"),
"dir" = c(0.3, 0.3)),
type = "response")
predictions
# 2. Compute difference in probabilities
diff(predictions)
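As a short illustrative extension of the two predictions above (not part of the original chapter 11 code), the sketch below traces the predicted denial probabilities for Black and non-Black applicants over a grid of debt-to-income ratios, so the race gap implied by denylogit2 can be seen across the whole dir range rather than only at 0.3.
# 3. trace the predicted race gap over a grid of dir values (illustrative extension)
dir_grid <- seq(0, 1.5, by = 0.05)
prob_black <- predict(denylogit2,
newdata = data.frame(black = "yes", dir = dir_grid),
type = "response")
prob_white <- predict(denylogit2,
newdata = data.frame(black = "no", dir = dir_grid),
type = "response")
plot(dir_grid, prob_black, type = "l", col = "darkred", ylim = c(0, 1),
xlab = "Debt/Income ratio", ylab = "Predicted P(deny)",
main = "Predicted denial probabilities by race (denylogit2)")
lines(dir_grid, prob_white, col = "steelblue")
legend("topleft", legend = c("black = yes", "black = no"),
col = c("darkred", "steelblue"), lty = 1)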
In the video, we continue with the analysis presented in chapter 11 of the online text Introduction to Econometrics with R. Towards the end we also illustrate how to share your RStudio Cloud project with others, which is important because it allows collaboration between colleagues and teams.
Day and Liebowitz (1998) heavily criticize the findings of Munnell, Tootell, Browne, and McEneaney (1996). The conclusions of the latter precipitated a strong response from fellow scholars and researchers within the Federal Reserve. Day and Liebowitz (1998), for example, point out that:
"Rational mortgage lenders in competitive markets should approve any loan that has an expectation of earning a positive return. Although racial discrimination in commercial transactions might sometimes be a rational financial response to third party effects, the existence of financial gains from racial discrimination seems far less likely for mortgage lending. For example, in housing markets, real estate agents may discriminate against minorities because they are afraid of alienating potential white customers who might prefer not to have minorities in their neighborhoods. Similarly, the owners of retail establishments might discriminate against minority customers because their white customers prefer not to associate with minorities. Or white managers might discriminate against minority workers because their white workers prefer not to have minority coworkers. In each of these examples, the discriminator suffers a specific economic harm by engaging in discrimination: lost real estate commissions, lost sales, or lower productivity. This direct loss, however, might be outweighed by the indirect gain brought about by avoiding the alienation of a large customer base or work force. Thus economic self-interest and competition can not necessarily be counted on to keep discrimination at bay in a world where third parties are bigoted. For mortgage lenders, however, there is little concern with third party effects. Mortgage lenders making loans to minority applicants are not likely to suffer negative consequences from other customers for the simple reason that bigoted homeowners objecting to new minority neighbors have more direct objects of scorn -- the seller, or the real estate agent. Further, the source of the loan is generally unknown to the neighbors. Thus, economic self-interest punishes any act of bigotry in the home mortgage market more fully than might be expected in many other circumstances. Economic self-interest, therefore, should reduce racial discrimination in this market more completely than in many others. In addition, special programs and regulatory incentives inducing banks to increase their mortgage lending to minorities are countervailing forces that might be thought to provide minorities some advantages in securing mortgage financing. Additionally, it seems logical to expect that competitive forces should work to eliminate discrimination."