In statistics we are often interested in determining the causal relationship between two variables. Suppose we want to predict the impact of a policy that changes X on an outcome Y. To make that prediction, it helps to understand the causal relationship between X and Y. For example, we may be contemplating a policy that mandates that all students in a state complete high school. One way to measure the value of such a policy is to look at the impact of education on income ten years in the future.
The chart above presents the distribution of 1976 log wages for students surveyed about education (and other things) in 1966. The chart shows that students who did not complete high school have lower incomes than students who did complete high school.
One problem with using this data to predict that compulsory high school will lead to higher incomes is confounding.
The figure to the left illustrates confounding. The arrows represent causal relationships between the variables. X is the policy variable of interest, Y is the outcome of interest, and U denotes characteristics that are unobserved by the statistician. In the diagram, the unobserved characteristics have a causal effect on both the outcome and the policy variable.
This diagram represents the statement that correlation is not causation. The observed statistical relationship between X and Y measures both the causal arrow from X to Y and the causal relationship from U to X and U to Y. For policy analysis we are interested in only the causal relationship between X and Y as the policy change would only affect X and not U.
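The point can be checked with a small simulation. The sketch below is illustrative only: the sample size, the coefficients, and the names `u_sim`, `x_sim`, and `y_sim` are assumptions chosen for the example and do not come from the wage data. It generates an unobserved U that drives both X and Y, then compares the naive regression slope with the causal effect that was built in.

```r
# Simulated illustration of confounding (all numbers are made up).
set.seed(1)
n_sim <- 10000
true_effect <- 0.5                        # assumed causal effect of X on Y

u_sim <- rnorm(n_sim)                     # unobserved characteristic U
x_sim <- 0.8 * u_sim + rnorm(n_sim)       # policy variable X, partly driven by U
y_sim <- true_effect * x_sim + u_sim + rnorm(n_sim)  # outcome Y depends on X and U

# Naive regression of Y on X ignores U, so its slope mixes both causal paths.
naive_slope <- unname(coef(lm(y_sim ~ x_sim))["x_sim"])

# Controlling for U recovers the causal effect. This is only possible in a
# simulation, because in real data U is unobserved.
adjusted_slope <- unname(coef(lm(y_sim ~ x_sim + u_sim))["x_sim"])

cat("true effect:   ", true_effect, "\n")
cat("naive slope:   ", round(naive_slope, 3), "\n")
cat("adjusted slope:", round(adjusted_slope, 3), "\n")
```

In this setup the naive slope overstates the effect of X, because part of the observed association between X and Y runs through U.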
An instrumental variable is an observed characteristic in the data that has a causal effect on the policy variable (X) but is not confounded by unobserved characteristics. The instrumental variable is represented by Z in the diagram above. The arrow from Z to X represents the fact that the instrumental variable has a causal effect on X, and the lack of an arrow from U to Z represents the fact that the instrumental variable is not confounded.
The randomization procedure used in randomized controlled trials is an example of an instrumental variable. The randomization procedure directly determines which treatment the patient receives and by design it is not affected by unobserved characteristics.
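A simulated trial shows how a randomized assignment can be used as an instrument. This is a minimal sketch under stated assumptions: a constant treatment effect, imperfect compliance driven by an unobserved U, and made-up coefficients. The names `z_sim`, `d_sim`, and `wage_sim` are hypothetical. The estimate divides the effect of assignment on the outcome by the effect of assignment on the treatment actually received, a ratio commonly called the Wald estimator.

```r
# Simulated randomized trial with imperfect compliance (all numbers are made up).
set.seed(2)
n_trial <- 50000
true_effect <- 0.5                          # assumed constant causal effect

u_trial <- rnorm(n_trial)                   # unobserved characteristic U
z_sim <- rbinom(n_trial, 1, 0.5)            # randomized assignment Z (the instrument)

# Treatment actually received: assignment raises uptake, but U matters too.
d_sim <- as.numeric(2 * z_sim + 0.8 * u_trial + rnorm(n_trial) > 1)

# Outcome depends on the treatment received and on U.
wage_sim <- true_effect * d_sim + u_trial + rnorm(n_trial)

# Naive comparison by treatment received is confounded by U.
naive_diff <- mean(wage_sim[d_sim == 1]) - mean(wage_sim[d_sim == 0])

# Effect of assignment on the outcome and on the treatment received.
assignment_effect <- mean(wage_sim[z_sim == 1]) - mean(wage_sim[z_sim == 0])
first_stage <- mean(d_sim[z_sim == 1]) - mean(d_sim[z_sim == 0])

# Wald (instrumental variable) estimate: ratio of the two differences.
wald_estimate <- assignment_effect / first_stage

cat("true effect:      ", true_effect, "\n")
cat("naive difference: ", round(naive_diff, 3), "\n")
cat("Wald IV estimate: ", round(wald_estimate, 3), "\n")
```

Because Z is randomized, it is independent of U, so comparing groups by assignment is not confounded even though comparing groups by treatment received is.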
When studying the effect of attending college on income, David Card suggested using the student's proximity to a college as an instrumental variable. The assumption is that students who live near a college are more likely to attend college, but that there are no unobserved characteristics that determine both income and whether the student grew up near a college.
The chart above compares the distribution of wages for students living near a college with that for students not living near one. The difference between the two distributions is often called the "intent-to-treat effect". Intent to treat refers to the idea that the initial assignment of treatment was random (or exogenous), but the final assignment prior to observation of the outcomes may not have been. If we assume that proximity to college is in fact an instrumental variable, then by the Fréchet-Hoeffding bounds this difference is evidence of at least a minimal causal effect.
# load data in from proximity.zip
# http://davidcard.berkeley.edu/data_sets.html
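# nls.dat is the SAS data set; sep = "" splits fields on any whitespace.
# na.strings = "." is an assumption: it treats a lone "." as missing (the SAS convention).
x <- read.delim("nls.dat", sep = "", header = FALSE, stringsAsFactors = FALSE, na.strings = ".")
# names.csv was created by hand from the log file: a single column of variable names.
y <- read.csv("names.csv", stringsAsFactors = FALSE, header = FALSE)
colnames(x) <- as.vector(y$V1)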
H1 <- ecdf(x[x$nearc4==0,]$lwage76)
H2 <- ecdf(x[x$nearc4==1,]$lwage76)
# Plot Intent-To-Treat: wage distributions by proximity to college
plot(H2, main = "Distribution of Log Wages by Proximity to College", xlab = "log wages")
lines(H1, col = "red")
legend(6.3, 0.3, c("Near College", "Not Near College"), col = c("black", "red"), lty = 1)
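The plot compares whole distributions. A single-number summary of the same comparison is the difference in mean log wages between the two groups. The sketch below is one way to compute it. The helper `itt_difference` is hypothetical, and the data frame `sim` is made-up stand-in data that only borrows the column names `nearc4` and `lwage76`, so its output says nothing about the real survey.

```r
# Hypothetical helper: difference in mean outcome between the two instrument groups.
itt_difference <- function(data, outcome = "lwage76", instrument = "nearc4") {
  # Drop rows where either the outcome or the instrument is missing.
  keep <- !is.na(data[[outcome]]) & !is.na(data[[instrument]])
  outcome_values <- data[[outcome]][keep]
  group <- data[[instrument]][keep]
  mean(outcome_values[group == 1]) - mean(outcome_values[group == 0])
}

# Made-up stand-in data with the same column names as the survey data.
set.seed(3)
sim <- data.frame(nearc4 = rbinom(5000, 1, 0.7))
sim$lwage76 <- 6.2 + 0.1 * sim$nearc4 + rnorm(5000, sd = 0.4)
sim$lwage76[sample(5000, 200)] <- NA   # mimic missing wage records

itt_sim <- itt_difference(sim)
cat("simulated intent-to-treat difference:", round(itt_sim, 3), "\n")
```

With the survey data loaded as above, `itt_difference(x)` would give the corresponding figure, assuming `lwage76` was read in as a numeric column and `nearc4` is coded 0/1.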