The Roy model is one of the more important "structural" or "behavioral" models in economics. As in the demand model, the individual unit is assumed to face a fixed number of options and to prefer the chosen one to the others. The difference is that in the demand model only the choice is observed, while in the Roy model both the choice and the outcome from the choice are observed. If we are willing to accept the behavioral assumption of the Roy model, then the observed data can provide a lot of information about the distribution of treatment effects.
The Roy model comes in a number of different flavors. But all attempt to use the outcomes we observe to infer something about the counter-factual outcomes for the alternative treatment, noting that the choice was not random. Rather than simply throwing up our hands when we find that assignment into treatment is not random, the Roy model says we can use that fact to make inference.
The strongest version of the revealed preference assumption states that for each individual their observed outcome in the assigned treatment is greater than their unobserved potential outcome in the treatment that was not assigned. So not only are individuals making rational choices, they also have "rational expectations." They know exactly what will happen to them, or at least they guess what will happen to them with amazing accuracy. From this assumption and the observation that some individuals are assigned to a particular treatment, we can infer that treatment is minimally causal. If we observe at least some individuals going to college, then for those individuals their income from going to college must be higher than their income from not going to college. For those individuals, college causes higher incomes.
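The logic of the strong assumption can be illustrated with a small simulation. This is only a sketch: the variable names, the normal distributions, and the parameter values are all hypothetical and are not taken from the NLS data.

```r
# Sketch of the strong revealed-preference assumption with simulated data.
# All names and parameter values are hypothetical.
set.seed(1)
n  <- 1000
y0 <- rnorm(n, mean = 6.0, sd = 0.5)  # potential log income without college
y1 <- rnorm(n, mean = 6.2, sd = 0.5)  # potential log income with college
d  <- as.integer(y1 > y0)             # each person chooses the higher outcome

y_obs <- ifelse(d == 1, y1, y0)       # what the econometrician observes
y_cf  <- ifelse(d == 1, y0, y1)       # counterfactual, never observed in real data

# For everyone who chose college, the individual treatment effect is positive.
effect_college <- (y1 - y0)[d == 1]
all(effect_college > 0)

# The observed outcome is never below the counterfactual outcome.
all(y_obs >= y_cf)
```

In real data only `y_obs` and `d` are available. The assumption is what lets us sign `y1 - y0` for the people observed in college without ever seeing `y_cf`.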
A weaker version, and possibly the weakest version, of the revealed preference assumption states that for each individual the distribution of outcomes in their assigned choice first-order stochastically dominates the unobserved distribution of potential outcomes for that individual. Under this assumption, individuals do not have rational expectations; rather, they know that the outcome in their preferred treatment is probably higher than their outcome in the alternative treatment. If we are going to make a revealed preference assumption, this one seems more reasonable. More "credible," in the language of Chuck Manski. Unfortunately, as shown in Bounds Math and discussed in Order Bounds, the revealed-preference assumption is not enough to make causal statements from observational data (it is not minimally causal).
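First-order stochastic dominance means that the dominating distribution's cumulative distribution function lies at or below the other one at every point. The sketch below checks that condition for two samples. The helper `fosd` and the simulated samples are hypothetical and only illustrate the definition; the individual-level distributions in the assumption are not directly observable.

```r
# Hypothetical helper: does sample `a` first-order stochastically dominate `b`?
# Dominance holds when the empirical CDF of `a` is never above that of `b`.
fosd <- function(a, b) {
  grid <- sort(unique(c(a, b)))
  all(ecdf(a)(grid) <= ecdf(b)(grid))
}

set.seed(2)
b <- rnorm(500)
a <- b + 0.5   # every draw shifted up by the same amount

fosd(a, b)  # TRUE: the shifted sample dominates
fosd(b, a)  # FALSE: the reverse does not hold
```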
While the revealed-preference assumption cannot work by itself, Roy Math shows that it is possible for observational data to show minimally causal results when revealed-preference is combined with instrumental variables. Following an idea Chuck Manski calls "level set restrictions," it shows that even Natural Bounds can be informative with respect to causal effects. It shows that because the instrument only changes the probability of assignment to treatment and not the treatment outcomes (by assumption), there may be enough difference in the observed outcomes between treatments and enough difference in assignment probabilities by observed subset that we can make a minimally causal claim. While natural bounds can show a result, the result is more likely if additional restrictions such as revealed-preference are placed on the problem.
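The way an instrument tightens natural bounds can be sketched as follows. Within each value of the instrument, the probability that the potential outcome under a given treatment is at or below some threshold is bounded by the observed share in that treatment below the threshold (lower) and that share plus the share not in the treatment (upper). If the instrument does not affect potential outcomes, the same quantity is bounded in every subset, so we can take the largest lower bound and the smallest upper bound. The helper `iv_cdf_bounds` and the toy data are hypothetical.

```r
# Hypothetical helper: bounds on P(Y_treat <= t0), intersected across instrument levels.
iv_cdf_bounds <- function(y, d, z, treat, t0) {
  groups <- split(seq_along(y), z)
  lower <- sapply(groups, function(i) mean(y[i] <= t0 & d[i] == treat))
  upper <- sapply(groups, function(i) mean(y[i] <= t0 & d[i] == treat) + mean(d[i] != treat))
  c(lower = max(lower), upper = min(upper))
}

# Toy data: outcome y, treatment d, binary instrument z.
y <- c(1, 2, 3, 4, 1, 2, 3, 4)
d <- c(1, 1, 0, 0, 1, 1, 1, 0)
z <- c(0, 0, 0, 0, 1, 1, 1, 1)

# Natural bounds that ignore the instrument: 0.5 and 0.875.
natural <- c(lower = mean(y <= 2 & d == 1),
             upper = mean(y <= 2 & d == 1) + mean(d != 1))
natural

# Bounds that use the instrument: 0.5 and 0.75.
iv_cdf_bounds(y, d, z, treat = 1, t0 = 2)
```

In this toy example the upper bound falls from 0.875 to 0.75 because treatment take-up is higher when `z` is 1, which leaves less of the distribution unobserved in that subset.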
Consider determining whether completing high school increases income. In Order Bounds I show that if we are willing to make a monotone treatment matching assumption, we can show that completing high school does in fact cause some people to get higher incomes (in NLS 66). What if we are not willing to make that assumption, but would rather make a revealed preference assumption? In addition, suppose we are willing to assume that proximity to a 4-year college increases the likelihood of completing high school but does not otherwise increase income.
The chart produced by the code at the end of this section presents results from combining a revealed preference assumption of first-order stochastic dominance with an assumption that proximity to a 4-year college is a valid instrument. The chart shows the estimated cumulative distribution of log wages for the two treatment groups.
Unfortunately, our worst-case estimate of the income distribution conditional on completing high school always sits below our best-case estimate of the income distribution conditional on not completing high school. That is, we are unable to categorically state that at least some people have higher incomes because they finished high school.
While it is possible for this method to show a minimally causal effect, in this case it does not.
A substantially stronger version of the Roy Model states that the observed outcome for the individual unit in the observed treatment must be greater than the unobserved potential outcome for that unit in the alternative treatment. Heckman and Honore (1990) show that with enough variation in prices (instrumental variables), it is possible to identify the whole joint treatment distribution. That is, this model allows the possibility of completely estimating the individual treatment distribution. The problem is that the model requires a lot of the individual and the data. It requires that the individual knows their potential outcomes and the "prices" associated with choosing different treatments. Moreover, it is necessary to make strong assumptions about the individual's utility representation.
Consider the problem of estimating the joint distribution of treatment outcomes for the two alternatives of completing high school and not completing high school. This version of the Roy model assumes that the individual makes the choice on whether to complete high school based on their (rational) expectations of their income ten years hence from either choice. In addition, while there exist instruments like proximity to a 4-year college, it is far from clear how to incorporate this information into the individual's utility function and thus our inference of their potential income in the alternative choice.
That is not to say there aren't efforts to work around these issues. The interested reader may consider the handbook chapter by Heckman and Vytlacil (2007).
# load data in from proximity.zip
# http://davidcard.berkeley.edu/data_sets.html
x <- read.delim("nls.dat",sep="",header=FALSE, stringsAsFactors = FALSE) # SAS data
y <- read.csv("names.csv",stringsAsFactors = FALSE,header = FALSE) # created from a file of variable names in the log file.
colnames(x) <- as.vector(y$V1)
x$lwage76 <- as.numeric(x$lwage76)
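# IV SD
# lwage_max is not defined in this snippet; ASSUMPTION: it is the largest observed log wage.
lwage_max <- max(x$lwage76, na.rm = TRUE)
near <- x[x$nearc4==1,] # subset once so the test and both values in ifelse() have the same length
ncoll_wage_max <- ifelse(near$ed76>11,lwage_max,near$lwage76) # grade 12 or above gets the maximum wage
lower_bound <- ks.test(ncoll_wage_max, near$lwage76, alternative = "less") # this one is not interesting.
upper_bound <- ks.test(ncoll_wage_max, near$lwage76, alternative = "greater") # this one gives the appropriate value.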
lower_bound
upper_bound
# Plot
F2_max <- ecdf(x[x$nearc4==1,]$lwage76)
F1_min <- ecdf(ncoll_wage_max)
plot(F2_max,main="Distribution of Income by Education Level",xlab="log wages")
lines(F1_min,col="red")
legend(4.3,1,c("Grade 12 or above (min)", "Grade 11 or Below (max)"),c("black","red"))