In economics and other applied statistics it is common to want to predict the impact of policies that do not occur in the data. Consider the problem of approving and providing a vaccine. In general, the trial data available on the vaccine is unlikely to provide the information needed to determine the effectiveness of the policy. The trial data may provide information about the safety of the vaccine or about its effectiveness for an individual patient exposed to the disease. The trial data will not in general provide information about how the vaccine immunizes the population as a whole. If the disease is contagious, then the vaccine has two effects. First, it reduces the likelihood that the person taking the vaccine will catch the disease. Second, it reduces the likelihood that a person who did not take the vaccine will catch the disease. In order to predict the total effect on the population we need a model of disease transmission and we need estimates of the model's underlying parameters.
Similarly, suppose we are interested in investing in a public transportation system like the Bay Area Rapid Transit (BART) system. In order to estimate the value of such a policy we need to estimate the likelihood that various people will use the system and how much they would be willing to pay in fares and parking lot fees. The problem with such an estimation is that BART may not exist yet. Commuters may have access to buses and cars but may never have ridden a subway system. One solution is to simply ask people. We could send out surveys to the population of the city and ask them whether they would use a subway system if one existed. An alternative is to model how people will behave when faced with access to a subway system and use data on their current behavior to estimate the required parameters of the model.
In order to estimate how people would behave when provided access to a subway system, Daniel McFadden developed the discrete choice model of demand. In this model, individuals are assumed to have access to a finite set of choices, for example driving or taking the bus. Each choice is associated with a price, which may include direct costs like a bus fare or a toll, as well as indirect costs like gasoline prices and car maintenance. In the data we observe the actual choice made by the individual when faced with the different options and their associated prices.
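The choice rule described above can be sketched in a small simulation. This is only an illustration under stated assumptions: the prices, the price sensitivity `alpha`, and the Type I extreme value (Gumbel) taste shocks are all hypothetical choices made here, not values taken from the text. With Gumbel shocks, the simulated share should be close to the familiar logit formula.

```r
set.seed(42)
n <- 100000                      # simulated commuters (hypothetical)
alpha <- 0.5                     # assumed price sensitivity
price <- c(car = 6, bus = 4)     # assumed full prices: fare or toll plus indirect costs

# Type I extreme value (Gumbel) taste shocks via the inverse CDF
gumbel <- function(n) -log(-log(runif(n)))

# Utility of each option = -alpha * price + idiosyncratic taste shock
u_car <- -alpha * price[["car"]] + gumbel(n)
u_bus <- -alpha * price[["bus"]] + gumbel(n)

# Each commuter picks the option with the higher utility
choose_car <- u_car > u_bus
share_sim <- mean(choose_car)

# With Gumbel shocks the choice probability takes the logit form
share_logit <- plogis(-alpha * (price[["car"]] - price[["bus"]]))

c(simulated = share_sim, logit = share_logit)
```

In the data we would observe only `choose_car` and the prices; the estimation problem is to recover `alpha` from them.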
The picture to the right represents how we think about demand. The solid curved line is the demand function, which represents the preferences of the population. The vertical dashed line is the price. The area under the curve lying above the dashed line is the "share" or observed quantity purchased. From economic theory or revealed preference we know that everyone who purchased the product at a particular price preferred that product over the alternatives given the observed prices.
Once the model is estimated, we can use it to simulate the impact of various policies.
If we are able to observe a large number of different prices and the corresponding shares, we can map out our demand function.
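As a rough illustration of mapping out a demand function, the sketch below simulates consumers with a hypothetical distribution of willingness to pay (the log-normal form and its parameters are assumptions made for this example) and computes the share that buys at each of a grid of prices.

```r
set.seed(1)
# Hypothetical willingness to pay for 10,000 simulated consumers (assumed log-normal)
wtp <- rlnorm(10000, meanlog = 3, sdlog = 0.5)

# A grid of observed prices
prices <- seq(5, 60, by = 5)

# Share at each price: fraction of consumers whose willingness to pay exceeds the price
shares <- sapply(prices, function(p) mean(wtp > p))

demand <- data.frame(price = prices, share = shares)
print(demand)

# Plot with share on the horizontal axis and price on the vertical axis
plot(demand$share, demand$price, type = "b", xlab = "share", ylab = "price")
```

Because everyone who buys at a high price would also buy at a lower one, the simulated shares fall as the price rises.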
The picture on the right presents data on the demand for various fishing "modes" such as charter boat fishing or fishing from a pier. The data comes from a survey where respondents were asked about the relative prices of various fishing modes, as well as their choice. The chart was created by calculating the log of relative prices for charter fishing and pier fishing. Then for each decile of relative prices, the probability of choosing charter fishing was calculated.
The picture shows what we imagine data on demand might look like. It seems to show people following the "law of demand." As the relative price of charter fishing increases, the relative demand for charter fishing falls.
Often, however, the data looks more like the picture on the left than the picture above. When faced with such data, economists have proposed a couple of solutions.
One solution is to expand the definition of "price". We can think of each product being represented by an "index" which is a linear combination of observed characteristics of the product. If the product is ready-to-eat cereal, then the index may include observed characteristics like sugar content and "sogginess".
The biggest problem with estimating demand is that the variation in prices that we do observe may not be determined exogenously. This is the classic identification problem in economics. When observing prices and quantities we don't know if the change in price is due to a change in demand or a change in supply.
If we think of "price" as our treatment and "share" as our outcome of interest, then this demand identification problem is no different from standard identification problems. As such we may think to use a standard solution like instrumental variables. The most obvious instruments are what are called "cost shifters." These are observed characteristics that are likely to affect the supply (and thus the price) of the product but are unlikely to be associated with changes in demand. They may include variation in energy costs or labor costs. The Demand Math section shows that, in general, standard economic theory directly contradicts the assumptions that need to hold for the IV approach. This is a problem.
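To make the cost-shifter idea concrete, here is a minimal simulation of the textbook linear supply and demand case, where the IV assumptions hold by construction. That is exactly what the discrete choice setting does not guarantee, so this is a sketch of the logic rather than a recommended estimator. All parameter values and variable names are assumptions made for the example.

```r
set.seed(123)
n <- 10000
u <- rnorm(n)            # unobserved demand shock
z <- rnorm(n)            # observed cost shifter (e.g. energy costs), independent of u by construction

b <- 1                   # true demand slope is -b
d <- 0.5                 # supply slope

# Demand: q = 10 - b * p + u ; inverse supply: p = 1 + d * q + z
# Solving the two equations gives the equilibrium price and quantity:
p <- (1 + d * 10 + d * u + z) / (1 + d * b)
q <- 10 - b * p + u

# OLS of quantity on price mixes demand shifts with supply shifts
ols_slope <- unname(coef(lm(q ~ p))[2])

# IV (Wald ratio): effect of z on q divided by effect of z on p
reduced_form <- unname(coef(lm(q ~ z))[2])
first_stage  <- unname(coef(lm(p ~ z))[2])
iv_slope <- reduced_form / first_stage

c(true = -b, ols = ols_slope, iv = iv_slope)
```

In this setup the OLS slope is pulled toward zero because the demand shock `u` raises both price and quantity, while the ratio estimate uses only the price variation driven by `z`.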
Steven Berry came up with a very interesting solution.
In Berry's model, different products in different markets have different unobserved characteristics, which he labeled with the Greek letter Xi (the Greek X). Observed characteristics are labeled X, and unobserved characteristics are labeled Xi. Berry places the unobserved characteristics in the index with the observed characteristics and prices. It has been shown that if some additional assumptions are placed on the Berry model, the standard instrumental variable approach will work. However, things are turned around so that our treatment is "share" and our outcome of interest is "price" (or the index). Thus we need to find observed characteristics that are associated with variation in demand and that affect price only through demand. Examples of such instruments may be demographic characteristics such as income.
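One common way to write the index described above is the following. The symbols other than X and Xi (the subscripts j for product and t for market, the coefficients, and the share function) are notation assumed here for illustration, not taken from the text.

```latex
% Index for product j in market t: observed characteristics, price, and the unobserved characteristic
\delta_{jt} = X_{jt}\beta - \alpha p_{jt} + \xi_{jt}

% Shares are a function of the indices of all products in the market ...
s_{jt} = \sigma_j(\delta_{1t}, \ldots, \delta_{Jt})

% ... and under additional assumptions this relationship can be inverted,
% so the index (and hence price) is written as a function of observed shares
\delta_{jt} = \sigma_j^{-1}(s_{1t}, \ldots, s_{Jt})
```

Written this way, the unobserved characteristic enters linearly once shares are inverted, which is why instruments for share become the relevant ones.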
The chart to the left presents "demand curves" for high-income respondents (income above the 80th percentile of the sample) and low-income respondents (income below the 80th percentile). It looks like demand for charter fishing is lower for those with higher incomes. That may be true, but the conclusion assumes prices are determined exogenously across the sample.
Another possibility is that for each level of demand, high income people face higher prices for charter fishing relative to pier fishing. If prices are set endogenously then this will affect our estimates of demand.
Demand Math discusses the Berry model in more detail, but the interested reader is encouraged to work through the Berry and Haile paper, noting that it is very technical.
To illustrate the IV approach, consider a simple version with the fishing mode data. If you estimate a linear probability model of the charter choice on relative price, the slope coefficient is -0.032, which we interpret loosely as the elasticity of demand for charter fishing. The concern is that prices are not determined exogenously and there are issues with using the standard IV approach. The Berry approach suggests estimating the inverse relationship and finding an instrument for demand. Our intent-to-treat regression is price on income and the first-stage regression is share on income. A rough IV estimate is given by the intent-to-treat estimate divided by the first-stage estimate. If we invert that to get the effect of price on share, we have -0.045. This suggests that the relative demand for charter fishing is more elastic than we would get if we did not account for the endogeneity of prices.
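The ratio calculation described above can be written out as follows. The symbols are notation introduced here for illustration: the two pi terms are the income coefficients from the price regression and from the share regression.

```latex
% Intent-to-treat: price on income.  First stage: share on income.
p_i = \gamma_p + \pi_p \,\mathrm{income}_i + e_i, \qquad
y_i = \gamma_s + \pi_s \,\mathrm{income}_i + v_i

% Rough IV estimate of the effect of share on price
\hat{\beta}_{IV} = \frac{\hat{\pi}_p}{\hat{\pi}_s}

% Inverting gives the effect of price on share
\frac{1}{\hat{\beta}_{IV}} = \frac{\hat{\pi}_s}{\hat{\pi}_p}
```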
library(Ecdat)
data("Fishing") # load the Fishing data set from Ecdat
x <- Fishing
x$mode <- as.character(x$mode)
# Create binary variable
x$y <- NA
x$y[x$mode=="charter"] <- 1
x$y[x$mode=="pier"] <- 0
x$ln_relp <- log(x$pcharter/x$ppier) #natural log
# Create demand curve
dy <- matrix(NA,10,2)
price_old <- min(x$ln_relp) - 0.00000001
for (i in 1:9) {
price_new <- quantile(x$ln_relp,i/10)
dy[i,1] <- price_new
dy[i,2] <- mean(x[x$ln_relp > price_old & x$ln_relp <= price_new,]$y, na.rm = TRUE) # include the upper bin edge
price_old <- price_new
}
dy[10,1] <- max(x$ln_relp)
dy[10,2] <- mean(x[x$ln_relp > price_old,]$y, na.rm = TRUE)
plot(dy[,2],dy[,1],main="Demand for Charter vs Pier Fishing",xlab = "share", ylab = "log rel price")
# Demand by income
# NOTE: x$ind (the index) is not defined in this listing. As a stand-in
# (an assumption), use the negative log relative price when it is missing,
# so that -x$ind is the log relative price.
if (is.null(x$ind)) x$ind <- -x$ln_relp
inc_cut <- quantile(x$income,0.8) # 80th percentile of income
dy3 <- matrix(NA,10,3)
price_old <- min(-x$ind) - 0.00000001
for (i in 1:9) {
price_new <- quantile(-x$ind,i/10)
dy3[i,1] <- price_new
dy3[i,2] <- mean(x[-x$ind > price_old & -x$ind <= price_new & x$income < inc_cut,]$y, na.rm = TRUE) # low income
dy3[i,3] <- mean(x[-x$ind > price_old & -x$ind <= price_new & x$income > inc_cut,]$y, na.rm = TRUE) # high income
price_old <- price_new
}
dy3[10,1] <- max(-x$ind)
# The income condition belongs inside the row subset, not outside it
dy3[10,2] <- mean(x[-x$ind > price_old & x$income < inc_cut,]$y, na.rm = TRUE)
dy3[10,3] <- mean(x[-x$ind > price_old & x$income > inc_cut,]$y, na.rm = TRUE)
plot(dy3[,2],dy3[,1],type="p",main="Demand for Charter vs Pier Fishing",xlab = "share", ylab = "index")
points(dy3[,3],dy3[,1],col="red",pch=2)
legend(0.7, 0.1, c("High Income","Low Income"), col = c("red","black"), pch = c(2,1))
# Berry estimate
lm_iv <- lm(exp(ln_relp) ~ income, data = x) # intent-to-treat
summary(lm_iv)
lm_iv2 <- lm(y ~ income, data = x) # first stage
summary(lm_iv2)
iv1 <- coef(lm_iv)[2]/coef(lm_iv2)[2] # simple iv: intent-to-treat over first stage
inv_iv1 <- 1/iv1 # invert to get prices on shares.
inv_iv1
lm5 <- lm(y ~ exp(ln_relp), data = x) # for comparison purposes.
summary(lm5)