Linear regression, often called "ordinary least squares" (OLS), is probably the most commonly used method in statistical analysis. The chart above presents a scatter plot of students from the 1966 NLS. Each circle represents a student in the data, positioned by the student's years of completed education and the log of their wages in 1976. The straight line through the plot is the linear regression line. It has a vertical intercept of 5.6 and a slope of 0.05. The line implies that each additional year of education is associated with approximately a 5% increase in wages.
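Because the outcome is measured in logs, the slope can be translated into a percentage change in wages. A quick check of that interpretation, using the intercept and slope quoted above:

```r
# A slope of 0.05 on log wages means one more year of education
# multiplies wages by exp(0.05).
exp(0.05)                # about 1.051, i.e. roughly a 5.1% increase
# Predicted log wages from the fitted line at 12 and 16 years:
5.6 + 0.05 * 12          # 6.2
5.6 + 0.05 * 16          # 6.4
```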
The line is the one that minimizes the sum of the squared differences between the predicted values and the actual values. Thus "least squares".
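To make "least squares" concrete, the sketch below minimizes the sum of squared residuals numerically with optim() and confirms the answer matches lm(). It uses simulated education and log-wage data, not the NLS file, so it runs on its own:

```r
set.seed(1)
ed <- sample(8:18, 200, replace = TRUE)          # simulated years of education
lw <- 5.6 + 0.05 * ed + rnorm(200, sd = 0.3)     # simulated log wages

ssr <- function(p) sum((lw - p[1] - p[2] * ed)^2)  # sum of squared residuals
fit_opt <- optim(c(0, 0), ssr, method = "BFGS")    # minimize the SSR directly
fit_lm  <- lm(lw ~ ed)                             # the same problem via lm()

fit_opt$par    # (intercept, slope) from the minimizer
coef(fit_lm)   # agrees to several decimal places
```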
The great benefit of this method is that it scales up easily. In particular, adding more observable characteristics is straightforward using standard software packages like R. That said, one should be cognizant of the strong assumptions being made and careful when interpreting the results.
We can interpret the results of linear regression as having implications for policy changes only if we are willing to accept two very strong assumptions. First, we must accept that the data is generated by a close approximation to a linear model. It is possible to think of the regression as a simple linear approximation of the data; however, if we are interpreting the results in order to make changes to policy variables, this assumption interacts in important ways with our second assumption. The second assumption is called "unconfoundedness". It states that once we have included the variables in the linear model, the remaining "error", the unexplained variation, is unrelated to our policy variables. It behaves like random measurement error.
In other words, the unobserved characteristics must enter separably and be independent of the policy variable.
Is it the case that adding one more year of schooling has the same effect on income when a student moves from 11 years to 12 years as when they move from 15 years to 16 years? Even if we account for observed characteristics of the students, such as parental income, gender, and region of the country, are there still other unobserved reasons for getting more education that may be correlated with income? Does increasing education affect students in different ways? Is it more effective for students who are more "intelligent" or more "driven"?
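The last two questions concern heterogeneous effects. A small simulation with hypothetical data (not the NLS) illustrates one reassuring case: when the effect of education varies across students but is independent of how much education they get, lm() recovers the average effect:

```r
set.seed(2)
n <- 100000
drive   <- runif(n)                    # unobserved "drive", here independent of education
slope_i <- 0.03 + 0.04 * drive         # individual effect varies from 0.03 to 0.07, mean 0.05
ed <- sample(8:18, n, replace = TRUE)  # education assigned independently of drive
lw <- 5 + slope_i * ed + rnorm(n, sd = 0.2)
b_avg <- coef(lm(lw ~ ed))["ed"]
b_avg                                  # close to the average effect, 0.05
```

If education were instead correlated with drive, the single regression slope would no longer equal the average effect.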
The graph on the left represents the unconfoundedness assumption. It shows that there is a causal relationship between X and Y, and that unobserved characteristics U directly affect Y. However, there is no arrow from U to X: the unobserved characteristics that affect Y do not affect X. As long as the effect of U on Y is essentially random, we can measure the relationship between X and Y.
The graph allows the possibility that X and U interact in their effect on Y. In that case, the measured relationship between X and Y may not represent the true relationship for different units affected by the policy.
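To see what goes wrong when the no-arrow-from-U-to-X condition fails, the simulation below (again hypothetical data) lets an unobserved characteristic raise both education and wages. The regression then overstates the true effect of 0.05:

```r
set.seed(3)
n <- 100000
u  <- rnorm(n)                        # unobserved characteristic
ed <- 12 + 2 * u + rnorm(n)           # U pushes education up: an arrow from U to X
lw <- 5 + 0.05 * ed + 0.3 * u + rnorm(n, sd = 0.2)  # U also raises wages
b_conf <- coef(lm(lw ~ ed))["ed"]
b_conf                                # roughly 0.17, well above the true 0.05
```

The regression attributes the wage boost from u to education, because the two move together in the data.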
# load data in from proximity.zip
# http://davidcard.berkeley.edu/data_sets.html
x <- read.delim("nls.dat",sep="",header=FALSE, stringsAsFactors = FALSE) # SAS File
y <- read.csv("names.csv",stringsAsFactors = FALSE,header = FALSE) # names.csv is constructed from the log file, a column of variable names.
colnames(x) <- as.vector(y$V1)
x$lwage76 <- as.numeric(x$lwage76)
# Linear Regression
lm1 <- lm(lwage76 ~ ed76, data = x)
summary(lm1)
# Plot
plot(x$ed76, x$lwage76, main="Log Wages on Education", xlab = "Education (Years)", ylab = "log wages")
abline(lm1) # Plots the linear regression (above) on the graph.
# Linear regression using qr.solve etc.
a <- cbind(1,x$ed76) # add a column of 1s in order to get an intercept.
b <- x$lwage76
a1 <- a[is.na(x$ed76) + is.na(x$lwage76)==0,] # remove NAs
b1 <- b[is.na(x$ed76) + is.na(x$lwage76)==0]
beta <- qr.solve(a1,b1) # least-squares solution of a1 %*% beta = b1
beta
library(MASS) # the ginv (generalized inverse) function is in the MASS library.
beta2 <- ginv(a1)%*%b1 # the expression %*% means matrix multiplication.
beta2
beta3 <- (solve(t(a1)%*%a1))%*%t(a1)%*%b1 # to invert a matrix in R use solve. t() is matrix transpose.
beta3
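As a sanity check that these three matrix routes all agree with lm(), here is a self-contained version on R's built-in mtcars data, so it runs without the NLS files:

```r
library(MASS)
X <- cbind(1, mtcars$wt)                     # design matrix with an intercept column
y <- mtcars$mpg
b_qr   <- qr.solve(X, y)                     # least squares via the QR decomposition
b_ginv <- ginv(X) %*% y                      # via the Moore-Penrose pseudoinverse
b_ne   <- solve(t(X) %*% X) %*% t(X) %*% y   # via the normal equations
b_lm   <- coef(lm(mpg ~ wt, data = mtcars))
# All four give the same coefficients up to numerical precision.
```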