It’s A Cluster!

Say you want to estimate the treatment effect, but the data has unobserved confounders. Luckily for you, these confounders are correlated with observed groups. You can use fixed effects!

Say you want to determine the effect of taking algebra has on lifetime earnings. You have administrative data on earnings and algebra classes. But kids at certain schools may have more access to algebra classes and lifetime earnings. Using school fixed effects may alleviate the confounding.

OK, you have estimated your fixed effects now you want to say something about the uncertainty around your treatment effect estimate. What do you do? Do you report standard standard errors? Robust standard errors? Sandwich standard errors? The last one is a trick question, because sandwich standard errors are just robust standard errors with mayo added.

Cluster! You should consider clustering your standard errors at the group level, at the school level in our example. One reason for doing this is the argument that unobserved characteristics are not distributed independently within the group. But what if they are? Should you still cluster? Its a cluster!

The figure above presents the distribution of the simulated estimated standard errors under different assumptions. The gray line gives the actual standard deviation of the estimated coefficient in the simulations. Standard standard errors tend to estimate too high, as do robust standard errors, clustered standard errors are more correct on average but vary a lot! Note that in the simulations the distributions are different by group, but there is independence within the group. The simulation and code is presented below.

What is the Correct Error to Report?

The variance of our estimate is a function of the variance of the unobserved term. We can and will invoke the Central Limit Theorem. The CLT tells us that when there is a large number of large random samples, the estimate from each of these samples will be distributed normally with a mean equal to the true value and a variance equal to the true variance of the unobserved term divided by the number of observations in the sample. The only problem with invoking the Central Limit Theorem is that we don't know the true value or the true variance. So we will approximate the distribution of the sample estimate by assuming that it is distributed normally with a mean equal to the observed estimate and a variance equal to the observed variance.

The figure on the left compares the simulated distribution of the estimated treatment effect to the distribution of the estimate assuming a normal distribution with mean equal to the true value of the treatment and standard deviation equal to the mean of the estimated standard deviation from one of the three standard methods. The black squiggly line is the true distribution of the simulated estimate, then the three estimated distributions all lie on top of each other.

The figure on the right is a sample of the estimated distribution of the treatment effect from the simulations. The black line is the density plot of the simulated estimates of the treatment effect. The red line is the estimated distribution where the mean and standard deviation are set at the average from the simulations. Any pink line is the estimated distribution for a particular simulation given the estimated treatment effect and the estimated standard deviation. If you are wondering how to interpret the figure, it says that your estimate of the distribution of the estimate could be all over the effing place.

I don't remember metrics professors showing the picture on the right.

Standard Standard Errors

But which observed variance?

The unobserved characteristic is simply the difference between the observed outcome and the predicted outcome. It is the residual.

If we are willing to assume that the unobserved term for each observation comes from the same distribution, then we can estimate the variance using the sum of the squared residuals.

Here is the basic model from Abadie et al (2017).

R Code

# create a data set with an unobserved term that varies by group. The unobserved term causes confounding.

set.seed(123456789)

a <- 2

b <- -3 # treatment effect

J <- 100 # number of groups

N_j <- 10 # size of groups

sigma <- 2

sigma_up <- 2

group <- as.vector(t(matrix(rep(1:J, N_j), nrow=J)))

upsilon <- rnorm(J, sd=sigma_up) # unobserved term affecting group

Upsilon <- t(matrix(rep(upsilon, N_j), nrow=J))

x <- runif(J*N_j) + 0.2*as.vector(Upsilon)

X <- matrix(x, nrow=N_j)

epsilon <- rnorm(J*N_j, sd=sigma) # unobserved term.

Y <- a + b*X + Upsilon + matrix(epsilon, nrow=N_j)

y <- as.vector(Y)

data1 <- data.frame(y = y, x = x, group = group)


Group <- matrix(0, J*N_j, J)

# create fixed effects

for (j in 1:J) {

Group[group==j, j] <- 1

}


# OLS

X1 <- cbind(x, Group)

beta_hat <- solve(t(X1)%*%X1)%*%t(X1)%*%y

epsilon_hat <- y - X1%*%beta_hat


# Standard standard error

sigma2_hat <- sum(epsilon_hat^2)/(J*N_j - (J+1))

Sigma2_hat <- solve(t(X1)%*%X1)*sigma2_hat

Sigma2_hat[1,1]^(0.5)

[1] 0.2361346

lm1 <- lm(y ~ x + Group -1, data=data1)

summary(lm1)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

x -2.59565 0.23613 -10.992 < 2e-16 ***

Sandwich Errors

Peanut butter and jelly is obviously a major sandwich error, but that is not what we mean. The "sandwich" refers to the linear algebra and what it looks like. The economist, Hal White, suggested that instead of assuming that the unobserved characteristic for each observation comes from the same distribution, assume each unobserved characteristic is drawn from a different distribution.

The analogy estimate of the variance for a particular observation is um, zero. White suggests it is the square of the residual for that observation. Which he acknowledges is not going to give a very accurate estimate.

Nevertheless, White notes we don't need throw out the baby with the bath water. Our standard assumption of independent and identical distribution is dropped but the independence assumption is kept. White also points out that we only need the "average" of the individual estimate.

Note that if we are going to use sandwich standard errors we should probably make an adjustment to account for the fact that our estimate of the variance is based on our estimated of the fixed effects. We call this a degrees of freedom adjustment.

# Sandwich standard error

Omega_hat <- diag(as.vector(epsilon_hat^2))

Sigma2_hat2 <- solve(t(X1)%*%X1)%*%t(X1)%*%Omega_hat%*%X1%*%solve(t(X1)%*%X1)

Sigma2_hat2[1,1]^(0.5)

[1] 0.2219312

df_adj <- (N_j*J)/(N_j*J - (1 + J))

(Sigma2_hat2[1,1]*df_adj)^(0.5)

[1] 0.2340661

# compare to the sandwich library results.

require(sandwich)

vcov(lm1)[1,1]^(0.5)

[1] 0.2361346 # standard standard error

vcovHC(lm1, type="HC0")[1,1]^(0.5)

[1] 0.2219312

vcovHC(lm1, type="HC1")[1,1]^(0.5)

[1] 0.2340661

Note that in this particular simulation the sandwich standard errors are smaller than the standard standard errors.

Clustered Standard Errors

It is said that clustered standard errors are just a generalization of sandwich standard errors. I'm not sure what they mean. Anyways.

Instead of allowing each unbserved characteristic to come from a different distribution, we restrict it so that all the unobserved characteristics associated with the same group, e.g. the same school, come from the same distribution. Moreover, within the group we will drop the independence assumption.

# Cluster Standard Errors

Omega_hat2 <- matrix(0,J+1,J+1)

for (j in 1:J) {

index_j <- which(X1[,j+1]==1)

epsilon_hat_j <- epsilon_hat[index_j]

X_j <- X1[index_j,]

Omega_hat2 <- Omega_hat2 + t(X_j)%*%epsilon_hat_j%*%t(epsilon_hat_j)%*%X_j

}

Sigma2_hat3 <- solve(t(X1)%*%X1)%*%Omega_hat2%*%solve(t(X1)%*%X1)

Sigma2_hat3[1,1]^(0.5)

[1] 0.2322667

((J/(J-1))*Sigma2_hat3[1,1])^(0.5)

[1] 0.2334368

# compare to the clubSandwich library results.

require(clubSandwich)

(vcovCR(lm1, cluster=group, type="CR0")[1,1])^(0.5)

[1] 0.2322667

(vcovCR(lm1, cluster=group, type="CR1")[1,1])^(0.5)

[1] 0.2334368

Monte Carlo Simulation

# Monte Carlo Simulation

K <- 10000

beta_mat <- Sigma2_mat <- Sigma2_mat2 <- Sigma2_mat3 <- matrix(NA,K,1)

for (k in 1:K) {

upsilon <- rnorm(J, sd=sigma_up)

Upsilon <- t(matrix(rep(upsilon, N_j), nrow=J))

x <- runif(J*N_j) + 0.2*as.vector(Upsilon)

X <- matrix(x, nrow=N_j)

epsilon <- rnorm(J*N_j, sd=sigma)

Y <- a + b*X + Upsilon + matrix(epsilon, nrow=N_j)

y <- as.vector(Y)

lm1 <- lm(y ~ x + as.factor(group) - 1)

Group <- matrix(0, J*N_j, J)

for (j in 1:J) {

Group[group==j, j] <- 1

}

X1 <- cbind(x, Group)

beta_hat <- solve(t(X1)%*%X1)%*%t(X1)%*%y

epsilon_hat <- y - X1%*%beta_hat

beta_mat[k] <- beta_hat[1]

Sigma2_hat <- solve(t(X1)%*%X1)*sum(epsilon_hat^2)/(J*N_j - (J+1))

Sigma2_mat[k] <- Sigma2_hat[1,1]

print(Sigma2_hat[1,1] - vcov(lm1)[1,1]) # checks that there is not a big difference between algos.

Omega_hat <- diag(as.vector(epsilon_hat^2))

Sigma2_hat2 <- solve(t(X1)%*%X1)%*%t(X1)%*%Omega_hat%*%X1%*%solve(t(X1)%*%X1)

Sigma2_mat2[k] <- df_adj*Sigma2_hat2[1,1]

print(df_adj*Sigma2_hat2[1,1] - vcovHC(lm1, type="HC1")[1,1])

Omega_hat2 <- matrix(0,J+1,J+1)

for (j in 1:J) {

index_j <- which(X1[,j+1]==1)

epsilon_hat_j <- epsilon_hat[index_j]

X_j <- X1[index_j,]

Omega_hat2 <- Omega_hat2 + t(X_j)%*%epsilon_hat_j%*%t(epsilon_hat_j)%*%X_j

}

Sigma2_hat3 <- solve(t(X1)%*%X1)%*%Omega_hat2%*%solve(t(X1)%*%X1)

Sigma2_mat3[k] <- (J/(J-1))*Sigma2_hat3[1,1]

print((J/(J-1))*Sigma2_hat3[1,1] - vcovCR(lm1, cluster=group, type="CR1")[1,1])

print(k/K)

}

I highly recommend playing around with the group sizes and the number of groups to see what happens to the standard error estimates. In this particular simulation the standard deviation of the estimated parameter is smaller than the average of the estimated standard error, although the average clustered error is closest. But note the variance in the estimated standard errors. The clustered standard error estimate has much more variance than the other approaches.

> sd(beta_mat)

[1] 0.2283741

> mean(sqrt(Sigma2_mat)) # standard

[1] 0.2308837

> mean(sqrt(Sigma2_mat2)) # robust

[1] 0.2307058

> mean(sqrt(Sigma2_mat3)) # clustered

[1] 0.2300964

> sd(sqrt(Sigma2_mat))

[1] 0.006581632

> sd(sqrt(Sigma2_mat2))

[1] 0.008459119

> sd(sqrt(Sigma2_mat3))

[1] 0.01771324


Plot Code

x1 <- as.vector(sort(beta_mat))

plot(density(beta_mat, bw=0.005), col="black", axes=FALSE, xlab="beta_hat", ylab="", main="", lwd=2)

lines(x1, dnorm(x1, mean=b, sd=mean(sqrt(Sigma2_mat))), col="red", lwd=2)

lines(x1, dnorm(x1, mean=b, sd=mean(sqrt(Sigma2_mat2))), col="green", lwd=2)

lines(x1, dnorm(x1, mean=b, sd=mean(sqrt(Sigma2_mat3))), lwd=2,col="blue")

axis(1, col="gray")

legend("topright", c("MC", "Standard", "Robust", "Cluster"), col=1:4, lwd=2, box.lwd = 0)


plot(density(beta_mat, bw=0.005), col="black", axes=FALSE, xlab="beta_hat", ylab="", main="", lwd=2)

lines(x1, dnorm(x1, mean=mean(beta_mat), sd=mean(sqrt(Sigma2_mat))), col="red", lwd=2)

for (k in 1:50) {

lines(x1, dnorm(x1, mean=beta_mat[k], sd=sqrt(Sigma2_mat[k])), col="pink", lwd=1)

}

axis(1, col="gray")



plot(density(beta_mat, bw=0.005), col="black", axes=FALSE, xlab="beta_hat", ylab="", main="", lwd=2)

lines(x1, dnorm(x1, mean=mean(beta_mat), sd=mean(sqrt(Sigma2_mat2))), col="green", lwd=2)

for (k in 1:50) {

lines(x1, dnorm(x1, mean=beta_mat[k], sd=sqrt(Sigma2_mat2[k])), col="light green", lwd=1)

}

axis(1, col="gray")


plot(density(beta_mat, bw=0.005), col="black", axes=FALSE, xlab="beta_hat", ylab="", main="", lwd=2)

lines(x1, dnorm(x1, mean=mean(beta_mat), sd=mean(sqrt(Sigma2_mat3))), col="blue", lwd=2)

for (k in 1:50) {

lines(x1, dnorm(x1, mean=beta_mat[k], sd=sqrt(Sigma2_mat3[k])), col="light blue", lwd=1)

}

axis(1, col="gray")


plot(density(Sigma2_mat3^(0.5)), col="blue", main="", xlab="standard error estimate", axes=FALSE, ylab="", lwd=3, ylim=c(0,60))

lines(density(Sigma2_mat2^(0.5)), lwd=3, col="green")

lines(density(Sigma2_mat^(0.5)), lwd = 3, col="red")

abline(v=sd(beta_mat), lwd=3, col="gray")

text(0.20,57, "True (simulated)", col="gray")

axis(1, col="gray")

legend("topright", c("Standard", "Robust", "Cluster"), col=2:4, lwd=3, box.lwd = 0)

Left is the sample of estimated parameter distributions with robust standard errors and the right figure is with clustered standard errors.