Correcting Standard Errors

This useful paper aims to help researchers understand what each method for estimating standard errors is actually doing.

Stata Programming Instructions

The standard command for running a regression in Stata is:

regress dependent_variable independent_variables, options

Clustered (Rogers) Standard Errors – One dimension

To obtain Clustered (Rogers) standard errors (and OLS coefficients), use the command:

regress dependent_variable independent_variables, robust cluster(cluster_variable)

This produces White standard errors which are robust to within-cluster correlation (clustered or Rogers standard errors). If you wanted to cluster by year, then the cluster variable would be the year variable. If you wanted to cluster by industry and year, you would need to create a variable which has a unique value for each industry-year pair. These standard errors would allow observations in the same industry and year (i.e. different firms) to be correlated, but would assume that observations in the same industry but in different years are uncorrelated. To allow observations which share an industry or share a year to be correlated, you need to cluster by two dimensions (industry and year). Instructions for clustering in two dimensions appear further below.
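As a minimal sketch of the industry-year case described above, the combined identifier can be built with the group() function of egen. The example uses Stata's nlswork example data set purely as a stand-in (it tracks individuals rather than firms), and the name industry_year is a placeholder chosen for illustration.

```stata
* Assumption: nlswork is a stand-in data set; ind_code plays the role of industry
webuse nlswork, clear

* Build one identifier with a distinct value for each industry-year pair
* (observations with a missing industry or year get a missing identifier)
egen industry_year = group(ind_code year)

* Cluster on the combined identifier
regress ln_wage tenure ttl_exp, robust cluster(industry_year)
```

With your own data, replace ind_code, year, and the regression variables with the corresponding names in your data set.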

For most estimation commands such as logits and probits, the previous form of the command will also work. For example, to run a logit with clustered standard errors you would use the command:

logit dependent_variable independent_variables, robust cluster(cluster_variable)

Clustered Standard Errors – Two dimensions

The routines currently written into Stata allow you to cluster by only one variable (e.g. one dimension such as firm or time). Papers by Thompson (2006) and by Cameron, Gelbach and Miller (2006) suggest a way to account for multiple dimensions at the same time. This approach allows for correlations among different firms in the same year and among different years for the same firm, for example. See their papers and mine for more details and caveats. I have written a Stata ado file to implement this estimation procedure. It runs a regression and calculates standard errors which account for two dimensions of within-cluster correlation. The variables which record the two dimensions (e.g. a firm identifier and a time identifier) are specified in the required options: fcluster( ) and tcluster( ). There are also versions of the Stata ado file that estimate logit (logit2.ado), probit (probit2.ado), or tobit (tobit2.ado) models with clustering on two dimensions. Their format is similar to that of the cluster2.ado command:

cluster2 dependent_variable independent_variables, fcluster(cluster_variable_one) tcluster(cluster_variable_two)

If there are multiple observations per firm-year (e.g. loan data sets which have multiple loans per firm in a given year), then the method described in my paper needs to be modified. In this case, instead of subtracting off the White variance matrix, you need to subtract off the variance matrix clustered by firm-year (i.e. for correlation among observations with the same firm AND the same year -- see Cameron, Gelbach, and Miller (2006) for details). The program has been modified to automatically check for this condition and use the correct third matrix. The program is also now compatible with the outreg procedure.
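The subtraction described above can be sketched by hand with built-in commands. This is an illustration of the idea only, not the code inside cluster2.ado: it assumes one observation per firm-year (so the White matrix is the one subtracted), uses Stata's grunfeld example data as a stand-in, and relies on regress's default small-sample adjustments, so the numbers need not match cluster2 exactly. The matrix names are placeholders.

```stata
* Assumption: grunfeld stands in for a firm-year panel (company = firm)
webuse grunfeld, clear

* Variance matrix clustered by firm
regress invest mvalue kstock, vce(cluster company)
matrix V_firm = e(V)

* Variance matrix clustered by time
regress invest mvalue kstock, vce(cluster year)
matrix V_time = e(V)

* White (heteroskedasticity-robust) variance matrix
regress invest mvalue kstock, vce(robust)
matrix V_white = e(V)

* Two-way clustered variance: firm + time - White
matrix V_twoway = V_firm + V_time - V_white

* Standard error on the first regressor (mvalue)
display "Two-way clustered SE on mvalue: " sqrt(V_twoway[1,1])
```

Nothing in the subtraction guarantees a positive diagonal, so a missing value from sqrt() here would signal a negative variance estimate rather than a coding error.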

Code for estimating clustered standard errors in two dimensions has also been written in SAS and MATLAB by Ian Gow, Gaizka Ormazabal, and Daniel Taylor.

Fama-MacBeth Standard Errors

Stata does not contain a routine for estimating the coefficients and standard errors by Fama-MacBeth (that I know of), but I have written an ado file which you can download. The ado file fm.ado runs a cross-sectional regression for each year in the data set. The program allows you to specify a by variable for Fama-MacBeth. Thus, instead of running T cross-sectional regressions, you could run N time-series regressions by specifying the firm identifier as the byfm( ) variable. If the option is not specified, it uses the time variable (as set by the tsset command) as the by variable. The program is also now compatible with the outreg procedure.

The form of the command is:

fm dependent_variable independent_variables, byfm(by_variable)

Prior to running the fm program, you need to use the tsset command. This tells Stata the name of the firm identifier and the time variable. The form of this command is:

tsset firm_identifier time_identifier

The program will accept the Stata in and if qualifiers, if you want to run the regression for only certain observations. Judson Caskey, who showed me how to use the tsset command in the FM program, has also modified the program. His version reports the number of positive or negative coefficients and the number which are significant (and positive or negative). Another version (xtfmb.ado) has been written by Daniel Hoechle. To install this ado file from within Stata, type net search xtfmb. A full description is in the help file.
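To make the Fama-MacBeth calculation concrete, here is a rough sketch using only built-in commands. It is not the code inside fm.ado; it assumes Stata's grunfeld example data as a stand-in panel and simply averages the yearly cross-sectional coefficients.

```stata
* Assumption: grunfeld stands in for a firm-year panel
webuse grunfeld, clear

* Run one cross-sectional regression per year; the data in memory are
* replaced by one row of coefficients per year (_b_mvalue, _b_kstock, _b_cons)
statsby _b, by(year) clear: regress invest mvalue kstock

* Fama-MacBeth coefficient = mean of the yearly coefficients;
* the reported standard error is their standard deviation divided by sqrt(T)
mean _b_mvalue _b_kstock _b_cons
```

Replacing by(year) with by(company) gives the N time-series regressions described above.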

Newey West for Panel Data Sets

The Stata command newey will estimate the coefficients of a regression using OLS and generate Newey-West standard errors. If you want to use this in a panel data set (so that only observations within a cluster may be correlated), you need to use the tsset command.

tsset firm_identifier time_identifier

newey dependent_variable independent_variables, lag(lag_length) force

Where firm_identifier is the variable which denotes each firm (e.g. cusip, permno, or gvkey) and time_identifier is the variable that identifies the time dimension, such as year. This specification will allow for observations on the same firm in different years to be correlated (i.e. a firm effect). If you want to allow for observations on different firms but in the same year to be correlated, you need to reverse the firm and time identifiers. If you are clustering on some other dimension besides firm (e.g. industry or country), you would use that variable instead. You can specify any lag length up to T-1, where T is the number of years per firm.
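A short worked example of the two commands above, assuming Stata's grunfeld example data as a stand-in panel (company is the firm identifier) and an arbitrarily chosen lag length of 2:

```stata
* Assumption: grunfeld stands in for a firm-year panel
webuse grunfeld, clear

* Declare the panel structure: firm identifier first, then time identifier
tsset company year

* OLS coefficients with Newey-West standard errors; force is needed
* because the data are declared as a panel
newey invest mvalue kstock, lag(2) force
```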

Fixed Effects

Stata can automatically include a set of dummy variables, one for each value of a specified variable.

The form of the command is:

areg dependent_variable independent_variables, absorb(identifier_variable)

Where identifier_variable is a firm identifier (e.g. cusip, permno, or gvkey) if you want firm dummies or a time identifier (e.g. year) if you want year dummies. If you want to include both firm and time dummies, only one set can be included with the absorb option. The other must be included manually (e.g. by including a full set of time dummies among the independent variables, and then using the absorb option for the firm dummies).

To create a full set of dummy variables from an indexed variable such as year you can use the following command:

tabulate index_variable, gen(dummy_variable)

This will create a set of dummy variables (e.g. dummy_variable1, dummy_variable2, etc.), one for each distinct value of index_variable. For example, dummy_variable1 equals one if index_variable takes on its first value and zero otherwise.

A more elegant way to do this is to use the xi command (as recommended by Prof Nandy). This allows you to include a set of dummy variables for any categorical variable (e.g. year or firm), including multiple categorical variables. To include both year and firm dummies, the command is:

xi: areg dependent_variable independent_variables i.year, absorb(firm_identifier)

where year is the categorical variable for year and firm_identifier is the categorical variable for firm. The coefficients on T-1 of the year variables will be reported, the coefficients on the firm dummy variables will not. To see the coefficients on both sets of dummy variables you would use the command:

xi: reg dependent_variable independent_variables i.year i.firm_identifier
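In Stata 11 and later, factor-variable notation (the i. prefix) is understood by areg and regress directly, so the xi: prefix can be dropped. A brief sketch, assuming Stata's grunfeld example data as a stand-in panel:

```stata
* Assumption: grunfeld stands in for a firm-year panel
webuse grunfeld, clear

* Year dummies via factor-variable notation, firm dummies absorbed
areg invest mvalue kstock i.year, absorb(company)

* Both sets of dummy coefficients reported
regress invest mvalue kstock i.year i.company
```

The slope coefficients on mvalue and kstock are the same in both regressions; only the reporting of the firm dummies differs.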

Generalized Least Squares

When the residuals are correlated within a cluster, not only are the OLS standard errors biased but the slope coefficients are not efficient. One method for taking advantage of the additional information in the residuals (and generating more efficient estimates) is to estimate a random effects model using a generalized least squares approach. I used the xtreg command to estimate the GLS results reported in the paper.

The form of the command is:

xtreg dependent_variable independent_variables, i(firm_identifier)

As with the regress command, standard errors which are robust to within-cluster correlation can be produced by including the option cluster(firm_identifier):

xtreg dependent_variable independent_variables, i(firm_identifier) cluster(firm_identifier)
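In recent versions of Stata the same model is usually written by declaring the panel with xtset and requesting the random effects estimator explicitly. A minimal sketch, assuming Stata's grunfeld example data as a stand-in panel:

```stata
* Assumption: grunfeld stands in for a firm-year panel
webuse grunfeld, clear

* Declare the panel variable (and, optionally, the time variable)
xtset company year

* Random effects GLS with standard errors clustered by firm
xtreg invest mvalue kstock, re vce(cluster company)
```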

Bootstrapped Standard Errors

The Stata command bootstrap will allow you to estimate the standard errors using the bootstrap method. This will run the regression multiple times and use the variability in the slope coefficients as an estimate of their standard deviation (intuitively like I did with my simulations).

The form of this command is:

bootstrap "regress dependent_variable independent_variables" _b, reps(number_of_repetitions)

Where number_of_repetitions samples will be drawn with replacement from the original sample. Each time, the regression will be run and the slope coefficients will be saved, since _b is specified. Both the average slope and its standard deviation will be reported. As specified, the bootstrapped samples will be drawn a single observation at a time. If the observations within a cluster (year or firm) are correlated, then these bootstrapped standard errors will be biased. To account for the correlation within cluster, it is necessary to draw clusters with replacement, as opposed to observations with replacement. To do this in Stata, you need to add the cluster option. In this case, the command is:

bootstrap "regress dependent_variable independent_variables" _b, reps(number_of_repetitions) cluster(cluster_variable)
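The quoted-command form above is the older bootstrap syntax. From Stata 9 onward, bootstrap is normally written as a prefix command. A sketch of the clustered version in that form, assuming Stata's grunfeld example data as a stand-in panel, with an arbitrarily chosen number of repetitions and seed:

```stata
* Assumption: grunfeld stands in for a firm-year panel
webuse grunfeld, clear

* Resample whole firms (clusters) with replacement, 200 times;
* seed() makes the draws reproducible
bootstrap _b, reps(200) seed(12345) cluster(company): regress invest mvalue kstock
```

Dropping the cluster(company) option gives the observation-by-observation bootstrap described first.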
