stata

Stata links

FILE MANAGEMENT

Gentzkow and Shapiro (2014) “Code and Data for the Social Sciences: A Practitioner’s Guide.” - I strongly recommend reading this before embarking on your very first empirical research project. The guide introduces many useful data management concepts developed in computer science, which will save you a great deal of time over the long course of an empirical research project in economics. The most important are Chapters 2, 4 and 5, which help you organize your data files and your many Stata do-files (no joke: by the time you publish your empirical paper, you will have accumulated a huge number of them).

TUTORIALS

Essam and Hughes (2016) Stata Cheat Sheets --- All the important Stata commands at a glance. (HT: Marc Bellemare)

Lembcke (2009) “Introduction to Stata” and “Advanced Stata Topics” --- These are the Stata course lecture notes for PhD students at the Department of Economics, LSE. Since 2004, each year’s course instructor has updated and expanded them. I took the course in 2004, but the current version of the lecture notes covers much more than what I learned in the course. You will learn a lot from them. In particular, “Advanced Stata Topics” touches on how to write and publish your own Stata programme, maximum likelihood estimation in Stata, and how to use Mata (Stata’s matrix programming language), topics that are usually not covered in a Stata course for economists.

Using Stata to Analyze Survey Data by Nicholas Minot (IFPRI): This is an excellent introduction to Stata specifically tailored for would-be development economists.

Maybe useful:

A. Colin Cameron and P. K. Trivedi Microeconometrics: Methods and Applications

Germán Rodriguez “Stata Tutorial” Princeton University

Phil Bardsley, Kim Chantala, and Dan Blanchette "Stata Tutorial" University of North Carolina at Chapel Hill

Stata Starter Kit by UCLA Academic Technology Service

INTRODUCTION

What Stata can/can't do by A. Colin Cameron (Dept. of Economics, University of California, Davis)

ADO FILES

To install an ado file, type "ssc install xxx" (where xxx should be replaced with the name of the ado file) in your Stata interactive session.
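As a minimal sketch of the installation step, the lines below install one package and check that Stata can find it. The choice of estout is only an example, and an internet connection is assumed.

```stata
* Install a user-written package from SSC (estout is just an example)
ssc install estout, replace

* Show where Stata found the installed command on the ado-path
which esttab

* Open the help file that ships with the package
help esttab
```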

DO FILES

Writing do-files is essential because it allows other researchers to replicate your empirical analysis. It has increasingly become the norm among empirical researchers to post online the Stata do-files used to produce the results in published papers. Here are some websites on how to write do-files.

Michael S. Hill (2015) "In Stata coding, Style is the Essential: A brief commentary on do-file style"

Stata Tutorial by Carolina Population Center, University of North Carolina at Chapel Hill

An Introduction to Stata by Aimee Chin at MIT

Stata section of Guide to Genetic Analysis by Centre for Integrated Genomic Medical Research (Links to example do-files are dead, but it contains some information on editor software.)

Using external text editors to write do files by Friedrich Huebler

RA Manual Notes on Writing Code, by Matthew Gentzkow and Jesse M. Shapiro (2012), offer the best practices in computer programming that are useful for writing Stata do files (and scripts for other software).

Stata help for timer: A useful command if you run a do-file containing a command that takes a very long time to execute (e.g. a regression with a lot of fixed effects).
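A minimal sketch of how timer might be wrapped around a slow command. The regression here is quick and serves only as a placeholder; the timer number 1 is an arbitrary choice.

```stata
sysuse auto, clear

* Reset timer 1, then switch it on around the command to be timed
timer clear 1
timer on 1
regress price mpg weight
timer off 1

* Report the elapsed time; timer list also leaves it in r(t1)
timer list 1
display "Elapsed seconds: " r(t1)
```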

If you use Stata/MP on cluster computing facilities, see Stata Help: statamp.

READING FILES

Every data analysis begins with opening a data file. First, look at this website for the jargon on data formats. (Its description of rectangular files is wrong, though.)

Stata Help infiling: Official guide on which command to use for reading different types of data.

Excel

Excel files can finally be imported by a Stata command: import excel.
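A minimal round-trip sketch, assuming a Stata version that provides import excel and export excel. The file name auto_demo.xlsx is hypothetical; it is created here from Stata's built-in auto dataset so that the example is self-contained.

```stata
* Create a small Excel file to read back in
sysuse auto, clear
keep make price mpg
export excel using "auto_demo.xlsx", firstrow(variables) replace

* Read it in, treating the first row as variable names
import excel using "auto_demo.xlsx", firstrow clear
describe
```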

For earlier versions of Stata to read an Excel file, follow this blog entry. Make sure to use the forward slash (/) rather than the backslash (\) for the path name. It should then work.

Stata

There is a useful ado program named USE10 which allows you to read the Stata version 10 data with Stata version 9. Type “ssc install use10” to install it.

SPSS

To read SPSS data files, use the usespss ado. (HT: David McKenzie.)

CSV

If each data entry is separated by a comma (called the CSV format), use INSHEET.

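If your data include an identification number with more than 7 digits, make sure you add the double option to the insheet command; otherwise the number may be stored as a float, which cannot represent every integer of that length exactly. Read Stata Help for data_type for details.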
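A minimal sketch of reading a CSV file with long identification numbers. The file ids_demo.csv and its contents are made up for illustration and are written by the code itself. In Stata 13 and later, import delimited supersedes insheet, but insheet is used here to match the text.

```stata
* Write a tiny CSV with 12-digit IDs (made-up values)
file open fh using "ids_demo.csv", write text replace
file write fh "id,income" _n
file write fh "123456789012,50" _n
file write fh "123456789013,60" _n
file close fh

* Read it in; the double option keeps long IDs exact
insheet using "ids_demo.csv", comma names double clear

* Check that the IDs are still distinct, then display them in full
isid id
format id %12.0f
list
```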

Tab-delimited

If the separator is a tab or a space, use INFILE.

Fixed format

If the data file is in the fixed format (no separator between data entries; entries are identified by column numbers), things are trickier. There are three cases:

(1) If it's a flat file (each single line represents one observation), see Stata: How to Write a Dictionary Program to Read Raw Data by the Electronic Data Service (EDS) at Columbia University;

(2) If it's a rectangular file (a fixed number of lines represents one observation), see "Example of a Program to Read Data with Multiple Records/Case" at the bottom of Stata: How to Write a Dictionary Program to Read Raw Data by the Electronic Data Service (EDS) at Columbia University;

(3) If it's a hierarchical file (a variable number of lines represents one observation, as in the World Fertility Surveys), see Stata: How to Read Hierarchical Files in Stata by the Electronic Data Service (EDS) at Columbia University.

From scratch

To create a dataset from scratch, first type “drop _all” and then type “set obs #” where # is the number of observations in this new dataset. Then create variables by the generate command etc. For a small dataset, you can use the INPUT command to directly enter the data.
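Both routes can be sketched as follows. The variable names and the typed-in values are made up purely for illustration.

```stata
* Route 1: an empty dataset filled with generated variables
drop _all
set obs 100
set seed 12345
generate id = _n
generate x = rnormal()
generate y = 1 + 2*x + rnormal()
summarize

* Route 2: type a small dataset directly with input (made-up values)
clear
input str10 country year score
"A" 2000 1.5
"A" 2001 1.7
"B" 2000 0.4
end
list
```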

Multiple files in the same directory

To read many files in the same directory and append them all, see Append Many Files by UCLA.

EDITING DATA STRUCTURE

Before starting to edit data itself, you need to edit the structure of data files: reshape, append, and merge.

RESHAPE: Whenever you use the datasets downloaded from World Development Indicators, you need to do this.

Using Stata's RESHAPE command, by Amy Yuen at Electronic Data Center of Emory University General Libraries
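A minimal sketch of the wide-to-long reshape that such downloads typically need, where each year sits in its own column. The countries, variable names, and numbers are made up.

```stata
* One row per country, one column per year (made-up values)
clear
input str10 country gdp2000 gdp2001 gdp2002
"A" 10 11 12
"B" 20 21 22
end

* Convert to one row per country-year; year is built from the column suffix
reshape long gdp, i(country) j(year)
list, sepby(country)
```

Typing reshape wide afterwards would restore the original layout.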

APPEND/MERGE: Good empirical research often relies on the use of two or more completely different datasets. So you need to append or merge different datasets before starting analysis.

ISID: When you want to merge two datasets which do not share a common unique identifier but do share some variables (e.g. birth date, birth region), the ISID command lets you check whether a certain set of variables uniquely identifies observations. See Stata Help on ISID.
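A minimal sketch of checking a candidate key with isid before merging, assuming Stata 11 or later for the merge 1:1 syntax. Both datasets and all values are made up; isid stops with an error if the listed variables do not uniquely identify observations.

```stata
* First dataset: no ID variable, but birth date and region (made-up values)
clear
input str10 birthdate str5 region wage
"1980-01-15" "North" 10
"1980-01-15" "South" 12
"1975-06-30" "North" 15
end
isid birthdate region
tempfile wages
save `wages'

* Second dataset sharing the same two variables
clear
input str10 birthdate str5 region schooling
"1980-01-15" "North" 12
"1980-01-15" "South" 9
"1975-06-30" "North" 16
end
isid birthdate region

* The key is unique in both files, so a one-to-one merge is appropriate
merge 1:1 birthdate region using `wages'
tabulate _merge
```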

Stata Tutorial Part 4: Manipulating Files, by Syracuse University Library

DATA PROCESSING

How to create dummy variables by Stata FAQs

Create a new dataset by hand by Carolina Population Centre, University of North Carolina

List of math functions by Stata Help - can be used in combination with the generate command to edit variables.

List of operators by Stata Help

Date variables by Data and Statistical Services, Princeton University --- This webpage tells you how to convert date variables into different formats (e.g. convert the variables of year, month, and day into one date variable etc.).

To categorize observations by percentile bins, use the command xtile. See this Statalist message.
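A minimal sketch using the built-in auto dataset. The variable name price_q and the choice of four bins are arbitrary.

```stata
sysuse auto, clear

* Assign each car to a quartile of price (1 = cheapest quarter)
xtile price_q = price, nquantiles(4)

* Inspect the bins and the price range within each
tabulate price_q
tabstat price, by(price_q) statistics(min max n)
```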

UNIQUE: Stata module to report number of unique values in variable(s) --- Sometimes this ado command is useful. For example, you may want to know whether a particular variable takes more than one value for each group of observations. To see the detail, type “ssc install unique” to install the ado file and then type “help unique” for its help.

REGEXM: useful if you want to identify observations whose string variable contains a particular set of letters.
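A minimal sketch with the built-in auto dataset, where make is a string variable. The two patterns are only examples.

```stata
sysuse auto, clear

* Flag cars whose make starts with "Buick"
generate byte is_buick = regexm(make, "^Buick")

* Flag cars whose make contains any digit
generate byte has_digit = regexm(make, "[0-9]")

list make if is_buick
count if has_digit
```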

Loop over all values of a particular variable: there is a lesser-known command LEVELSOF, which returns the list of all distinct values of the specified variable in r(levels) and, with the local() option, stores it in a local macro.
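A minimal sketch of such a loop with the built-in auto dataset. The macro name levels is an arbitrary choice.

```stata
sysuse auto, clear

* Store the distinct non-missing values of rep78 in the local macro levels
levelsof rep78, local(levels)

* Loop over those values one at a time
foreach v of local levels {
    quietly summarize price if rep78 == `v'
    display "rep78 = `v': mean price = " %9.2f r(mean)
}
```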

SUMMARY STATISTICS

ESTPOST - This is part of the ESTOUT ado file package, automating the process of creating a table of summary statistics. Highly recommended.

Section 6 (pages 33-43) of Using Stata for Survey Data Analysis by Nick Minot at IFPRI --- Very useful, especially if you are analyzing household survey data.

How to conduct a t-test for survey data, by UCLA Academic Technology Service --- Useful if each observation in your data needs to be weighted according to the sampling method. See also how to use the SVY command.

Generating Regression and Summary Statistics Tables in Stata: A Checklist and Code, by Matthew Groh (May, 2014) --- Provides an example do file that uses the MAT2TXT Stata module.

ESTIMATIONS

Overview of Stata estimation commands

Stata Textbook Examples: Econometric Analysis of Cross Section and Panel Data by Jeffrey M. Wooldridge, by UCLA Academic Technology Service --- First, find an example of the estimation method you want to conduct in Wooldridge's graduate econometrics textbook. Then go to this webpage to see which Stata command performs that estimation.

Beyond simple OLS estimation by UCLA Academic Technology Service - robust estimation, clustering, quantile regression, linear hypothesis testing, errors-in-variables regression (eivreg), censored/truncated data, SUR, multivariate regression, etc.

Fixed effects estimation

The XTREG command with the FE option (i.e. fixed effects estimation) has recently been modified. See what’s new in Stata 10 (items 4, 5, and 7 in particular) and in Stata 11 (the fourth bullet point in particular).

Fixed Effects Estimation (xtreg command with fe option) by Stata FAQ - explains why there is a constant term in the estimation result table.

Differences among within, between, and overall R-squared obtained by the xtreg, fe command by Justin Smith (15 August 2006)

R squared in Fixed Effects Estimation by Stata FAQ - explains why the reported R squared differs between xtreg, fe and areg. See also this note by Indiana University Information Technology Services. Theoretical background can be found in Hayashi's Econometrics textbook (pages 333-4), for example. (This issue seems to be moot now that the xtreg command has been improved in Stata version 10 and later.)

If you notice that the areg command and the xtreg command with the fe option produce different clustered standard errors from each other, read this.

Prais-Winsten panel regression: use the XTPCSE command. Examples include Rohlfs et al (2010).

Weighted least squares estimation

Weighted Least Squares when the variance of the error term is known by Stata Help

Choosing the Correct Weight Syntax by UNC Carolina Population Center - if you wonder what pweight, fweight, aweight, and iweight really mean.

Weighted Least Squares Regression by UCLA Academic Technology Service (See Deaton (1997) The Analysis of Household Surveys, pp.67-73, for the use of weighted least squares in the context of survey design.)

probit, logit, and other nonlinear regressions

MARGINS: a command introduced in version 11 (with MARGINSPLOT following in version 12), to report the average value of the predicted dependent variable at each specified value of the regressors (if I understand correctly). Useful for interpreting estimated coefficients from nonlinear regressions, as explained by SSCC at University of Wisconsin-Madison.
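A minimal sketch of margins after a probit, using the built-in auto dataset. The model and the grid of mpg values are only illustrative.

```stata
sysuse auto, clear

* Probit for whether a car is foreign
probit foreign mpg weight

* Average marginal effect of each regressor on the predicted probability
margins, dydx(*)

* Average predicted probability with mpg fixed at each listed value
margins, at(mpg = (15 20 25 30))
```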

INTEFF: this is an ado package to correctly estimate the magnitude and standard errors of the effect of an interaction term in nonlinear models such as probit and logit. See Ai and Norton (2003) for detail. This command, however, does not work if there are quite a few dummy variables as regressors. It seems the MARGINS and MARGINSPLOT commands supersede INTEFF.

Event study

How to conduct an event study estimation with Stata by Data and Statistical Services, Princeton University

Attrition bias

Lee (2009)’s treatment effects bounds. In the case of attrition bias, this method is now the industry standard. Now you can easily do it in Stata with the leebounds command.

Standard errors

Bootstrapping: See Lecture 4 (pages 6-8) in Programming in Stata, RLAB Data Service, London School of Economics.

X_OLS: Timothy Conley's standard error correction for spatial correlation. This is the standard way of calculating standard errors in the literature when you use the data where outcomes and regressors are spatially correlated.

Douglas Miller’s Stata code page contains a Stata do file to execute Cameron, Gelbach, and Miller (2008)’s Wild Bootstrap standard error clustering method, which is increasingly popular among applied microeconometric researchers when the number of clusters is small.

Matching estimation

CEM: Coarsened Exact Matching, by Iacus, King, and Porro (2008), for creating a control group whose observables are balanced against the treated group ex ante. Used by Azoulay, Zivin, and Wang (2010).

Matching Estimators ado file by Abadie, Drukker, Herr, and Imbens

Synth by Abadie, Diamond, and Hainmueller --- A method to estimate the treatment effect from observational data when only one unit is treated.

Pair-wise Mahalanobis matching with an optimal greedy algorithm: See page 209 of Bruhn and McKenzie (2009). This article’s replication data file (click “Download Data Set” on this webpage), contains a Stata code for this matching method.

AFTER EACH REGRESSION IS RUN...

How to interpret output tables that appear after executing commands such as summarize, regress, logistic, etc. by UCLA Academic Technology Service

reformat ado-file, by Sealed Envelope Ltd. - This ado-file is useful when you have tons of fixed effects (e.g. country dummies) and are interested in the coefficients on these dummies.

Stata Class 3, by Stas Kolenikov, Duke University - introduces commands after estimation for plotting residuals etc.

From version 10, you can save estimation results to disk with the command estimates save. As a result, it is no longer necessary to install the ESTSAVE ado.
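A minimal sketch of saving estimation results and loading them back later. The file name price_model is hypothetical; Stata adds the .ster extension.

```stata
sysuse auto, clear
regress price mpg weight

* Write the current estimation results to price_model.ster
estimates save "price_model", replace

* Later, even with no data in memory, load the results back
clear
estimates use "price_model"

* Redisplay the stored regression table
estimates replay
```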

parmest ado-file allows you to create a Stata data file of coefficient estimates along with t-values and p-values. By default, Stata does not store t-values and p-values after regressions. This ado-file is useful if you need to use t-values and/or p-values after each regression is run.

REPORT ESTIMATION RESULTS

ESTOUT - A great ado-file package to create a table of regression results in the text file format, the HTML format, or the TeX format! It's more versatile than OUTREG2 (see below). It is slightly complicated, but it's worth paying the fixed cost of learning how to use it. To minimize the fixed cost, follow these steps:

To install the package, see here.

First, learn how to use ESTSTO by reading this.

Then, learn how to use ESTTAB by reading this.

Only for fancier things to do, you need to learn ESTOUT (the more flexible version of ESTTAB) and ESTADD (the more flexible version of ESTSTO's ADDSCALARS option).

With the ESTOUT package, you can easily create a summary statistics table!

The ESTOUT package also allows you to include "YES" or "NO" to indicate whether a certain set of fixed effects are controlled for (a standard practice in labor economics type research). See this document.
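A minimal sketch of the basic workflow, assuming the estout package is already installed from SSC. The models and the formatting options are only illustrative.

```stata
* Assumes: ssc install estout
sysuse auto, clear
eststo clear

* Summary statistics table from the active estpost results
estpost summarize price mpg weight
esttab, cells("mean(fmt(2)) sd(fmt(2)) min max") nomtitle nonumber

* Store two regressions, then tabulate them side by side
eststo: regress price mpg
eststo: regress price mpg weight
esttab, se r2 star(* 0.10 ** 0.05 *** 0.01)
```

Adding using with a file name ending in .tex, .html, or .csv to the esttab call writes the table to a file in that format instead of printing it in the Results window.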

Generating Regression and Summary Statistics Tables in Stata: A Checklist and Code, by Matthew Groh (May, 2014) --- If you prefer creating regression tables in the Excel format.

TABOUT - Seems to be a very useful ado for automating the process of creating any kind of table formatted to appear in an academic paper. Example Stata do files mentioned in this tutorial can be downloaded at the author’s website.

OUTREG2.ado - An improved version of OUTREG.ado (see below). It's less versatile than ESTOUT, but it's more flexible in producing a TeX file. One problem is that, after fixed effects estimation (areg or xtreg, fe), the nocons option does not work.

How to use outreg.ado, by Kellogg Research Computing, Northwestern University - probably the most useful explanation of the outreg ado file, including a PDF of the outreg help file. When you want to use the addstat option for reporting more than 10 statistics, outreg does not work properly. A solution can be found here (Statalist archives). (If you want to further convert the resulting Excel file into a LaTeX format, download EXCEL2LATEX here and extract the downloaded zip file into "C:\Documents and Settings\username\Application Data\Microsoft\AddIns" (where "username" is your own username). Then open Excel, click "Tools - Add-Ins..." and check the box for Excel2Latex. You'll see a new small icon in the toolbars. Select the table you want to convert and then click the icon. Now you can create a TeX file of your table.)

How to report multinomial logit regression results with OUTREG, by Statalist

GRAPHICS

Online Tutorial for Making Graphs by Stata Corp. - An excellent website in the sense that you can choose the visual image (rather than picking the words like “bar graphs”, “scatter plots”, etc.) to learn how to make various types of graph.

How to make various types of graph (Follow links below the heading of "Graphics") by UCLA Academic Technology Service - Useful if you want to make the twoway graphs.

BY option for GRAPH command by Stata Help - this is how to make graphs for each category (e.g. country by country).

BINSCATTER - A Stata package written by Michael Stepner, which allows you to create a scatter plot from (literally) millions of observations, by grouping observations into several intervals of the x variable and plotting the average value of the y variable for each group. (HT: David Seim)

Nonparametric regression curve in a scatter plot - search for "nonparametric".

Draw kernel density functions for each group in the same graph by UCLA Academic Technology Service

Guide to creating PNG images with Stata by Friedrich Huebler

How to create animated graphics using Stata, by Chuck Huber.

How to create a map from Stata by Friedrich Huebler

Drawing social networks in Stata with Netplot by Rense Corten --- if you are analyzing social network data.

PROGRAMMING

Programming in Stata, RLAB Data Service, London School of Economics: these are lecture notes for a Stata course at Department of Economics, LSE. Lectures 3 to 5 deal with how to make your own program with Stata (macro, looping, ado-file, etc.). Very useful.

How to display variable labels: See this Statalist message by Nick Cox on 27 May, 2010.

The CAPTURE command is useful when executing a do file, especially when you want to conduct different data processing steps depending on whether there is an error (which can be expressed as “if _rc==0” in the Stata code). See the paragraphs below the heading “If as a Way to Control Program Flow” in this webpage.
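A minimal sketch of branching on the return code that capture leaves in _rc. The variable names log_price and no_such_variable are hypothetical.

```stata
sysuse auto, clear

* Create log_price only if it does not exist yet
capture confirm variable log_price
if _rc == 0 {
    display "log_price already exists; leaving it untouched"
}
else {
    generate log_price = ln(price)
}

* Let the do-file continue even if a command fails, and report the code
capture noisily regress price no_such_variable
if _rc != 0 {
    display "regression failed with return code " _rc
}
```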

How do I run Stata in batch mode? (Stata FAQ): if you want to run a do file without launching Stata interactively in Unix

TROUBLESHOOTING

If you always type “set memory 900m” after launching Stata because you use a large dataset, read this.

If you run Stata on Windows and encounter an error message "op. sys. refuses to provide memory, r(909)", you may want to consider ditching Windows. Here's why.

If you encounter an error message "insufficient disk space, r(699)", see this Stata FAQ article.

If you encounter a warning message “Warning: variance matrix is nonsymmetric or highly singular”, see this post in Statalist by Jeff Pitblado of Stata Corp.

If you encounter an error message “could not rename c:\ado\plus\stata.trk to c:\ado\plus\backup.trk r(699);” when you try to install an ado file by the “ssc install” command, read pages 47-48 of Lembcke (2009) “Introduction to Stata”. Unfortunately, this method does not change the Stata setting permanently. Every time you use an ado file, you have to do this.

FROM STATA TO OTHER SOFTWARE

Export tables to Excel, written by Kevin Crow on The Stata Blog.

How to transform a dta file into a csv file, by UCLA Academic Technology Service. If the data contain many decimal places, make sure to use the format command before the outsheet command so that Stata won’t round values unexpectedly. If you don’t need the top row containing variable names, use the noname option.
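A minimal sketch of setting the display format before exporting. The variable price_per_lb, the format %18.10f, and the file name auto_demo.csv are arbitrary choices for illustration.

```stata
sysuse auto, clear

* A variable with many decimal places
generate double price_per_lb = price / weight

* Widen the display format so the exported values keep ten decimals
format price_per_lb %18.10f

* Write a comma-separated file without the header row of variable names
outsheet make price_per_lb using "auto_demo.csv", comma nonames replace
```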

Order command by Stata Help - if you want to change the order of variables in the table you create from the Stata dataset.

How to edit Stata graphs in Microsoft Word, by Stata FAQ

Stata tools for Latex, by UCLA Academic Technology Service - for those of you who write empirical papers with LaTeX.

TEXTBOOK EXAMPLES

Stata commands for examples in Wooldridge's graduate level textbook Econometric Analysis of Cross Section and Panel Data, by UCLA Academic Technology Service

Stata commands for examples in Wooldridge's undergrad level textbook Introductory Econometrics: A Modern Approach, by Boston College Academic Technology Support

Stata commands for Greene's textbook Econometric Analysis (4th ed.), by UCLA Academic Technology Service

Accessible readings behind Stata commands

IVREG2

Murray, Michael P. (2006) "Avoiding Invalid Instruments and Coping with Weak Instruments," Journal of Economic Perspectives, 20(4), p. 128.

CLUSTER option for REGRESS

Deaton (1997) The Analysis of Household Surveys, pp.74-77.