(page under construction, May 2019)
NYUAD Students: see your NYU Classes course page for materials.
These are materials that I have put together as part of Stata-based courses at New York University Abu Dhabi. What you see here is generally what I teach to students in Statistics for the Social and Behavioral Sciences (intro stat) and/or Data Analysis: Economics (intermediate level data analysis). I also teach more advanced programming for upper level students and workshops on particular data types/issues. This Stata User Guide is a (draft) user-friendly guide to basic Stata to accompany a first treatment of regression analysis.
Each of the major statistical packages used in the social sciences has advantages and disadvantages across two dimensions: ease of use and flexibility. For instance, SPSS is very easy to use, but is relatively limited in terms of capability. R is incredibly versatile, but difficult for beginners to learn. Stata is at roughly the midpoint for each. One nice thing about Stata is its user interface. It lays things out cleanly and if you don't want to learn code (read: if you don't want to do things properly), you can execute most simple analyses using the drop down menus.
The main screen of Stata (what you see when you first open the program) has six useful components.
There are many good textbooks out there that include some level of Stata coding as part of statistics or research methods courses. I like "Microeconometrics Using Stata" by Cameron and Trivedi for students who need to conduct their own research, or Stata Press's "A Gentle Introduction to Stata" by Alan C. Acock for students who just need to learn the basics and haven't taken previous statistics courses.
To get assistance with a particular command, you can use Stata's help function. For example, to get help with running an Ordinary Least Squares (OLS) regression, you would type the following into the command line:
help regress
A window will then pop up with Stata's reference manual entry on the command. In particular, it will show you correct syntax (how to set up the code) and any available options. The syntax entry will tell you how to type the line of code in general. It usually takes the form
command required [optional] [if] [in] [, option]
where "command" is what you want Stata to do and "required" is usually a variable or list of variables (Stata uses "depvar" to mean dependent variable, "indepvar(s)" to mean independent variable(s), and "varlist" to mean list of variables). Anything inside of square brackets is optional, meaning not strictly necessary to get the code to execute, but necessary if you want anything other than the default command. "if" and "in" let you execute code conditionally, for instance if you want to only run your regression for a subset of your data. [, option] (note that the comma is important here) lets you specify other customization of the command. You can often execute multiple options for any one command, but you only need the comma once. Anything after the comma, Stata will know is an option.
This, though, assumes that you know the name of the command. If you don't know the name of the command, then the help function is not of much use. It's easier to revert to your textbook or to the internet than to sift through the Stata reference materials unguided. For instance, if you did not know that the command for OLS regression was "regress," then a Google search for "how to run regression in Stata" would be the easiest path.
Before you start playing with data, and if this is something that you are likely to do on a regular basis, then it would be a very good idea to organize your machine efficiently. Create a new folder for data, then a subfolder for each project. Then within your project subfolder, create another level of subfolders for different versions. This will not only make it easier for you to find your data and related Stata files, but also help you keep track of changes, assuming you don't complete your entire project in one sitting. If you're writing a full research article it is pretty much unheard of to do your entire analysis at once. Sometimes you will have coauthors/collaborators/professors who will want/need to see your code at various stages. They will invariably have comments or suggest different statistical models/approaches and some will even send you code of their own. Keeping old versions of your code and/or data will allow you to go back to what you did in a previous attempt without completely redoing the process.
Stata has several proprietary file formats that each serve different functions. Some of these can only open (or at least only open easily) in Stata. Others can at least be viewed in other programs.
.dta: This is Stata-formatted data. NOTE: the extension won't tell you this, but the version of Stata you have matters. New versions of Stata can read old Stata data, but old versions of Stata cannot read new Stata data. There are ways around this that we will discuss later.
.do: These are called "do files" and contain a full list of code that you want Stata to execute at once. Outside of Stata, you can view these in Notepad or any program that can read .txt files. If data analysis were a trip to the coffee shop, then the .do file would be your order.
.log: These are called "log files" and contain not just the code, but also the output that Stata generated. Keeping with the coffee shop analogy, the .log file would be your receipt taped to a nice hot cup of coffee. .log files would more appropriately be called ".did" files, because they contain what Stata did.
.gph: These are Stata graph files. A nice thing about Stata is that with these files, if you later want to edit the appearance of a graph, you can open these directly, without having to reopen your data or re-run your code. Note though, that if you want to change the substance rather than just the appearance, you would have to re-generate the graph completely.
.ado: These are user-created function files. Stata can do many things, but sometimes there are functions we need that aren't part of Stata's official release. Often these are things that only a relatively small subset of users need, so kindly nerds provide them for us.
Practically speaking, you can execute most basic analysis from the command line in Stata or using drop-down menus in the GUI. I strongly advise against this as among other reasons, (1) it will only get you so far, (2) it's inefficient, (3) you'll often need other people (professors/coauthors/coworkers) to replicate your work, and (4) did I mention it's inefficient? You will be much better served executing a batch of code at once than hunting through drop down menus to run model after model. An additional problem specific to students is that when you try to do something via the drop down menus, Stata presents a host of different options. Each of these options may correspond to a different type of model or complication that you haven't discussed in class yet. You might see an option, think to yourself "that sounds like something I might need," and select it without understanding why it's there. You can end up running the wrong model or at least losing points on a graded assignment.
Instead, work from a .do file. This lets you control what Stata is doing and lets you easily re-generate your results if you have to do the work in more than one sitting or go back and make changes later. The only thing you will need to change for the example .do file (see the sketch below) to run is the working directory: point it to wherever you stored the example data file. The associated codebook lists what variables are contained in the data and how each variable is defined.
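Here is what a minimal .do file might look like (the folder path and file names are placeholders for your own):
clear all
cd "C:\data\myproject\version1"
use exampledata.dta
summarize
The cd line sets the working directory; it is the line you would point at the folder where you saved the data.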
Within the .do file, it's important to use comments. By that, I mean leave notes for yourself or anyone else who might need to understand your code. Comments are bits of text that you want to tell Stata (or any other program) to treat as regular text and not as code. There are a few ways to do this: comment out an entire line, comment out the end of a line, or leave a comment in the middle of a line. You can comment out an entire line by starting the line with a star symbol. You can leave a comment at the end of a line of code by using two / symbols. It's pretty uncommon, but you can also put a comment in the middle of a line by starting with the symbol /* then writing your comment, followed by */ to end the comment.
* This entire line is a comment. There's no code here.
code code code // The end of this line is a comment.
code code /* This is a comment. */ code code
Stata .dta data files are opened via the "use" command:
use filename.dta
When working with textbooks or online examples, you will also see "sysuse." This is a special command that lets you open some pre-canned datasets that come automatically installed with your copy of Stata. This command works the same way as the "use" command. The only difference is that you didn't have to find and download a dataset first. To see a list of datasets that came pre-installed with Stata, type:
help dta_examples
into the command line.
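For instance, one of the classic example datasets that ships with Stata is auto.dta, which you can load with:
sysuse auto, clear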
ASCII, .csv, .tab, .tsv, and .dat can be brought into Stata via the "import delimited" command:
import delimited "filename.ext"
where "ext" is the extension of your particular file. Stata can then figure out on its own how your data is set up and how it should be brought in as variables.
It's pretty common to see data posted in Excel spreadsheet form. For these, there is a specific command "import excel."
import excel "filename.xlsx"
However, you will generally need more than this, especially if the first row of your .xlsx file contains variable names. Excel spreadsheets often contain multiple pages called sheets, where different data is stored. If this is the case, then you need to tell Stata where to look.
import excel "filename.xslx", sheet("firstsheet") firstrow
"firstsheet" is the name of the sheet you want to import (generally found on the bottom left of an Excel file) and firstrow tells Stata that the first row of the file contains variable names.
Import excel is limited to files 40MB in size or smaller. I suspect that this is because the process that Stata uses to import the data is unstable. The bigger the file, the more likely Stata is to crash. There is a way around this limit, but it's not well documented:
clear all
set excelxlslargefile on
import excel filename.xlsx, firstrow
save newfilename, replace
These four lines of code do four separate things. (1) Clear out everything in Stata's active memory. (2) Tell Stata to allow you to open big files. (3) Import the data. (4) Save the file in Stata's .dta format. Once you execute all four steps, you will have your data in .dta format. You can then use the .dta formatted file when you actually want to do your analysis, rather than having to redo these four steps. Why are just four lines of code a big deal? Because they will often cause Stata to crash, especially for really big files. You just need the code above to work once, then you can work with a .dta file that won't cause the same crashing problem.
Some data is posted in .txt (text) format, generally if whoever posted the file is trying to save space. Text files are efficiently written, so they can take up kilobytes or megabytes, rather than gigabytes. Text files can also be imported with the "import delimited" command, but the exact setup will depend on how variables/observations are delimited. By that, I mean how the file distinguishes observations from each other. If a file just listed 2394872309487234, then how would you know if those numbers are meant to be one huge value for one variable, one observation of sixteen different variables, four observations of four different variables, or something else? A delimiter is a character that tells programs how the data is supposed to be broken up. For instance, comma delimited data might be 2,3,9,4,8,7,2,3,0,9,4,8,7,2,3,4. It could also be space delimited, tab delimited, semicolon delimited, etc. Really, it could be any symbol, as long as it's used systematically. If our data is delimited with semicolons, the code would be:
import delimited "filename.txt", delimiters(";")
If your data source is less kind to you and none of the above work, try looking into "infile," "infix," and "odbc." These are less commonly needed. If you encountered an error while working with one of the previous commands, try to troubleshoot it first, before moving to these commands. I won't even bother explaining them here.
For most users, there are two types of variables in Stata: string and everything else. Observations of string variables can contain words, numbers, symbols, or any combination thereof. In order to run statistical analysis on string variables, you need to convert them first. A variable stored as str##, where ## is a number, is a string variable that can contain up to ## characters. The format strL is also a string; the L stands for "long." In the data editor/browser, these variables are shown in red. Other storage types like byte, int, float, double, and so forth contain numerical data. These mark how much space each observation takes up. In the early days of Stata, when space was expensive, storage type mattered a lot more than it does now. For instance, byte means that an observation takes up one byte of data. Now that most users won't blink at the prospect of using programs or accessing data that take up gigabytes of space, these distinctions are less important. Depending on what you are trying to do, you generally (there are exceptions) do not need to care about numerical storage type: it affects how the data are stored, not what Stata will or will not allow you to do.
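If you do need to convert a string variable before analyzing it, two built-in commands cover the common cases. A quick sketch, with made-up variable names:
destring price_str, generate(price) // numbers that happen to be stored as text
encode region_str, generate(region) // text categories converted to labeled numeric codes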
Now that we've brought some data into Stata, a good first step is to just look at the raw data and some summary statistics. Using the drop down menus, if you go to Data --> Data Editor, there are two options: "edit" and "browse." Select the "browse" option. This will bring up a spreadsheet that lets you easily scroll through your data. If you select "browse," then the dataset is locked. You can look at the data, but not make any changes. If you select "edit," then you get the same spreadsheet, but you can make changes. 99% of the time, you want "browse." Why? Because there is no "undo" button in Stata. If you accidentally make an incorrect change and don't immediately notice, you'll either have to go back and hunt for your mistake later or start your changes over from the beginning. If you do want to make edits to variables or values, you should do this via your code. I talk about how to do this in the "creating and defining variables" section.
Next, we can get a list of available variables and some useful information with the describe command.
describe
This will give you the variable name, storage type, and often a more detailed explanation of what the variable means as part of the "label." This is useful, because for larger datasets, whoever posted the data may just call variables var1, var2, var3, ... varN. You would then either need to look at the label or a separate codebook (usually a PDF or Excel file) to figure out what each variable is measuring.
To get some quick descriptive statistics, we can use the summarize command.
summarize
This will give us the number of non-missing observations for each variable, the mean, standard deviation, minimum value, and maximum value. To get some further detail on each variable, we can use the code:
summarize varname, detail
where "varname" is the name of the variable you want to examine.
If we have a categorical variable, we can get a table of frequencies with the tabulate command:
tabulate varname
If we include a second variable, we can get a cross-tabulation or "cross tabs" showing how many observations take each pair of values for the two variables:
tabulate variable1 variable2
Finally, we may want to just look at some observations of our dataset without going through the entire spreadsheet:
list varlist in 1/10
where "varlist" is the list of variables you want Stata to show and "in 1/10" means "show me what is stored in these variables for observations one through ten." Those could be any numbers, but generally you'll only need/want to look at a small number of observations, otherwise you might as well just go to the data browser.
Working with datasets that someone else created is all well and good, but you will almost always need to create new variables or re-code existing ones in order to accomplish a given task. Suppose that you have a variable called "x" containing whole numbers from 1 to 10 and you want to code a new variable "xbinary" coded 0 if x is less than or equal to 5, and 1 if x is greater than five. We can do so as follows:
generate xbinary = .
replace xbinary = 0 if x <= 5
replace xbinary = 1 if x > 5
The first line creates a new variable called xbinary. The dot means that initially the variable will just be empty. The second line codes xbinary as 0 if x is less than or equal to 5 and the third line codes xbinary as 1 if x is greater than 5. This approach is identical to:
generate xbinary = 0 if x <= 5
replace xbinary = 1 if x > 5
So why did I start with an extra step in the first version? Because when you start coding more complicated variables, you'll be less likely to make a mistake if you define the variable first and then define the values. Note that to type "less than or equal to," you have to type the less than sign, followed by the equal to sign. More on this in the "Coding Operators vs. Math Operators" section.
In the above example, we defined our new variable to take particular number values if a condition was met. In place of the numbers, we can also use mathematical expressions. Suppose that we have a variable in our dataset called "birthyear" for the year in which a person was born and we want to use it to create a variable for approximate age in years.
generate age = 2018 - birthyear
More generally, this is
generate varname = expression
The expression can be any mathematically legal operation (addition, subtraction, multiplication, division, and many others).
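The expression can also include Stata's built-in mathematical functions. For example (again with hypothetical variable names):
generate logincome = ln(income) // natural log of income
generate agesquared = age^2 // age squared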
We can also generate variables with more complex operations using "extended generate."
egen newvar = function(arguments)
For example, suppose we have three variables, "var1," "var2," and "var3," and suppose that we want to create a new variable containing the average value of the three for each observation. This can be done with the code:
egen avgvar1 = rowmean(var1 var2 var3)
This could also be done with the code:
generate avgvar2 = (var1 + var2 + var3)/3
But! What would happen if you had missing data? The results would differ between the two approaches and look like this:
observation | var1 | var2 | var3 | avgvar1 | avgvar2
1 | 4 | 3 | 2 | 3 | 3
2 | 2 | 5 | . | 3.5 | .
3 | 1 | 4 | 7 | 4 | 4
For observations 1 and 3, both approaches work perfectly, but not for observation 2. For observation 2, the value of var3 is missing (as shown by the dot). The two commands handle this differently. With the generate command, any arithmetic expression involving a missing value evaluates to missing, so "(2 + 5 + .)/3" simply returns a missing value for avgvar2. The extended generate command's rowmean() function, by contrast, is built to handle missing data: it averages only the non-missing values, so it sees 2 and 5, divides by two, and returns 3.5. Neither behavior is wrong, but if you don't know about the difference, you can end up with a variable that doesn't mean what you think it means.
The extended generate command can execute many calculations and automatically take account for missing data. For a list of what's available, see:
help egen
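A few egen functions you are likely to meet early on, sketched with hypothetical variable names:
egen totalscore = rowtotal(test1 test2 test3) // sum across the three variables, ignoring missing values
egen maxscore = rowmax(test1 test2 test3) // largest of the three for each observation
egen groupavg = mean(score), by(group) // mean of score within each value of group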
While learning how to create data, it's similarly useful to learn how to destroy data. The command
clear
deletes all variables in Stata's active memory.
clear all
does the same, but also gets rid of other objects you may have stored. If we instead want to just get rid of individual variables, we can do this with the drop command.
drop var1 var2 var3
This can be done one variable at a time or with a long list of variables. If your dataset initially has a large number of variables, but you only care about a few, then
keep var1 var2 var3
will keep only the variables you list and drop the rest. Each of these can be executed conditionally. Suppose we have data on individuals ranging in age from 10 to 99, but we only want to study those who are at least 18 years of age.
drop if age < 18
will accomplish this task. Be careful dropping values though, lest you accidentally drop something you might need later.
By this point in life, you have become well accustomed to mathematical operators like +, -, =, and the like. The problem is that in virtually all programming contexts, these symbols may have programming meanings that differ from their mathematical meaning. For other symbols, there wasn't always an easy way to type them on most keyboards, so we use a combination of symbols instead. Still others are logical/programming symbols with no everyday mathematical counterpart. Each of these can be used in the variable-generation commands discussed above.
+ "plus" - "minus" / "divided by" * "multiplied by"
^x "to the power of x" = "assign the value" == "equal to"
> "greater than" >= "greater than or equal to"
< "less than" <= "less than or equal to" & "and" | "or"
The last two, "and" and "or," can be applied to check if multiple conditions are true. For example:
generate xbinary2 = .
replace xbinary2 = 0 if (x > 3 & x < 7)
replace xbinary2 = 1 if (x <= 3 | x >= 7)
This creates a new variable coded 0 if the value of x is strictly between 3 and 7, and 1 if x is less than or equal to 3 or greater than or equal to 7. You could execute the same thing with
generate xbinary3 = .
replace xbinary3 = 1
replace xbinary3 = 0 if x > 3
replace xbinary3 = 1 if x >= 7
But this second approach will be far less efficient if you have complex operations to execute and is more likely to lead you to make a mistake. In any programming task, there will be multiple ways to accomplish the same goal, but (1) in the long run you want to program as efficiently as possible to save yourself time and effort and (2) if you make a mistake, you want that mistake to be as easy to identify as possible.
It's pretty common for students in a first stats class to not like textbook examples or working with fake/contrived data sets that the professor has generated, but simulating data is actually an incredibly useful skill. It can help you better understand probability distributions, what results particular formulas/functions/commands generate, and in some cases simulations are necessary to evaluate your models. If you control what the data looks like, then you know what the formula or model should produce. This is a very brief introduction to the topic, but trust me that it's useful and you will need it sooner or later. Within Stata, there are a few ways in which to simulate data, but let's introduce two: one based on observation number and one based on probability distributions. Before we start simulating data though, we should clear Stata's active memory and then tell it how many observations we want in the new dataset:
clear all
set obs 100
We've used the first line before. The second line tells Stata that we want to create a dataset with 100 observations (this number can be any positive integer). When you import an existing dataset, Stata does this automatically, so you don't need to use this command. But because we're creating a new dataset, we need to tell the program how big that dataset needs to be.
Observation number
It can be useful to define a variable based on the observation number alone or based on the value of another variable associated with a different observation. This will be especially important when you get to time-series, but we'll introduce the idea now. The code:
generate obsnum = _n
This creates a new variable called "obsnum" containing just the observation number. "_n" tells Stata to use the row number of that observation as the value of the variable. We can also use this function to define a variable based on a previous or future value of another variable.
generate pastobs = obsnum[_n-1]
generate futobs = obsnum[_n+1]
The variable pastobs now contains the value of obsnum from the previous observation, while the variable futobs contains the value of obsnum from the next observation. Take a look at the data browser. Note that the first observation of pastobs is missing, as is the last value of futobs. This makes sense, right? Pastobs is defined as the value of obsnum from the previous observation. For observation 1, there is no previous observation, so Stata can't define a value. This is not a problem, just something to remember. We can combine this function with basic math operations and we can use any past/future observation.
generate diff = obsnum - obsnum[_n-1]
gen diff2 = obsnum - obsnum[_n-2]
gen crazy = (obsnum[_n-2]*obsnum[_n+1])^(0.324)
If you create a variable based on other observations and end up with missing values, it is usually because one of the values you tried to use does not exist. Again, this is not necessarily a problem, but it is something to which you should pay attention. Some operations, like differences and lags a particular number of observations away, are so common in time-series analysis that Stata has special functions to make them faster to work with. We'll talk about these later.
Probability Distributions
We can also create new variables based on draws from probability distributions. The two most basic distributions are the normal and uniform distributions. The normal distribution is the bell curve. Most of the data is close to the mean, with less and less, the further we get from the mean. The code:
generate rand = rnormal()
creates a variable called "rand" containing draws from the standard normal distribution. That is, the bell shaped curve with mean 0 and standard deviation 1. That is the default, but we can set any mean and standard deviation we need. If we wanted a mean of 85 and a standard deviation of 5, we can do this with:
generate rand2 = rnormal(85,5)
The rnormal function takes arguments (mean, standard deviation). The mean can be any real number and the standard deviation can be any positive real number.
The uniform distribution is flat, meaning that any value between the lower bound and upper bound is equally likely. In Stata:
generate randu = runiform()
gives you draws from the uniform distribution on the open interval from 0 to 1 (so exactly 0 and 1 are not possible, but everything in between is). If you want to set a lower bound of 40 and an upper bound of 100, then you would do so as follows:
generate randu2 = runiform(40,100)
NOTE: I assume that if you're learning Stata now then you have a recent version of Stata. The current version (as of writing this) is 15. If you have an older version of Stata, then the runiform() function doesn't work the same way. If this applies to you, then type
help runiform()
into the command line and read the instructions on how to use the command.
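One common workaround in older versions, where runiform() takes no arguments, is to rescale the 0-to-1 draw yourself:
generate randu2 = 40 + (100-40)*runiform() // uniform draws between 40 and 100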
While we're talking about drawing random variables, we should talk about "setting the seed." Every time you draw a random variable, the results will be different. Why? Well, because they're random. But when you're working with data, you don't want the results to be different every time you run your code. We can avoid this by setting the seed. It works as follows:
set seed 382
gen rand3 = rnormal()
Everyone who runs those two lines of code (assuming they have the same version of Stata) will get the same values in the variable rand3. This works because when programs draw "random" numbers, they're not actually random. Stata and virtually all similar programs use a pseudo-random number generator, which behaves like a giant, fixed list of numbers: the seed tells Stata where on that list to start picking. The number after "set seed" can be any whole number. The same number will get you the same result. Different numbers will give you different results. Try running the following lines of code and look at the data browser to see what it produces.
set seed 5
gen test1 = rnormal()
set seed 5
gen test2 = rnormal()
gen test3 = rnormal()
set seed 4077
gen test4 = rnormal()
The variables test1 and test2 will be the same. The variables test3 and test4 will be different.
If you want to visualize data, the most basic figure is a histogram. It shows the range of one variable on the x-axis and displays bars showing how common ranges of values are in the dataset. Type the following code into the command line:
clear all
set obs 1000
generate x = rnormal()
hist x
The first three lines simulate some data. In the fourth line, "hist" is the Stata abbreviation for histogram. The variable x is drawn from the normal distribution, so the histogram should be a set of bars in roughly the shape of a bell curve. Notice that on the y-axis of the histogram, it's showing "density." With density, the areas of the bars (height times width) add up to 1. This is Stata's default setting, but I don't like it. I prefer showing "the percentage of observations in each category" or "the number of observations in each category" instead. If you're putting a histogram in a presentation, either of these is easier for audiences to understand than density.
hist x, freq
hist x, percent
Note that whether you have density, frequency, or percent on the y-axis, the histogram looks the same. Why? Because it is the same. All the histogram does is show you this distribution of the data. Stata will automatically decide how many bars to show on the histogram and how wide each bar should be. If you want to change these or otherwise alter the appearance of the histogram, type:
help hist
And scroll through for the appropriate option(s).
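For example (the option values here are arbitrary), you can set the number of bars or their width yourself:
hist x, percent bin(20) // exactly 20 bars
hist x, freq width(0.25) // each bar covers 0.25 units of x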
Stata can create a wide range of different graphs depending on the type of variables you have and how you want to display them. Two-dimensional graphs (meaning one variable on the x-axis, one variable on the y-axis) are the most common. We'll only discuss two kinds right now, but to see the full list, consult the following help files:
help graph
help graph twoway
Most of the graphs you will need to create in an introductory stats class will be under "graph twoway" and will generally take the same syntax:
graph twoway graphtype yvariable xvariable [if] [,options]
As with most things in Stata, we list the dependent variable first. For instance, if we wanted to create a scatter plot of age (our independent variable) versus income (our dependent variable), the code would be:
graph twoway scatter income age
We can also graph variables conditionally. Suppose that we had a variable for sex coded 1 for female and 0 for male. We could create a separate graph for each as follows:
graph twoway scatter income age if sex == 1
graph twoway scatter income age if sex == 0
If we want, we could combine these into a single graph by separating commands with parentheses.
twoway (scatter income age if sex == 1)(scatter income age if sex == 0)
This would give us the same information, but now in just one graph, with male or female distinguished by different colors. Notice how we didn't even need to write "graph" at the start of the line. Every time Stata sees the command twoway, it knows that "graph" was the only thing that could have come before it. Because of this, the folks at StataCorp have made things a little easier and not forced us to repeatedly write out "graph."
In creating our scatter plots, we've had two options: create two separate graphs or create one graph with everything. If you want to put two graphs side-by-side, we can do this with the graph combine command.
twoway scatter income age if sex == 1, saving(graph1.gph, replace)
twoway scatter income age if sex == 0, saving(graph2.gph, replace)
graph combine graph1.gph graph2.gph
The "saving()" option in the first line tells Stata to save the first graph in a file called graph1.gph. The .gph extension is Stata's default graph format. The ", replace" says that if this file already exists, replace it with the new one. The second line is similar to the first. In the third line, we combine the two graphs, telling Stata which files to combine into a single file. After you execute all three lines, you'll see your two graphs in a single window.
All of the "graph twoway" kinds of graphs work more or less the same way. Bar graphs are a little more complicated. Suppose that instead of age measured in years, we have age categories (say 21-30, 31-40, 41-50, 51-60, and 60+). The code:
graph bar income, over(age)
would give us a bar graph showing the average value of income for each age group. Whereas all of the "graph twoway" commands have similar syntax, each of the types of regular "graph" commands have different syntax. If you want to do something other than a bar graph or twoway graph, you will need to consult the help files listed earlier.
Once you have told Stata the substance of what you want to include in a graph, you'll want to change the look and feel of the graph. Technically speaking, you don't have to do this, but Stata graphs are Ugly and the capital U is not a typo. The light blue-grey background, terrible color choices, lack of proper labels/titles, and so forth are almost willfully bad. Never just paste a raw Stata graph into a paper or presentation. You want to make your graphs as pleasant as possible for the audience to read or at least less painful than what Stata produces by default.
If you want to play with the appearance of your graph, you can use Stata's graph editor. If you have a graph open already, it will appear in a separate window. Near the top of this window will be a series of menus and icons. One of the icons looks like a bar graph with a pencil on it. Click on this icon and the graph editor will open. You can double-click on different parts of the graph and edit them manually. Note that you can only change the look and feel of the graph, not its substance. If you want to create a different graph, you will have to do that with code.
Once you have finished making changes to the graph, go to File --> Save As. Give your graph a name and store it in a place where you can find it later. Under "save as type" it should say "Stata Graph (.gph)." Do this first. This is just a Stata graph, not something you could paste into a paper or presentation. For that, you would need to save your graph in some sort of picture format. The .png and .pdf formats work well for pasting graphs into programs like Word or PowerPoint. If you work in LaTeX instead, you can save graphs in .eps format. The nice thing about the .gph format and the reason why I said to do this first is that if you want to reformat a graph while working on your paper or presentation, you can double-click on the saved file and the graph will immediately open in Stata without you needing to regenerate it from code.
The graph editor is great while you're learning statistics, as it reduces the amount of code you need to learn in order to produce good looking graphs. In the long run, however, especially if you are going to need to use Stata in multiple classes or in your career, it's better to format graphs with code rather than the graph editor. If you've just created a twoway graph and want to customize it without using the graph editor, go to the help file:
help twoway options
and a whole host of options will appear. Each of the links in this file will take you to another huge help file. It can be a bit daunting, really. Let's talk about a few important ones now, but eventually you will need to work your way through the options, sub-options, sub-sub-options, and so forth if you really want to be in control of your graphs. First, I'll just list a few of the more common options.
title("Put The Graph's Title Here")
ytitle("Dependent Variable")
xtitle("Independent Variable")
ylabel(0(10)100)
xlabel(0(10)100)
scheme(schemename)
The first three are self explanatory. title() gives the graph a title and ytitle() and xtitle() let you control what words appear on the y and x axes, respectively. Stata's default is to show the variable names or labels on the y and x axes, but your graph's reader will thank (or at least not curse) you for giving them proper titles. ylabel() and xlabel() let you control how many tick marks are shown on each axis and how often. 0(10)100 tells Stata to start the axis at 0, end the axis at 100, and place markers every 10 units. More generally, a(d)b starts the axis at a, ends it at b, and places a marker every d units. If you try to set the value of a higher than the minimum value of the variable or if you try to set the value of b lower than the maximum value of the variable, Stata will ignore you or produce something really ugly. The last option I've listed here, scheme(), lets you quickly change a bunch of options at once, including background color and the colors used on the graph(s). For a list of schemes, type:
help scheme
and play around with each. I personally find the s1color scheme to be the least offensive of the schemes that come pre-installed with Stata.
Now why didn't I just put all of those in a single line of code rather than explain it? Well, think about how long that line of code would have to be. It would run way longer than the width of this page. That's not necessarily a problem, but it is annoying to work with, especially if you are writing a lot of code. If we're working in a .do file (and only in a .do file) rather than in the command line, we can change Stata's delimiter. By that, I mean we can change when Stata considers a line of code to be complete. The default is when you hit the "enter" button. This is called a "carriage return" and hearkens back to the days of typewriters: at the end of typing a line, the carriage would have to physically move or "return" back to the start of the next line. Some keyboards still call it the "return" key. For more on this and more useful information about the bad old days, ask your grandparents. Back to delimiters. The code:
#delimit ;
code code code
code code code
code code code;
code code code
code code code;
code code code;
#delimit cr
does three things. 1. It changes the delimiter to a semicolon. 2. It runs code as a single statement until it hits a semicolon, then starts treating what follows as a new statement. 3. It changes the delimiter back to the carriage return that you're used to already. Note that even though the code is spaced out into six lines, Stata will treat this as though it's only three lines. Lines 1-3 are treated as one line of code, lines 4-5 are treated as one line of code, and line 6 is treated as one line of code. It doesn't matter how you space out the code, Stata will treat everything as one continuous line of code until it hits a semicolon. Now we're ready for a graph example.
#delimit ;
twoway scatter yvar xvar,
title("Age Versus Income")
ytitle("Income (USD)")
xtitle("Age (years)")
scheme(s1color);
#delimit cr
If you are working in the command line, then you'll have to just write the code all in one line. This is another case in which writing your code in a .do file is not just better, it also makes things easier.
Stata comes with a lot of built-in functions, but not 100% of what you will eventually want or need to do. When users encounter tasks that they regularly need to carry out, but can't accomplish with existing Stata functions, they write their own packages that they can add-on to Stata's capabilities. These are called .ado files. If a person goes through the trouble of writing one of these packages, they generally share it as widely as possible, meaning that it's available for you to download. Once you install the package, you have access to the new functions as if they were any other Stata command. There are several ways to download .ado files: "findit," "ssc install," "net from," and just manually via your browser.
Some packages are not part of the base Stata distribution, but are managed by the folks at StataCorp. To install one of these, type:
findit packagename
Where "packagename" is the name of the function you want to install. Stata will bring up a window full of results. Scroll down the list until you find the desired package, then follow the links to install it automatically.
Other packages are stored in online repositories, the most common of which (for Stata) is Boston College's Statistical Software Components (SSC) archive. To install one of these, all you need to do is type:
ssc install packagename
and the package will install completely.
Still other packages are neither managed by Stata nor stored in an online repository. For these, the installation is only slightly more complicated. First you tell Stata where to look online and then you tell it what to download from that site.
net from https://rest/of/web/address/here/
net install packagename
Finally, if none of the above work, you can just download and install the package manually. You'll generally only need to do this if your network's settings are set up in a way that blocks Stata from downloading packages or if someone has emailed you the package rather than posting it to a website. In either case, download the package to your computer, then type
sysdir
into the command line. Stata will bring up a list of addresses on your machine, one of which will be marked "PLUS." For Windows users, this is usually C:\ado\plus\. If you open this folder on your computer, you will see a list of folders named with different letters of the alphabet. Open the folder with the same letter as the first letter of the package name and copy the .ado file and help file (if you have one) to this location. You've now installed the package and Stata will be able to find it.
After any installation, you should have access to new help files too. type
help packagename
into the command line and Stata will bring up the help window the same way that it would for any built-in Stata function.
(most certainly not a complete list)
estout: combines Stata regression output from multiple models into LaTeX code that can be copy-pasted into a LaTeX document. Also contains a function "esttab" that lets you quickly combine and compare models within Stata.
outreg2: combines Stata regression output from multiple models into Word/Excel/other formats
clarify: for interpreting and presenting results. The code for installation is:
net from https://gking.harvard.edu/clarify/
net install clarify
NOTE: If you get repeated messages saying "I/O error..." then you will have to download the package manually via your browser.
freduse: lets you directly download and import time-series Federal Reserve Economic Data (FRED). If you have Stata 15 or newer, you even get a graphic interface for searching through data files without going to the FRED website.
fetchyahooquotes: lets you directly download and import time-series stock market data
This is data where the main variation occurs across space rather than over time. For instance, suppose that you are studying the outcome of a nationally representative survey carried out in October 2018. You're interested in comparing differences between individuals, not how the opinions of individuals change over time. This is the most common type of variation used in the social sciences and also happens to be the easiest to introduce without significant treatment of linear algebra or multivariate calculus.
NOTE: If you are an introductory statistics student, some of what you see below might not make sense, because you haven't studied it yet. Consult your textbook and/or notes for what these things mean and if you need to use them.
Suppose that before I give my students a test, I expect that the average score will be equal to 80%. Then I give my students the test, observe their scores, and record them in a variable called "score." I can test my hypothesis with a one-sample t-test.
ttest score == 80
Grouped t-tests, as the name would suggest, are used to compare groups on values of a particular variable. We have Group A and Group B and want to see which has a higher average value for the variable x. For example, if we want to argue that trucks get lower gas mileage than cars, then we would need two variables: "vehtype" coded 1 for truck, 0 for car and "mileage" measuring miles driven per gallon of gasoline (kilometers per litre of petrol works just as well for this example).
If you can assume that Group A and Group B have equal variance on mileage, then the code would be:
ttest mileage, by(vehtype)
If you cannot assume equal variances, then it would be
ttest mileage, by(vehtype) unequal
Note that this only works if the group variable is binary. If you have three or more groups, Stata can do this too, but it takes an additional step. Suppose that instead of just cars and trucks, we treat Sport Utility Vehicles (SUVs) as their own separate group. Now vehtype is coded 0 for cars, 1 for trucks, 2 for SUVs. We can only do these t-tests two at a time:
ttest mileage if vehtype != 0, by(vehtype)
ttest mileage if vehtype != 1, by(vehtype)
ttest mileage if vehtype != 2, by(vehtype)
The added bit of code tells Stata, in each line, to ignore one of the groups and only compare the two remaining groups. We can compare trucks to SUVs, cars to SUVs, and cars to trucks.
These are just two of many different types of t-tests. For a list of some that Stata can execute with a different setup of the same command, type
help ttest
If we have two continuous variables, we can measure the extent to which they move together or move apart by looking at the covariance and/or correlation coefficient. In both cases, a positive sign means that as one variable increases, the other variable tends to increase, while a negative sign means that as one variable increases, the other tends to decrease. They differ in calculation, application, and interpretation. At the introductory level, covariance is generally taught first, because it's useful in understanding correlation and regression, but it's not that helpful in interpretation. It's much more useful in intermediate or advanced courses. In contrast, the correlation coefficient is instantly useful in interpretation, but isn't used nearly as much in more advanced statistics. We can get both from Stata easily.
Correlation Coefficient:
correlate varlist
Covariance:
correlate varlist, covariance
"Varlist" can be a list as variables as long as you would like. The result will be a table of numbers. If you go to the column for your first variable and row for your second variable, the value will be the correlation coefficient or covariance, depending on what you told Stata to calculate. If you have Stata calculate correlation coefficients, you'll see 1.0000 along the diagonal. This makes sense, as we would expect a variable to be perfectly correled with itself. If you have Stata calculate covariance, the diagonal will contain the variance of that variable. On the top right of the table, you'll just see blanks. This is not a mistake, rather that Stata doesn't need to give you the same information twice. The correlation between variable 1 and variable 2 is the same as the correlation between variable 2 and variable 1.
Correlation coefficients tell you the direction of a relationship between two continuous variables with a positive or negative sign. They also tell you the strength, but only to the extent that -1 means perfect negative, +1 means perfect positive, 0 means absolutely no relationship, and values close to any of these can tell you that a relationship is strong or weak. Beyond that, though, correlation coefficients don't add much to our understanding. Wouldn't it be nice if there was another tool that would let us say "a one unit increase in X leads, on average, to a particular change in Y?" Well, if this problem keeps you up at night, you can rest easy, my friend. Ordinary Least Squares (OLS) regression can do just that. Stata can do this with the regress command
regress y x
where "y" is your dependent variable and "x" is your independent variable. This one little line of code produces a whole table of results, which can be a bit daunting your first time through, as some of what you'll see won't mean anything to you yet. Let's take it one piece at a time. First, the column marked "Coef." stands for "coefficients." The value directly next to your independent variable is the estimate of beta. If you were looking at the graph of a straight line, this would be the slope. Below that, you'll see "_cons." This is the "constant," also referred to as alpha or as "beta naught." On that same graph of the straight line, this would be the y-intercept. That is, it is the expected value of our dependent variable, if our independent variable is equal to zero.
The rest of the table gives you information on whether our estimated coefficients are different from zero and how tightly the data cluster around the straight line that we estimated. "Std. Err." is the standard error. Lower values mean that we are more confident in our estimate. The next four columns all help us decide whether our estimate is statistically different from zero. "t" is the t-score. It is calculated as the coefficient divided by the standard error. "P>|t|" is the p-value: roughly, the probability of getting an estimate at least this far from zero if the true coefficient were actually zero. The norm within science is that we want this p-value to be smaller than 0.05. If the p-value is smaller than 0.05, then we can "reject the null hypothesis at the 95% level." If it is bigger than 0.05, then we "fail to reject the null hypothesis at the 95% level." The p-value reported by Stata is the same as if you took that t-score and compared it to a t-table, the way that you might need to do on a homework or exam question. The only difference is that Stata's t-table is much more precise than most tables you would find in the back of a textbook. Finally, the "[95% Conf. Interval]" is the 95% confidence interval around our estimate. In other words, we have our estimate of beta, but it's just one estimate and is based on a sample. We're 95% sure that the "true" value of beta is somewhere between the lower bound and upper bound.
That was two long paragraphs to understand about half of what Stata reported after one line of code. The good news is that if you're taking a first stats class, you can probably ignore most of the rest. The one thing I would call your attention to is R-squared, which is a "measure of model fit." It's the fourth row on the top right of the output. It tells us the proportion of variation in our dependent variable being explained by our independent variable. For instance the value 0.8422 would mean that our model explains 84.22% of the variation in our dependent variable. As an example, try executing the following code in Stata:
clear all
set obs 150
generate x = rnormal()
generate y = 1.35 + 2.4*x + 0.2*rnormal()
regress y x
You should see a value close to 2.4 for the coefficient of x and a value close to 1.35 for the constant. It won't be exact, but it will be close. Why? Because of that "0.2*rnormal()" at the end of the fourth line. Try playing around with these five lines of code. Changing the numbers in line four could help you better understand what they do. You should also try changing the number in line two.
After you've finished playing, run the original five lines of code again and then this:
twoway (scatter y x) (lfit y x)
This is a graph of the regression we just ran. The dots represent each data point and the line represents the straight line we just calculated with the regress command. Beta is the slope of that line and the value of y when x is equal to zero is the constant.
We're not limited to just one or even a specific number of independent variables. Including more than one independent variable is called "multiple regression" and works the same way.
regress y x1 x2 x3 x4
The results Stata produces look very similar, just with one extra row in the table for each variable you've added. The above example would have rows for x1, x2, x3, x4, and the constant. The estimated value of the beta for x1 would be interpreted as "the effect of a one unit increase in x1 on y, holding x2, x3, and x4 constant."
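If you want to see this for yourself, you can extend the earlier simulation exercise (the coefficient values here are arbitrary):
clear all
set obs 150
generate x1 = rnormal()
generate x2 = rnormal()
generate y = 1 + 2*x1 - 0.5*x2 + rnormal()
regress y x1 x2
The estimated coefficients should land close to 2 and -0.5, with the constant close to 1.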
After running OLS regression, we can generate predicted values of our dependent variable and residuals (the difference between the actual value and the predicted value of the dependent variable) using the predict function. This works the same whether you have one independent variable or many independent variables.
regress yvar xvarlist
predict yhat
predict yhat1, xb
predict yresid, residuals
generate yresid1 = yvar - yhat
The second and third lines do the same thing: they create a new variable with predicted values of the dependent variable "yvar" based on our model. xb is the default option for the predict command. If you've taken linear algebra before, then the name "xb" makes a bit more sense. The fourth and fifth lines also do the same thing: they create a new variable containing the residuals.
Nominal variables are variables that are categories, but without an obvious order. Suppose that you had a theory saying that restaurant profits depend on the color the dining room is painted. Your dependent variable is "profit" and your independent variable is "paint," coded 1 for white, 2 for black, 3 for neon yellow, and 4 for red. Directly using this type of variable in OLS regression would be inappropriate. Recall that a regression coefficient means "a one unit increase in x leads, in expectation, to a beta change in y." For this to be plausible, a one unit change in x needs to mean the same thing for all possible values of x. Here, it does not. It doesn't make sense to think of, for example, neon yellow as being one unit better than black, nor for red to be one unit better than neon yellow. Instead, we need to break the paint variable into a series of dummy variables.
tabulate paint, gen(paintcat)
The first part of this code looks familiar. "tabulate paint" creates a table showing the number of observations in each category of paint. ", gen(paintcat)" then creates a new dummy variable for each category of the variable paint. Take a look at the variable list on the top right of your screen. You'll see that new variables have appeared in your dataset. The above code will create, in this example, four new variables: paintcat1, paintcat2, paintcat3, and paintcat4. The variable paintcat1 is coded 1 if the variable paint was coded 1, 0 otherwise. The other variables are similar: each is coded 1 if the variable paint took that particular value, 0 otherwise. Now we can use these dummy variables in our regression instead of our original paint variable.
regress profit paintcat2 paintcat3 paintcat4
Note that we only included three categories. This is to avoid perfect multicollinearity, also referred to as the "dummy variable trap." If we want to include all four dummy variables, we have to suppress the constant term. We can do this with the code:
regress profit paintcat1 paintcat2 paintcat3 paintcat4, noconstant
There's another way we could do this and that's with a wildcard:
regress profit paintcat*, noconstant
The star at the end of the paintcat variable name tells Stata to include every variable that starts with "paintcat" no matter what comes next. We have four paintcat variables, so Stata treats the above code the same as if you had written out all four. This will be useful if you have to work with a lot of repetitive variable names.
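As an aside, recent versions of Stata can build the dummies for you on the fly using factor-variable notation (the i. prefix). It automatically omits one base category, so you avoid the dummy variable trap without generating anything by hand:
regress profit i.paint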
This part requires you to have installed the estout.ado and outreg2.ado packages listed among the useful packages in an earlier section. Do yourself a favor and do this. It will save you approximately 1.3 metric tons of menial labor when writing papers.
One of the most tedious tasks involved in data analysis is putting together tables of results. It's easy, just boring, repetitive, and time-consuming. Moreover, even after you've finished a draft of your paper or presentation, someone (a co-author, professor/boss, etc) will ask you to include a different combination of variables for one reason or another. When you do this, your table will change. Over the course of your project, you could end up spending hours just manually entering coefficients on tables. OR, you could get Stata to do most of the work for you. To do this, you need to (1) run a model, (2) store the result, (3) repeat steps 1 and 2 as many times as needed, and (4) output the results into a table.
First, let's run three models and put them into a quick table within Stata. Doing this is nice, because it lets you quickly compare results across multiple models without having to scroll up and down constantly. We'll use the "eststo" (for "estimates store") and "esttab" (for "estimates tabulate") commands, which are both part of the estout.ado package.
regress y x1
eststo first
regress y x1 x2
eststo second
regress y x1 x2 x3
eststo third
esttab first second third
The three "eststo" lines store the results of the regression we ran immediately before. In each case, we give the result a name, so that we can call it back later. I've just named the models "first," "second," and "third," but you can name them whatever you'd like. The "esttab" line tells Stata to combine the three models into a single table. The result isn't fancy, but is easier than scrolling through potentially pages of results.
The next step, creating a formatted table, will depend on whether you are working in Word or in LaTeX.
If you work in Word...
We'll do the rest with the outreg2.ado package. It's just one line of code beyond what we've already done.
outreg2 [first second third] using example_table.doc, replace
This will create a formatted table called "example_table.doc" (note that it's .doc not .docx) located in your working directory. If you open this document, you'll have a formatted table that you can paste into your paper or presentation. It won't be 100% formatted, but maybe 95%. Any remaining changes will be relatively easy.
This code includes certain statistics by default. If there are other things you need to include in your table, you may be able to tell Stata to add them. See the outreg2 help file for details and there are many examples online. If you can't figure out how to get Stata to add a particular statistic as part of the outreg2 command, you could always just use the above code to create a basic table and then add the last bits manually.
NOTE: After you've finished looking at your table and/or pasting it to your paper, be sure to close the file that outreg2 created. If you run your code a second time with the file still open, you'll get an error message.
If you work in LaTeX...
We'll keep working with the estout.ado package. Before you start, create a template for your table in your LaTeX editor. This will be something like:
\begin{table}
\begin{center}
\caption{Table Title Here}
\begin{tabular}{l c c c}
% paste your estout output here
\end{tabular}
\end{center}
\end{table}
In the blank space, you'll paste your models. This example is for three models, but you can include as many as you would like (up to the maximum that you can fit on a page). Now we can use the following code in Stata for a very basic table:
estout first second third, style(tex)
Consulting the estout.ado help file, you can also find more options to customize what is included in the table and how it is formatted. Formatting a table can make for a really long line of code, so we'll use the same delimiter trick that we used when introducing figure formatting.
#delimit ;
estout first second third,
cells(b(star fmt(3)) se(par fmt(3)))
starlevels(* 0.10 ** 0.05 *** 0.01)
legend label varlabels(_cons constant)
margin stats(r2 N)
style(tex);
#delimit cr
If you're already to the point of working with LaTeX, then you can pretty easily figure out what each line in the above does.
Many social science phenomena are binary: yes or no, something did or did not happen, and the like. If we use OLS with this sort of dependent variable, it is called a "linear probability model." Pretty much the only nice thing about these models is that they are easy to estimate and interpret: a one unit increase in an independent variable leads, on average, to a beta change in the probability that our dependent variable is equal to one. However, don't do this unless you have no other choice or a textbook/homework/quiz question directly asks you to. Why? It would be bad. (Okay, that's hyperbole.) Speaking more practically, these models generate unrealistic predictions and almost guarantee heteroskedasticity. If your class has introduced the Gauss-Markov assumptions of Ordinary Least Squares regression, this should mean something to you. Try the following code
clear all
set obs 1000
gen x = rnormal()
gen y = rnormal() + .5 + (.4*x)
replace y = 1 if y >0
replace y = 0 if y <= 0
regress y x
twoway (scatter y x) (lfit y x)
If you look at the regression results, OLS appears to run just fine. But look at the graph we created. Not only does OLS not seem to fit the data particularly well anywhere, but the variance of our errors is certainly not constant.
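If you want to see the "unrealistic predictions" problem concretely with the simulated data above, you can inspect the fitted values from the linear probability model. A quick sketch (yhat_ols is a name I'm making up here):
* fitted values from the regression we just ran
predict yhat_ols, xb
* the minimum and maximum will typically stray outside the 0 to 1 range a probability requires
summarize yhat_ols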
One solution is to try to correct for the fact that we have heteroskedasticity. We can do this by using Heteroskedasticity Consistent Standard Errors (also called robust standard errors) instead of the regular ones we get with OLS.
regress yvar xvarlist, vce(robust)
This keeps the same beta estimates from OLS, but uses a different estimation procedure to calculate the standard errors that we will use in hypothesis testing. These standard errors are supposedly "robust" to the heteroskedasticity. But this doesn't do anything for the fact that our predicted values of y are nowhere near the real values of y.
You would probably not see this in an introductory stats class, but if you have at least a couple hundred observations, then you can use Maximum Likelihood Estimation (MLE). If you have a dataset smaller than this, then you have to stick with OLS.
MLE isn't a single model the way that OLS is, rather it is a process. For more on the what and why of MLE, consult your textbook or professor. Here, we're focusing on the how. For binary dependent variables, there are two common options: logit and probit. They both do pretty much the same thing and in Stata terms the code is almost identical to how we did OLS regression:
logit yvar xvarlist
probit yvar xvarlist
The interpretation is not the same as it was in OLS, but we can still tell whether or not an effect is significant in more or less the same way. If the p-value is smaller than 0.05, then the independent variable has a significant effect on the dependent variable. With logit/probit, we will avoid the heteroskedasticity problem that we had in OLS and we'll end up with more realistic predictions.
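To see the difference on the simulated data from the linear probability model section, you can compare predicted probabilities from a logit with the OLS fitted values. A minimal sketch (p_logit and p_ols are made-up names):
logit y x
* predicted probabilities, guaranteed to stay between 0 and 1
predict p_logit, pr
regress y x
* OLS fitted values, which have no such guarantee
predict p_ols, xb
twoway (scatter p_logit x) (scatter p_ols x)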
Technically speaking, for OLS to be appropriate, the dependent variable needs to be continuous. Practically speaking, we treat many variables as if they were continuous, as long as there's a good deal of variation. The more variation we have in the dependent variable, the less likely we are to violate the assumptions of OLS. Binary dependent variables were the extreme case of having only two possible values. But what if we had three possible values? Suppose we were studying political ideology and that it's possible for a person to say that they're either conservative (-1), moderate (0), or liberal (1). Logit and probit won't work here, because they only work with binary dependent variables. OLS also wouldn't be appropriate here for the same reason that we couldn't use it on a binary dependent variable. Instead, we can use ordered models: ordered probit or ordered logit. Their code is again almost identical to what we've done before, just the interpretation is different.
ologit yvar xvarlist
oprobit yvar xvarlist
In the above case of political ideology, we were able to give our categories an order that made sense: a person coded as 0 was more liberal than a person coded -1 and a person coded 1 was more liberal than a person coded 0. For other categorical variables, this is not the case. Just like when we talked about nominal independent variables earlier, for nominal dependent variables, we again need a different type of model: multinomial logit or multinomial probit.
mlogit yvar xvarlist
mprobit yvar xvarlist
If your dependent variable has k categories, then mlogit or mprobit will give you k-1 sets of results. In each set of results, the coefficients tell you how a one unit change in the independent variable changes the likelihood that the observation takes that particular value of the dependent variable. These are tricky to implement and interpret correctly, so I strongly advise that you read up on this type of model before using one in, say, a paper. Often, you can use a nicer alternative: recode your nominal dependent variable as a binary dependent variable and just use regular logit/probit instead. Returning to our earlier example of dining room paint color, suppose that we're actually interested in explaining why a restaurant chose to use red paint as opposed to any other color. We could use red (1 for red, 0 otherwise) as our dependent variable. If we also wanted to explain why a restaurant chose white instead, we could run a second model with white (1 for white, 0 otherwise) as our dependent variable. Alternatively, if we wanted to explain why a restaurant chose any color as opposed to white, we could code a new variable "color" (coded 1 for any color, 0 for white).
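For example, sticking with the earlier paint coding (1 white, 2 black, 3 neon yellow, 4 red), here is a sketch of the recoding approach, with x1 and x2 standing in for whatever independent variables your theory calls for:
* 1 if the restaurant chose red, 0 for any other color
generate red = (paint == 4) if !missing(paint)
logit red x1 x2
* 1 if the restaurant chose any color other than white
generate anycolor = (paint != 1) if !missing(paint)
logit anycolor x1 x2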
This has most certainly not been an exhaustive treatment of cross-sectional data, even if working through it was exhausting. For now though, let's transition to discussing time-series data.
This is data where the main variation occurs across time. For instance, suppose that you are looking at monthly unemployment data from 1950 to present. Working with this sort of variation is a bit trickier than working with cross-sectional data, which is why it is typically not taught in a first statistics course. If you are doing this within Stata and are using time-series for the first time, I recommend "Introduction to Time Series Using Stata" by Sean Becketti. It says "introduction," but you should understand at least OLS regression before you even attempt time-series.
Stata's default is to assume that you are working with cross-sectional data. If your main variation is across time, then that's fine, but you need to tell Stata first. You do this by "tsset-ing" your data. This tells Stata that you're working with time series data and what your units of time are. At the bare minimum, your dataset needs some variable that you're interested in (y) and a variable counting time (timevar). Using the tsset command will be one of the first things you do when you open a time-series dataset. The command takes the syntax
tsset timevar, timetype
where timetype can be clocktime, daily, weekly, monthly, quarterly, halfyearly, yearly, or generic.
For example, if our dataset consists of daily stock prices, then the command would be:
tsset timevar, daily
Once we do this, then Stata knows that we're working with time-series data. This enables you to use certain functions and models that are only appropriate for time-series. These are not the only time intervals that Stata can work with, they're just the ones for which the tsset command has built-in options. We can set up other intervals with the delta() option in tsset. Suppose that our observations occur every even numbered month of the year. If we did
tsset timevar, monthly
then it would look like we have missing data for every odd numbered month. But that's not the case. The data's not missing, it just shouldn't exist at all. We can get around this by tsset-ing with
tsset timevar, monthly delta(2)
"monthly delta(2)" tells Stata that this is monthly data, but it should only expect one observation every other month.
It would be nice if the above approach would work for all time variables, but it won't. The trouble is that real data we want to use is dated in ways that make sense to humans, not in ways that make sense to Stata. Fortunately, as long as your data is stored in some systematic fashion, there is almost always a way to get it into a form that Stata can use. For a full list of available options go to:
help datetime
and Stata will return a huge list of different ways that you can extract the date and/or time from your variable. Unfortunately, the list only contains the technical bits and doesn't directly include examples that might be useful to, say, someone who would need a help file for this. So, to get you started, here are a couple of examples.
Suppose that your time variable is called "date," is of date format DayMonthYear (for example 5Jan2019), and you have one observation every month. You could tsset this data with
generate datem = mofd(date)
tsset datem, monthly
The first step creates a new variable called datem. The mofd() function extracts the month part of the date and gives it a unique value based on how many months a particular date is away from January 1960. As far as Stata is concerned, that's the beginning of time (though some of my older colleagues would beg to differ), so if January 1960 is an observation in your dataset, then it will be coded as month 0. If your data goes further back than this, no worries. Stata can handle that just as easily; dates before 1960 simply get negative numbers. This really does not matter for our purposes.
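If your date variable instead arrives as text (say, a string variable called datestring holding values like "5Jan2019"), you first convert it to a Stata date with the date() function and then proceed as above. A sketch, assuming the values really are in day-month-year order:
* convert the string to a daily date, then to a monthly date
generate daten = date(datestring, "DMY")
format daten %td
generate datem = mofd(daten)
format datem %tm
tsset datem, monthly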
If our data is quarterly rather than monthly, we can do almost the same thing.
generate dateq = qofd(date)
tsset dateq, quarterly
The start of 1960 is again the beginning of time, but now qofd() will assign the number of quarters since then, instead of the number of months. Getting started with time-series data really isn't tough. It just seems a bit daunting because there are so many different ways to start before you can do anything fun.
Once you have tsset your data, Stata enables certain functions to make your life easier. Some of these are just simpler syntax on existing commands and some are new commands entirely. Many of these you could execute manually, but why bother, when Stata can do them easily and accurately? Among the possibilities are lags, leads, and differences. If we were working with daily stock market data, the "once-lagged" stock price would mean the stock price one day ago. If our variable was called "price," we could create a new variable manually:
generate laggedprice = price[_n-1]
But now, since Stata knows that we're working with time-series data, it knows that the information stored in the new variable "laggedprice" isn't really any different from what it already had stored in "price." So instead, we can just use the variable
L.price
at will, as if it was a variable we had taken the time to define, as we did with the generate command. If we want to go any number of days in the past, we can do the same thing by adding a number after the L:
L.price
L2.price
L3.price
A lead is just the opposite. It looks a certain number of periods into the future instead of into the past. We can do this with F. and it works the same way as L.
F.price
F2.price
F3.price
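These operators can be used anywhere a regular variable could go, not just inside regressions. For instance, here is a sketch computing a simple one-period growth rate (pricegrowth is a made-up name):
* percentage change from the previous period
generate pricegrowth = (price - L.price) / L.price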
Differences get a little more complicated, but only slightly. D.price would mean "the difference between today's price and yesterday's price." However, while you would think that D2.price would mean "the difference between today's price and the price two days ago," that's not what it means. Instead, it is the difference between the difference between today and yesterday and the difference between yesterday and the day before. This is the difference in differences and I think it's a little more clear if we look at the algebra.
D.price = price_{t} - price_{t-1} (note: this is math, not code)
D2.price = (price_{t} - price_{t-1}) - (price_{t-1} - price_{t-2}) = price_{t} - 2*price_{t-1} + price_{t-2}
If you do mean "the difference between today's price and the price two days ago," there's another function called "seasonal difference" that will do this. S. is the difference between today's price and the price a given number of days ago.
S.price
S2.price
S3.price
So while D.price and S.price give you the same thing, D2.price and S2.price would not.
As an example of why this is useful, suppose that we wanted to run an OLS regression with y as the dependent variable and up to five lags of x as the independent variables. All three of the following are exactly the same, as far as Stata is concerned.
regress y x L.x L2.x L3.x L4.x L5.x
regress y x L(1/5).x (note, there's no space between ")" and "." )
regress y L(0/5).x
Recall that in Stata notation (0/5) would mean "from 0 to 5." L(0/5).x tells Stata to include all lags of x from 0 to 5. The original variable x is the same as saying "the 0-th lag of x," though no sane human would actually say that. The L. function saves you from needing to generate five new variables and the (#/#) operator saves you from needing to type out the whole list of lagged variables.
This section requires you to have installed the freduse.ado package. See the above section on useful packages. The freduse.ado package lets you quickly and easily download macroeconomic data from the Federal Reserve Economic Data (FRED) repository. It's real data that's constantly updated. It takes the syntax:
freduse seriesname
where "seriesname" is the code that the Federal Reserve of St. Louis has used to store the particular series. For instance,
freduse UNRATE
would automatically download the entire seasonally adjusted monthly unemployment rate series. You can search through the FRED website to find what data is available, how often it is recorded, and how it is named for Stata purposes. Freduse lets you download more than one series at a time. You just need to add however many series you want to the list
freduse series1 series2 series3
Note that after you download the data, you will still need to tsset it. Starting in Stata 15, you get a graphic interface directly in Stata. Go to File--> Import --> Federal Reserve Economic Data and a new window will pop up letting you use keywords to find available data. This is nice if, say, your professor told you to use a specific series for your homework. If you need to understand what a variable means before you download it though, it's better to use the website.
This section requires you to have installed the fetchyahooquotes.ado package. See the above section on useful packages. Just like freduse, as described above, fetchyahooquotes is a command that lets you quickly get real data. This time though, it's stock price data and takes the syntax
fetchyahooquotes seriesname, freq(time)
where "time," can be "d" for daily, "w" for weekly, "m" for monthly, or "v" for dividends. The freq() option is required. The seriesname is usually the stock ticker symbol. For example
fetchyahooquotes IBM, freq(d)
would give you the entire series of daily closing (end of business day) prices for IBM. I say "usually," because sometimes Yahoo! Finance has its own symbol. For instance, the Yahoo! Finance symbol for the S&P 500 is ^GSPC, whereas Google Finance uses the symbol .INX for the same series. Therefore, you will have to go to Yahoo! Finance to make sure that you have the appropriate ticker symbol. After you download the series, you will have to tsset it.
NOTE! If you try to download a series and Stata displays the message "seriesname does not have sufficient number of observations," it's almost certainly telling you that there are exactly zero observations. The problem isn't lack of data, but rather that you have the ticker symbol wrong. If this happens to you, double-check Yahoo! Finance.
Just like with freduse, you can download multiple series in a single line of code by adding additional ticker symbols:
fetchyahooquotes series1 series2 series3, freq(d)
Once you have imported a dataset into Stata and tsset it, a good next step is to look at the pattern in your data. We'll do this with a line graph. You probably encountered line graphs in your childhood math and science classes. On the y-axis we have values of our variable of interest and on the x-axis we have time. In fact, in time-series analysis, time is on the x-axis so often that we sometimes even call it the "t-axis" instead. We can generate a line graph in Stata with the tsline (for "time-series line graph") command. This is one of several special types of graph that are only available to us after tssetting the data. It works as follows:
tsline varname
where "varname" is the name of the variable you want to show on the y-axis. Your time variable will automatically appear on the x-axis. After all, this is a time-series line graph, so there's nothing else we would want to have there instead. In some respects, the tsline command is even easier to work with than the cross-sectional kinds of graphs (or at least not any more difficult). If we want to show two series on the same set of axes, then we could write:
tsline var1 var2
Formatting the graph works pretty much the same way as it did with other types of graph that we have already discussed. Consult the help files for formatting options and examples.
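For instance, the usual graph options (titles, axis labels, legends) work here too. A quick sketch, with var1 and var2 again standing in for your own variables:
tsline var1 var2, title("Two series over time") ytitle("Level") xtitle("") legend(label(1 "First series") label(2 "Second series"))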
Time-series data is composed of different patterns. You'll work with each of these more in your class or text, and you'll likely spend a considerable amount of time learning why these are important. At a basic level:
Time series = trend + seasonality + cycle + noise/other
For different types of analysis, you may need to isolate or remove some of these from the series. For now, we'll talk about trend and seasonality.
Detrending Data
The trend is the overall path of the data upwards or downwards. It does not mean that the series always goes up or always goes down, just that in the data we have, it tended to do one of these more than the other. For instance, the trend in US GDP is that it has gone up over time. In many cases, we don't really care about the trend beyond the fact that it exists. We're more interested in why the series departed from this trend. Why do we have some quarters in which GDP was lower than the trend and why do we have some quarters in which GDP was higher than the trend? To answer this sort of question, we have to identify the trend and remove it. Suppose that we look at our variable "yvar" and notice what looks like a linear trend. We can detrend yvar with the following code:
regress yvar timevar
predict yvar_detrended, resid
The first step runs an Ordinary Least Squares regression with yvar as the dependent variable and our time variable as the independent variable. Recall that OLS plots the best fitting straight line through the data, so if there really is a linear trend in the data, OLS will give us both the slope of that line and its y-intercept. The second line "predict yvar_detrended, resid" creates a new variable "yvar_detrended," based on the results of the regression. The ", resid" uses the residuals from the regression to assign values. So, we have a new variable and each observation of that variable is the difference between the raw value and the OLS prediction (the residual). What this does is remove the linear trend from our original variable and leave us with the remaining information that we have yet to explain. With this new variable, we could start to look at why our variable was sometimes higher or sometimes lower than the linear trend.
Sometimes though, the trend will not be linear. For instance, it could be exponential. We can remove this sort of trend by taking a log-transformation first and then following the same process, just with our transformed variable instead.
generate logyvar = log(yvar)
regress logyvar timevar
predict logyvar_detrended, resid
Now your new variable "logyvar_detrended" would mean "how much above or below the log trend is our observation?" The same sort of approach can be used for other types of trends too, but the two trends discussed here are the most commonly seen ones at an early stage.
Deseasonalizing Data
Seasonality in data is a regular, repeated, predictable pattern that we observe in the data. For instance, in much of the world, unemployment rates tend to go up during winter months and down during summer months. This is because firms only need certain functions performed at certain times of year, so instead of hiring permanent employees for 12 months of the year, they hire temporary "seasonal" workers just when they're needed. You might think of shipping and logistics companies that hire additional workers for only November through January. Why do they do this? Because they expect customers to be sending many more packages around Christmas time than they do the rest of the year. So many employers do this sort of thing that it can show up in macroeconomic data. In reading a news story, you might hear the term "seasonally-adjusted, non-farm payrolls." What this means is "number of people who are employed, controlling for patterns that we know are there for reasons we don't really care about." If we were studying unemployment, we would be interested in why unemployment went up or down beyond the seasonal pattern that we see every year.
Suppose that our unemployment variable is recorded monthly, that we have a variable "month" coded 1-12 for the twelve months of the year, and that our data consists of several years. If we want to get rid of the seasonal pattern in unemployment, we can do so as follows
tabulate month, gen(monthdummy)
regress unemp monthdummy*, noconstant
predict unemp_seas_adj, resid
This should look familiar from earlier sections, because... it is. The first step generates twelve dummy variables: monthdummy1 (January), monthdummy2 (February), ..., monthdummy12 (December), each coded 1 if the observation comes from that particular month, 0 otherwise. The second runs a multiple regression model predicting unemployment using only the twelve month dummy variables. This model is run without a constant term, so that we can include all twelve dummies without running into the perfect multicollinearity problem you should be familiar with by now. The third step removes the variation in unemployment that results solely from the month of the year. What you're left with is the seasonally adjusted unemployment rate.
I should note that this is not the only way to seasonally adjust data and, in fact, the variable you get from this process would look strange if you plotted it. Go ahead and try it with the below code that runs you through the full process.
clear all
freduse UNRATENSA
rename UNRATENSA unemp_nsa
gen datem = mofd(daten)
tsset datem, monthly
gen month = month(daten)
tab month, gen(month_dummy)
regress unemp_nsa month_dummy*, noconstant
predict unemp_sa, resid
tsline unemp_nsa unemp_sa
Take a look at the time series line graphs this code generates. One series is labeled "Civilian Unemployment Rate." This is the raw variable. The other series is labeled "Residuals." This is the seasonally adjusted variable. You should notice two things about the seasonally adjusted variable. For one, the series is smoother. This is exactly what we wanted the deseasonalization to do, as we've gotten rid of the uninteresting fluctuations that occur from month to month. The other thing you'll notice is that the entire series has been shifted downward. What this seasonally adjusted variable is telling us right now is how much above or below the overall average unemployment rate we were in a particular month. This is not a problem, as in an analysis, what we actually care about is variation in the unemployment rate, not what the number was in a particular month. But if we want to be complete and really have a seasonally adjusted unemployment rate, we have one step left: re-benching. That is, we want to shift the adjusted series up so that it's comparable to the original variable, just without the seasonal fluctuations. All we need to do is add the average unemployment rate back into the seasonally adjusted variable.
summarize unemp_nsa unemp_sa
If you do this, you'll see the mean value of both series. As of January 2019 the values were 5.763028 for the raw variable and -1.09e-09 for the seasonally adjusted variable. These numbers will change depending on when you run the code. Remember that this is real data. Also remember that "...e-09" is Stata's way of saying "this number is really small, but I don't want to say that it's exactly zero." For our purposes, it's zero, so all we have to do is update the seasonally adjusted variable as follows:
replace unemp_sa = unemp_sa + 5.763028
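If you would rather not hard-code the mean (remember, the number changes every time FRED updates the series), you can let Stata pull it from the summarize results instead. A small sketch using the stored r(mean), as an alternative to the line above:
quietly summarize unemp_nsa
replace unemp_sa = unemp_sa + r(mean)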
Now take a look at another time-series line graph, to make sure that it worked.
tsline unemp_nsa unemp_sa
And yup, it worked. Practitioners have more sophisticated ways of seasonally adjusting data, so if you compare the variable we constructed here with the seasonally adjusted unemployment variable from FRED, it won't match 100%, but this is the general idea.
So now we've worked our way through a bit of the background on time-series data and you're probably thinking to yourself, "PLEASE, FOR THE LOVE OF RICK SANCHEZ, JUST SHOW ME HOW TO ESTIMATE A MODEL." Well, all right, okay, Morty. Okay. Geez. Here you go.
regress y x1 x2 x3
What the heck? I thought you said time-series was tough. That just looks like OLS. Let's try that regression model we ran in the previous section.
regress unemp_nsa month_dummy*, noconstant
I mean that worked. Unemployment is a continuous variable and the month dummy variables are okay. So what's wrong with that? EVERYTHING. Well, not everything, but a lot. OLS regression is built upon a series of assumptions, two of which are no autocorrelation and no heteroskedasticity. Using time-series data virtually guarantees that you'll have both in any model. The reason time-series is generally taught separately is that time-series models are built around solving these problems. We basically only use OLS in time-series for purposes of detrending, deseasonalizing, and showing how OLS is not the thing to do.
To visualize the extent to which autocorrelation is a problem, we can look at the autocorrelation function (ACF). We do this with the ac command, which takes the following syntax:
ac varname, lags(k)
Where "varname" is the name of our variable, and "lags(k)" tells Stata how far back to look. The ACF consists of a series of bars. Each bar represents the degree to which the value of the variable at time t is associated with the value of the same variable k periods ago. That's what autocorrelation is. OLS assumes that the value of the dependent variable today is not at all associated with the value of the dependent variable yesterday. Let's get a new dataset (the stock price for Disney) and see if this holds.
clear all
fetchyahooquotes DIS, freq(m) start(1Jan1985) end(1Nov2018)
gen datem = mofd(date)
tsset datem, monthly
ac adjclose_DIS, lags(100)
If there's no autocorrelation here, we should see a bunch of random-looking bars, centered around zero (on the y-axis). The grey region in the graph represents the 95% confidence interval. In order to be able to say that we do not have evidence of autocorrelation, we need fewer than 5% of the bars on this graph to be lying outside of the grey region. What we get is not at all random-looking and definitely more than 5% (closer to 30%) of the bars are outside of the confidence bands. Based on this, we would say that we have strong evidence of autocorrelation. Practically speaking, most raw time-series variables you encounter are going to have some degree of autocorrelation. With stock prices for instance, the price today is largely determined by the price yesterday. Stocks aren't randomly revalued every day; rather, they go up or down by (usually) just a small increment from one period to the next.
Let's try something else, with the same dataset. Instead of looking at the raw stock price, let's look at how much the variable went up (or down) from one month to the next. We can do this by using the difference function.
gen diff_DIS = D.adjclose_DIS
ac diff_DIS, lags(100)
This looks much more random and only 2 out of 100 bars lie outside the confidence interval. So we would say that we do not have evidence of autocorrelation for the differenced variable. This makes sense: just because a stock went up or down in one month does not mean that it will do the same thing the next month.
Newey-West
A first approach to addressing the problems of heteroskedasticity and autocorrelation is similar to the first approach we took to heteroskedasticity in linear probability models (OLS models with binary dependent variables). That is, we can try to correct our standard errors so that we can keep our original point estimates and still do hypothesis testing. With binary dependent variables, these were Heteroskedasticity Consistent Standard Errors (AKA "robust" standard errors). Now, because we have both heteroskedasticity and autocorrelation, we need something of a stronger fix, namely Newey-West standard errors.
newey yvar xvarlist, lag(k)
The beta estimates from Newey-West are generated using OLS. If you had instead typed:
regress yvar xvarlist
you would get the same beta estimates. But now we're using a fancier procedure to correct for heteroskedasticity and autocorrelation. The option "lag(k)" tells Stata how far back you want this procedure to go. If you looked at the autocorrelation function and saw that the first four bars were outside of the confidence interval, you would do
newey yvar xvarlist, lag(5)
This is the simple version. An alternative approach (one more commonly seen outside of social science) is to set the number of lags equal to the fourth root of the number of observations. That is, if we have n observations, we would set k = n^(1/4). For example, with 1,000 observations that works out to about 5.6, which you would round up to 6. This seems to be more of a guideline than a fixed rule. My advice would be to choose whichever of the two is more conservative (larger), though it's not likely to affect your results more than minutely.
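If you want Stata to work out the fourth-root rule for you rather than doing the arithmetic by hand, here is a minimal sketch (yvar and xvarlist are placeholders, as above):
* count the usable observations, then take the fourth root and round up
quietly count if !missing(yvar)
local k = ceil(r(N)^(1/4))
newey yvar xvarlist, lag(`k')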
Transformation
A second approach to dealing with autocorrelation is to remove it algebraically by including lagged dependent and independent variables. Suppose that our original model would have been
regress y x
but we know that this model suffers from autocorrelation of degree one. That is, the error term at time t is correlated with the error term at time t-1, but not t-2, t-3, or beyond. If we instead run the model
regress y L.y x L.x
we're still able to use our original regression model, but with the autocorrelation removed by the (definitely not) magic of algebra. If we had autocorrelation of degree two, meaning the error term at time t is correlated with the error terms at times t-1 and t-2, but not t-3, t-4, or beyond, then we could run the model:
regress y L.y L2.y x L.x L2.x
See if you can convince yourself that this works via either algebra or logic. Me, I think my caffeine levels are too low. Maybe I'll come back to this later.
Feasible Generalized Least Squares (FGLS)
Still another solution is to apply FGLS. Solutions (1) and (2) both used OLS, but applied some sort of fix to get around the problem that we were violating a key Gauss-Markov assumption. FGLS is a different estimation strategy and involves two stages: first it estimates the model by OLS and uses the residuals to estimate the structure of the error covariance (for example, the degree of autocorrelation), then it re-estimates the model by GLS using that estimate. The available methods tend to only work well for large samples; for small or medium sized samples, FGLS may be either inefficient or inconsistent. In those cases, most researchers would use OLS with solution (1) or (2). I'll introduce two of the methods here, which are so similar that Stata doesn't even give them unique commands. The Prais-Winsten method:
prais y x
and the Cochrane-Orcutt method:
prais y x, corc
Can you imagine how depressing it must be to work through all of the math involved in inventing a new method, only to have Stata just code your contribution as an option on another command? Poor Cochrane and Orcutt. At least their names got abbreviated together. Winsten just got left out entirely. Maybe my including a discussion of them here is some consolation. Probably not.
Of the components of time-series (trend, seasonality, cycle, and noise/other), so far we have discussed trend and seasonality. This is an introductory treatment of time series, so we're not going to talk much about cycles, as their irregularity makes them more complicated to isolate. The remaining component is "noise/other." This consists of short term fluctuations that are not part of any of the other components. Some of this variation is just meaningless for our purposes. That is, it's white noise. It could be measurement error (most macroeconomic indicators are estimates rather than precise values), small idiosyncrasies, or otherwise random (or "as if" random) variation. We generally don't care about this fluctuation, so we may want to remove it from the series. That's what smoothers (AKA filters) are for. You'll talk more about the theory of smoothers in class or read more about them in your text. Here, we're just going to focus on function. First, let's simulate some data
clear all
set obs 10000
set seed 529
gen whitenoise = rnormal(0,1)
gen x = _n
tsset x
ac whitenoise, lags(1000)
tsline whitenoise
Our new dataset consists of 10,000 observations of a randomly drawn variable "whitenoise" and a generic time variable. Because the whitenoise variable is randomly drawn, the ACF and time-series line graph show that there's no evidence of autocorrelation here, nor is there trend, seasonality, or cycle. Makes sense, right? A smoother tries to remove as much of the white noise as possible, while leaving the theoretically interesting bits intact. There's nothing theoretically interesting in this whitenoise variable, so we *should* see just a flat line from any of them. Stata can implement several common time-series smoothers with the tssmooth function. It takes the syntax:
tssmooth type [storage type] newvar = oldvar [, options]
In words, after telling Stata that you're implementing a smoother, you have to tell it what kind of smoother, (sometimes) how you want the variable to be stored, the new variable name, the original variable name, and then any additional options you might need.
Moving average smoother
tssmooth ma masmoothed = whitenoise, window(2 1 3)
The ",window(2 1 3)" option tells Stata that in calculating the moving average, you want to include two lagged observations, the current observation, and three leading observations. You can change these numbers, but it will change the result.
Exponential smoother
tssmooth exponential double expsmoothed = whitenoise
Double exponential smoother
tssmooth dexponential double dexpsmoothed = whitenoise
Nonseasonal Holt-Winters smoother
tssmooth hwinters hwsmoothed = whitenoise
Now let's see what these filters actually produced. Remember that they *should* produce a flat line and since our randomly drawn variable was centered around 0, that flat line should also be at zero. Basically, if we look at a time-series line graph, we should see nothing.
tsline whitenoise masmoothed expsmoothed dexpsmoothed hwsmoothed
So what happened? All four of the smoothers got rid of some of the white noise, but some were more successful than others. All of them are more tightly centered around 0 than the raw variable. It actually looks like the exponential, double exponential, and non-seasonal Holt-Winters smoothers worked perfectly. The lines look flat, but let's zoom in and just look at one at a time.
tsline expsmoothed
tsline dexpsmoothed
tsline hwsmoothed
Do they look perfectly flat anymore? No. In fact, what they left looks theoretically interesting, even though we know there's nothing actually there, and each produced a different result. Why did this happen? This happened because each method uses a different procedure to isolate and remove the white noise. The procedures are based on parameters and are sensitive to the values you choose for those parameters. We didn't tell Stata particular values to use, so it chose for us (you can set the parameters yourself as an option at the end of the command). Take another look at the first time-series line graph with all four smoothers. In the legend of the graph, you'll see some numbers in addition to the variable names. These are the parameters that Stata chose without asking us. The takeaway? Make sure you understand what these smoothers are doing and what the parameters mean before you implement them.
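If you want control over those parameters rather than letting Stata pick them, you can set them yourself. For the exponential smoother, for example, the smoothing parameter goes in the parms() option; the 0.2 below is an arbitrary value chosen purely for illustration:
tssmooth exponential expsmoothed2 = whitenoise, parms(0.2)
tsline expsmoothed expsmoothed2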
In cross-sectional data and in time-series data, our main variation is either across space or time, while holding the other constant (or at least almost constant). If we have important variation across both, then what we have is panel or time-series-cross-sectional (TSCS) data. Suppose that we have a nationally representative survey, but that we repeat the same survey once every year. If the survey goes to the same people every year, then this is panel data. If it is a fresh sample of different people every year, then this is TSCS data. These types of data tend to be discussed together or at least in the same course, as they present very similar challenges.
(to be continued)