OLS

1. Interactive Visualization

http://setosa.io/ev/ordinary-least-squares-regression/

STATA Tutorial 1

1-1 STATA Interface

Once you have started Stata, you will see a large window containing several smaller windows. You type commands in the Command window. Each command is echoed in the Review window, and clicking a command there brings it back to the Command window so that you can run it again. Once a dataset is open, its variable names appear in the Variables window. This is especially helpful when variable names are complicated, because you can click a variable name to enter it into the Command window instead of typing it out. Finally, results appear in the Results window. Whenever the output from a command exceeds one page, you will see the word more at the bottom of the Results window. Hit the spacebar to see the next page of output.

1-2 Stata Data Files

First, once you have started Stata, you need to load your data, either a Stata data file (.dta) or an Excel file. The easiest way is to go to File -> Open and find the data on your computer. For Excel files you may need to specify the row and column where the data start, because Excel files sometimes come with description rows at the top that we want to drop.

Another way is to type commands into the Stata Command window.

In this document, Stata commands are written in Courier font to distinguish them from the surrounding text.

If your dataset is a .dta file, use the following command to load it (change the path to wherever the data file is stored on your computer):

use "G:\Econ382\data sets\wage.dta" [Hit Enter]

or, if your dataset is an Excel file, try this (replace the sheet name inside sheet() with the name of the worksheet that holds your data):

import excel "G:\Econ382\data sets\wage.xlsx", sheet(".") firstrow

NOTE 1: Stata is case sensitive. So use is not the same as Use.

NOTE 2: The bold courier font is for Stata command. The italic courier font is for filenames that you need to modify for your use.

The command above imports a data file (in this case wage.xlsx). "G:\Econ382\data sets" is the full path of the folder where the data file wage.xlsx is located on your computer. If you are not sure about your data file’s full path, right-click the data file and choose ‘Properties’.

use "G:\Econ382\data sets\wage.dta",clear [Hit Enter]

This command is the same as the previous one except that it includes the option clear, which removes the dataset currently held in memory before loading the new one. In most cases this is necessary: by default Stata works with one dataset in memory at a time, and it refuses to load a new file over data with unsaved changes. Clearing old data also frees your computer’s limited memory. Alternatively, type “clear” in the Command window, which drops all data loaded so far, and then load the new data file.
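The effect of the clear option can be tried out with a short sketch. It assumes the auto dataset that ships with Stata as a stand-in for wage.dta, and price_thousands is a hypothetical variable name created only to put unsaved changes in memory.

```stata
* Sketch: why the clear option matters.
* Assumption: the bundled auto dataset stands in for wage.dta.
sysuse auto, clear

* Change the data in memory so that it no longer matches the saved file
generate price_thousands = price / 1000

* Without clear, Stata refuses to replace data that has unsaved changes.
* capture traps the error so that the remaining lines still run.
capture sysuse auto
display "return code without clear: " _rc

* With clear, the modified data are discarded and the file loads
sysuse auto, clear
```

A nonzero return code on the display line indicates that the load without clear was refused.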

Once you have imported the data file, you can view the data by clicking the “Data Browser” icon.

[Figure 1-1] Data Viewer Icon

1-3 Basic Stata Commands

Before studying how to run the regressions, in this section, we will study some basic commands in Stata.

To describe the dataset, try this:

describe [Hit Enter]

(or for short, des)

Below is the result:

To summarize the variables, use:

summarize [Hit Enter]

(or for short, sum)

Below is the result:

twoway scatter Ahe Age [Hit Enter]

This command draws a scatter plot with Ahe (average hourly earnings) on the vertical axis and Age on the horizontal axis. It may take several minutes, depending on your computer’s performance and the size of the data.
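The three commands in this section can be tried without the course data. The sketch below assumes Stata's bundled auto dataset, with price and mpg standing in for Ahe and Age.

```stata
* Sketch: describe, summarize and plot a dataset.
* Assumption: the bundled auto dataset stands in for wage.dta.
sysuse auto, clear

* List variable names, storage types and labels
describe

* Report observations, mean, standard deviation, min and max
summarize

* The first variable goes on the vertical axis, the second on the horizontal
twoway scatter price mpg
```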

1-4 The simple OLS Regression

regress Ahe Age [Hit Enter]

(or for short, use reg instead of regress)

This command reports the results of an OLS regression with Ahe as the dependent variable and Age as the independent variable:

Here is the result.

In this output, 1.05 is the estimate of the slope and -10.78 is the estimate of the intercept; 0.345 and 10.42 are the standard errors of the slope and intercept estimates, respectively. The R-squared, 0.0864, is reported too. The output also gives the t-statistic for testing the null hypothesis that each coefficient equals 0, along with the 95% confidence interval for each coefficient.
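Each t-statistic is the estimate divided by its standard error; for the slope above that is 1.05 / 0.345, or about 3.04. The sketch below shows one way to reproduce these quantities from the results Stata stores after regress. It assumes the bundled auto dataset, with price and mpg standing in for Ahe and Age.

```stata
* Sketch: recompute the reported statistics from stored results.
* Assumption: the bundled auto dataset stands in for wage.dta.
sysuse auto, clear
regress price mpg

* Slope estimate, its standard error, and the t-statistic for H0: slope = 0
display "slope estimate: " _b[mpg]
display "standard error: " _se[mpg]
display "t-statistic:    " _b[mpg] / _se[mpg]

* 95% confidence interval: estimate +/- critical value * standard error,
* where the critical value uses the residual degrees of freedom
display "95% CI lower:   " _b[mpg] - invttail(e(df_r), 0.025) * _se[mpg]
display "95% CI upper:   " _b[mpg] + invttail(e(df_r), 0.025) * _se[mpg]

* R-squared as shown in the header of the regression output
display "R-squared:      " e(r2)
```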

1-5 Generate new variables

In most cases, we don’t have all the variables we need to use in the original data set. What we need to do is to generate some new variables.

Use the command generate [new variable name]=f(existing variable)

Example: generate logage=log(Age)

Here logage is the new variable that holds log(Age), the natural logarithm of Age, for each observation.

The log transformation is not suitable when the variable contains 0s or negative values, because the logarithm is undefined there. In that case, Stata sets the new variable to missing for those observations.

Note: the new variable name must not already exist in the dataset. Otherwise, Stata will report an error.

If this happens, first drop the existing variable with the command: drop [variable name you want to use]

Example:

drop logage

generate logage=log(Age)

After you get the new variable, you can use this variable whenever you want. For example:

regress Ahe logage [Hit Enter]
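Both points above, missing values from log() and the error for an existing variable name, can be seen in a small sketch. The three-observation dataset and the names x and logx are made up for illustration.

```stata
* Sketch: log() of zero becomes missing, and generate will not overwrite.
* Assumption: x and logx are hypothetical names in a toy dataset.
clear
set obs 3

* x takes the values 0, 1, 2
generate x = _n - 1

* log(0) is undefined, so the first observation of logx is missing
generate logx = log(x)
list x logx
count if missing(logx)

* Generating logx again fails because the name is already in use.
* capture traps the error so that the remaining lines still run.
capture generate logx = log(x)
display "return code when the name already exists: " _rc

* Drop the old variable first, then generate it again
drop logx
generate logx = log(x)
```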

1-6 help function

If you don’t know what a command does, you can always use the help function. For example, to get a full description of the command “regress”, simply type:

help regress [Hit Enter]

1-7 do file

Stata has a built-in text editor for “do files”. A do file is like scratch paper where you keep your code for future use. If you work on a big project that needs hundreds of lines of code, you should keep a do file as a record. Code in a do file can be executed directly by Stata: just select the lines you want to run and click Execute (do).
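As an illustration, the commands from this tutorial could be collected in a do file along the following lines. The file name wage_analysis.do is hypothetical, and the sketch assumes that wage.dta sits at the path used earlier and contains the variables Ahe and Age.

```stata
* wage_analysis.do (hypothetical file name)
* Assumption: wage.dta is at this path and contains Ahe and Age.

* Load the data, discarding anything already in memory
use "G:\Econ382\data sets\wage.dta", clear

* Inspect the dataset
describe
summarize

* Simple OLS regression of earnings on age
regress Ahe Age

* Create log(Age); capture drop avoids an error if logage already exists
capture drop logage
generate logage = log(Age)
regress Ahe logage
```

Lines that begin with * are comments, which Stata ignores. Besides the Execute (do) button, a saved do file can be run by typing do followed by the file name in the Command window.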

STATA Tutorial 2

2-1 Summarize the variables you want only

As before, first open the dataset, and then use the command “summarize” followed by the variables you want:

summarize variable1 variable2 [Hit Enter]

For example, our dataset “wage” has 5 variables: ahe is average hourly earnings; year is the sample year; bachelor is a dummy that takes the value 1 if the person has a college degree and 0 otherwise; female is a dummy that takes the value 1 if the person is female and 0 if male; and age is the person’s age.

Since year always equals 2008 (the data are from census 2008), there is no need to summarize it, so we can type

summarize ahe bachelor female age [Hit Enter]

Below are the results:

You can get even more detail by adding the option “, detail” after the command. For example,

summarize ahe, detail
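The statistics that summarize prints are also stored, so they can be reused in later commands. The sketch below assumes the bundled auto dataset, with price standing in for ahe.

```stata
* Sketch: reuse the statistics stored by summarize, detail.
* Assumption: the bundled auto dataset stands in for the wage data.
sysuse auto, clear
summarize price, detail

* Stored results remain available until the next command replaces them
display "observations: " r(N)
display "mean:         " r(mean)
display "median:       " r(p50)
display "std. dev.:    " r(sd)
```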

2-2 OLS Regression with Multiple Regressors

Linear regression with multiple regressors:

We can run the OLS by the following STATA command:

regress Y X1 X2 … Xk

For example, if we want to regress average hourly earnings on bachelor, female and age, we simply type

regress ahe bachelor female age [Hit Enter]

Here is the result:

These results are interpreted in the same way as in the simple regression model. The only difference is that there are multiple slope estimates: each one is the number reported next to that variable’s name. You also get the R-squared, the adjusted R-squared, and the SER (denoted as Root MSE).

You can run whatever model you are interested in. For example, suppose you want to estimate the model: log(ahe) = β0 + β1 age + β2 age^2 + β3 bachelor + u

First generate the new variables, then run the regression:

generate log_ahe=log(ahe)

generate age2=age^2

regress log_ahe age age2 bachelor
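In a model with a squared term, the effect of the regressor is not a single coefficient: differentiating the model above with respect to age gives β1 + 2·β2·age. The sketch below fits an analogous model and evaluates that expression at one point. It assumes the bundled auto dataset, with price, weight and foreign standing in for ahe, age and bachelor; the evaluation point of 3000 is arbitrary.

```stata
* Sketch: log-quadratic regression and the implied marginal effect.
* Assumption: the bundled auto dataset stands in for the wage data.
sysuse auto, clear

* Build the transformed variables, as in the wage example
generate log_price = log(price)
generate weight2 = weight^2

regress log_price weight weight2 foreign

* Estimated effect of one more unit of weight on log(price),
* evaluated at weight = 3000 (an arbitrary illustrative value)
display "marginal effect at weight = 3000: " _b[weight] + 2 * _b[weight2] * 3000
```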

2-3 The help command and abbreviations

You may use the help function if you need more detailed information about a specific command. For example, if you want to know what you could do with the command “regress”, you can simply type:

help regress

You will see something like this:

In the Syntax section, “reg” is underlined. That means you can type “reg” for short instead of the full command “regress”, and Stata will recognize it.

vce specifies what kind of standard errors you want Stata to report. Remember that in class we discussed that Stata by default reports homoskedasticity-only standard errors for each coefficient. If we want heteroskedasticity-robust standard errors, we can request them in the regress command with the robust option, which is equivalent to vce(robust). For example, type:

regress ahe bachelor female age, robust [Hit Enter]

The result looks like this:

You can see that the coefficients are exactly the same as before. However, the standard errors change, and this also affects the t-statistics, p-values and confidence intervals.

For your reference, here is the previous result, with homoskedasticity-only standard errors.
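The two sets of standard errors can also be placed side by side in one table. The sketch below assumes the bundled auto dataset as a stand-in for the wage data; default_se and robust_se are hypothetical names for the stored results.

```stata
* Sketch: compare default and robust standard errors in one table.
* Assumption: the bundled auto dataset stands in for the wage data.
sysuse auto, clear

* Default (homoskedasticity-only) standard errors
regress price mpg weight
estimates store default_se

* Heteroskedasticity-robust standard errors
regress price mpg weight, vce(robust)
estimates store robust_se

* Coefficients (b) match across columns; standard errors (se) differ
estimates table default_se robust_se, b se
```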
