econometrics21
BUS444E Econometrics for Business Research Spring 2021
Useful information about how to read data of different formats / file types into R here.
Pages with links to data from published papers, mostly in Stata .dta format:
Project: replication of Acemoglu, Johnson and Robinson (2001):
Lecture 1
Homework / preparation:
install R
install RStudio
read data into R
estimate a simple regression model
Links:
data set
description of data
Lecture 2
Homework / preparation:
Task 1
Load the data file maketable2.dta into R. This is a Stata data file (with extension .dta). Do this by going through the same steps as in last week's instructions from point 7 onwards, with names of files and directories modified appropriately. However, since the data is not a .csv file, replace step 11 with the following two lines
>library(foreign)
>data1 = read.dta("maketable2.dta")
(Useful link for importing different data formats)
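To check that these import steps behave as expected before pointing them at the real file, the sketch below writes a small made-up data frame to a temporary Stata file and reads it back with read.dta. The data frame toy and its values are invented for illustration and are not taken from maketable2.dta. The foreign package documents read.dta as reading Stata versions 5 to 12, so if a newer .dta file fails to load, read_dta from the haven package (which must be installed separately) is a possible alternative.

```r
library(foreign)  # a recommended package, normally shipped with R

# Hypothetical toy data, invented for this example only
toy <- data.frame(y = c(7.1, 8.3, 9.0),
                  x = c(1.5, 2.0, NA))

# Write to a temporary Stata file, then read it back
tmp <- tempfile(fileext = ".dta")
write.dta(toy, tmp)
data1 <- read.dta(tmp)

str(data1)             # variable names and types
nrow(data1)            # number of observations
colSums(is.na(data1))  # missing values per variable
```

Running the same three inspection lines on the real data1 is a quick way to confirm that the variables you need are present.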
Replicate column 1 of Table 2 in Acemoglu, Johnson and Robinson (2001). That is, regress logpgp95 on avexpr. (When I do this, I get almost but not exactly the same result as in the paper. For some reason the paper uses 110 observations, while 111 remain after dropping missing observations when I do it. I am not sure which additional observation the authors drop.)
Think about whether the coefficient on avexpr is the causal effect of average expropriation risk on income. Is avexpr endogenous? It might be worth reading (or skimming) pp. 1369-1380 for some background.
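Since the number of observations matters here, it may help to know that lm by default drops every row with a missing value in any model variable, and that nobs reports how many rows were actually used. A minimal sketch on simulated data (the numbers are randomly generated and only borrow the variable names logpgp95 and avexpr; they are not the paper's data):

```r
set.seed(1)
n <- 20
sim <- data.frame(avexpr = runif(n, 3, 10))
sim$logpgp95 <- 5 + 0.5 * sim$avexpr + rnorm(n)
sim$avexpr[c(2, 7)] <- NA   # make two regressor values missing
sim$logpgp95[11] <- NA      # and one outcome value

fit <- lm(logpgp95 ~ avexpr, data = sim)
summary(fit)                # coefficients, standard errors, R-squared

nobs(fit)                                             # observations used by lm
sum(complete.cases(sim[, c("logpgp95", "avexpr")]))   # same count by hand
```

With the real data, replacing sim by data1 in the last two lines gives the sample size to compare with the one reported in the paper.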
Replicate Figure 2 in the paper with a simple scatter plot (circles instead of country name abbreviations).
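A simple scatter plot of this kind can be sketched with base R's plot, which draws open circles by default. The data below are simulated and only reuse the variable names, and the axis label wording is my own rather than the paper's. The abline call, which overlays the fitted OLS line, is optional.

```r
set.seed(2)
sim <- data.frame(avexpr = runif(30, 3, 10))
sim$logpgp95 <- 5 + 0.5 * sim$avexpr + rnorm(30, sd = 0.7)

# Scatter plot: x variable first, then y variable
plot(sim$avexpr, sim$logpgp95,
     xlab = "Average expropriation risk (avexpr)",
     ylab = "Log GDP per capita, 1995 (logpgp95)")

# Optional: add the fitted regression line on top of the points
fit <- lm(logpgp95 ~ avexpr, data = sim)
abline(fit)
```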
Task 2
Load the data jtrain2.csv into R.
This is a data set from an experiment in which men with difficult labour market histories were randomly assigned either to undergo a job market training program (train=1) or not (train=0) before 1978. The variable re78 shows their real income in 1978, measured in thousands of dollars.
How many of the people in the sample participated in the training program and how many did not?
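One way to count the observations in each group is table. The sketch below uses a simulated 0/1 variable named train (made-up values, not jtrain2.csv). The useNA argument makes any missing values show up as their own category instead of being silently left out.

```r
set.seed(3)
sim <- data.frame(train = rbinom(50, 1, 0.4))  # simulated 0/1 indicator

counts <- table(sim$train, useNA = "ifany")
counts              # number of observations with train = 0 and train = 1
prop.table(counts)  # the same as shares of the sample
```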
Regress re78 on train. What is the interpretation of the coefficient on train? To get an idea of the magnitude, calculate the mean value of re78 in the sample by typing
>mean(data1$re78) # here it is assumed that you named the data frame data1 when importing the data
Find the mean of re78 for (a) all observations, (b) those who participated in training, and (c) those who did not. A good way to do this that avoids problems with missing values is the following, where re78 is the variable we want the mean of and train is the variable that defines the groups:
>aggregate(re78~train, data=data1, mean)
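To see what the missing-value remark means in practice, here is a small sketch with six made-up observations, one of which has a missing re78. A plain mean returns NA unless na.rm = TRUE is set, while the formula version of aggregate drops incomplete rows by default.

```r
# Hypothetical toy data, invented for this example only
sim <- data.frame(train = c(1, 1, 1, 0, 0, 0),
                  re78  = c(4, NA, 6, 1, 3, 2))

mean(sim$re78)                # NA, because one value is missing
mean(sim$re78, na.rm = TRUE)  # mean of the five observed values

# Group means; the row with the missing re78 is dropped automatically
by_group <- aggregate(re78 ~ train, data = sim, mean)
by_group
```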
What is the difference between the mean for train=1 and the mean for train=0? How does this compare to the OLS coefficient on train that you found above?
Is the R-squared small or large? Why? Does this tell us anything about whether the estimated coefficient on train represents a causal effect or not?
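The numbers needed for the last two questions can be pulled out of the fitted model and the group means directly, so that they can be compared side by side. The sketch below uses simulated data that only borrows the names re78 and train. The object names b_train, diff_means and r2 are my own choices.

```r
set.seed(4)
sim <- data.frame(train = rep(c(0, 1), times = c(30, 20)))
sim$re78 <- 4 + 2 * sim$train + rnorm(50, sd = 6)  # simulated outcome

fit <- lm(re78 ~ train, data = sim)
b_train <- unname(coef(fit)["train"])   # OLS coefficient on train

# Difference between the two group means
group_means <- aggregate(re78 ~ train, data = sim, mean)
diff_means <- group_means$re78[group_means$train == 1] -
  group_means$re78[group_means$train == 0]

r2 <- summary(fit)$r.squared            # R-squared of the regression

c(coef_train = b_train, diff_means = diff_means, r_squared = r2)
```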
Task 3