R Code
Section Contents
Why R is awesome for research
R code to estimate standard errors
R code to fit exactly-identified GMM model
R code to replicate Fama-French factors (size, value) plus momentum
Other miscellaneous R code (e.g. fit VAR model)
In Praise of R and RStudio
I have tried many statistical software programs (e.g. Matlab, Stata, SAS, SPSS) and have settled on R for my research needs. Specifically, RStudio allows me to complete all my research steps within a single program. It is also open source, free, and has a huge community of users who answer questions on StackOverflow. All of the following research steps can be done within it:
Import Data: Read data files from other programs. Directly access data from WRDS (no need to use SAS). Directly download data from Quandl and other online databases.
Analyze Data: Run almost any statistical routine you'll ever need. Cutting edge techniques are usually first made available through R packages.
Plot Data and Create Tables: Generate all your figures and tables directly inside the program. R is renowned for its wide range of plotting capabilities. There are also packages that convert summary statistics or regression output directly into tables. Never copy and paste again.
Write Text and Equations: RStudio supports RMarkdown, an easy way to write text and code that generates beautiful LaTeX output. You can write equations and even run code to embed specific numbers in your text. There's no need for MS Word, though you still have the option to generate Word documents. Compile PDF files directly.
Use Version Control: Track changes across versions of your programs and allow for easy online collaboration. RStudio has version control built in, so there's no need for a separate program. Supports both Git and SVN.
Add References: Compile one bibliography and use it for all your papers (e.g. can use BibTeX and auto-generate it from Mendeley or copy and paste from Google Scholar). Use shorthand to cite papers within the text of your paper, and it will show up properly formatted. Auto-compile a References list at the end of the paper that includes all cited papers.
Make Presentations: Supports Beamer. Use the same code for equations, plots, and tables you've already written. Saves time and ensures all output is consistent between your paper and presentation.
The final result allows me to click a single button, and a PDF of my paper or presentation can be generated by re-running all the analyses and re-generating all the figures and tables from scratch. RStudio does all of this, so there's no need to use any other program.
Code for Standard Errors
# R approaches to replicate Petersen '09
# Rmd file and edits kindly provided by Robert McDonald
library(foreign)
df.petersen <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")
# OLS
summary(lm(y~x, data=df.petersen))
# White
library(sandwich)
library(lmtest)
model <- lm(y~x, data=df.petersen)
coeftest(model, vcovHC(model, type="HC0"))
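For intuition, the HC0 sandwich can also be written out by hand in base R. This is a sketch with simulated toy data so it runs standalone; substitute the Petersen `model` from above in practice:

```r
# HC0 "by hand": (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- 1 + 2 * d$x + rnorm(200)
model.toy <- lm(y ~ x, data = d)
X <- model.matrix(model.toy)
e <- resid(model.toy)
bread <- solve(crossprod(X))            # (X'X)^{-1}
meat  <- crossprod(X * e)               # X' diag(e^2) X
se.hc0 <- sqrt(diag(bread %*% meat %*% bread))
se.hc0  # should match coeftest(model.toy, vcovHC(model.toy, type="HC0"))
```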
# Clustered
library(lfe)
summary(felm(y ~ x, data=df.petersen)) # same as OLS
summary(felm(y ~ x, data=df.petersen), robust=TRUE) # same as White
summary(felm(y ~ x | 0 | 0 | firmid, data=df.petersen)) # cluster by firm only
summary(felm(y ~ x | 0 | 0 | year, data=df.petersen)) # cluster by year only
summary(felm(y ~ x | 0 | 0 | firmid + year, data=df.petersen)) # cluster by both year and firm
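One-way cluster-robust variance is the same sandwich with score contributions summed within each cluster first. A sketch with toy data (no small-sample corrections, so felm's reported SEs will differ slightly):

```r
# Cluster-robust by hand: sum x_i * e_i within each firm, then sandwich
set.seed(1)
d <- data.frame(firm = rep(1:50, each = 4), x = rnorm(200))
d$y <- 1 + 2 * d$x + rnorm(50)[d$firm] + rnorm(200)  # firm-level shock
m  <- lm(y ~ x, data = d)
X  <- model.matrix(m)
e  <- resid(m)
Xe <- rowsum(X * e, d$firm)                 # cluster-summed scores
bread <- solve(crossprod(X))
V.cl  <- bread %*% crossprod(Xe) %*% bread
sqrt(diag(V.cl))                            # firm-clustered SEs (no dof adjustment)
```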
# Fama-Macbeth
library(plm)
summary(pmg(y ~ x, data=df.petersen, index=c("year","firmid"))) # note: year is listed first so the Mean Groups estimator averages per-year cross-sectional regressions (i.e. Fama-MacBeth); NA's must also be removed first
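Fama-MacBeth is also simple enough to do by hand: run one cross-sectional regression per period, then treat the coefficient estimates as a time series. A sketch with a simulated toy panel (swap in df.petersen with its year column in practice):

```r
# Fama-MacBeth by hand: cross-sectional regression per year,
# slope = mean of the yearly slopes, SE = sd of slopes / sqrt(T)
set.seed(1)
panel <- data.frame(year = rep(1:10, each = 50), x = rnorm(500))
panel$y <- 1 + 2 * panel$x + rnorm(500)
betas <- sapply(split(panel, panel$year),
                function(d) coef(lm(y ~ x, data = d))["x"])
fm.coef <- mean(betas)                       # Fama-MacBeth slope
fm.se   <- sd(betas) / sqrt(length(betas))   # Fama-MacBeth standard error
c(fm.coef, fm.se)
```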
# Newey-West
# The Petersen data are a panel, not a time series, so Newey-West isn't really appropriate here, but they work as an example
library(sandwich)
library(lmtest)
model <- lm(y~x, data=df.petersen)
coeftest(model, NeweyWest(model, lag = 1, prewhite = FALSE)) # Newey-West '87 1 lag
# 0 lag should be same as White
coeftest(model, NeweyWest(model)) # Newey-West '94 auto-lag
floor(bwNeweyWest(model)) # provides lag parameter used by Newey-West '94
coeftest(model, NeweyWest(model, lag = floor(bwNeweyWest(model)), prewhite = TRUE)) # confirm same results
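The Bartlett-weighted HAC sandwich is short enough to write out, which makes the downweighting of distant lags explicit. A sketch in base R with simulated AR(1) errors (no prewhitening or finite-sample adjustment, so it should line up with NeweyWest(model, lag = L, prewhite = FALSE)):

```r
# Newey-West by hand: V = (X'X)^{-1} [S'S + sum_l w_l (G_l + G_l')] (X'X)^{-1},
# where S holds score contributions x_i * e_i and w_l = 1 - l/(L+1)
nw.by.hand <- function(model, L) {
  X <- model.matrix(model)
  S <- X * resid(model)
  n <- nrow(S)
  meat <- crossprod(S)                 # lag-0 term (= White meat)
  for (l in seq_len(L)) {
    w <- 1 - l / (L + 1)               # Bartlett weight
    G <- crossprod(S[1:(n - l), , drop = FALSE], S[(l + 1):n, , drop = FALSE])
    meat <- meat + w * (G + t(G))
  }
  bread <- solve(crossprod(X))
  bread %*% meat %*% bread
}
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- 1 + 2 * d$x + as.numeric(arima.sim(list(ar = 0.5), 200))
sqrt(diag(nw.by.hand(lm(y ~ x, data = d), L = 1)))
```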
# Fixed Effects
library(lfe)
summary(felm(y ~ x | firmid + year | 0 | 0, data=df.petersen)) # firm + year fixed effects
summary(felm(y ~ x | firmid + year | 0 | firmid + year, data=df.petersen)) # fixed effects with clustering
# GLS
library(nlme)
summary(gls(y~x, data=df.petersen))
# Bootstrapped
library(dplyr)
library(broom)
df.boot <- do(bootstrap(df.petersen, 1000), tidy(lm(y ~ x, .))) # 1000 sims; note: bootstrap() is only available in older versions of broom
mean(df.boot[df.boot$term=="x",]$estimate) # mean coef
sd(df.boot[df.boot$term=="x",]$estimate) # standard error
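Newer versions of broom no longer ship bootstrap(), so here is the same idea in base R only. Toy data are simulated so the snippet runs standalone; swap in df.petersen in practice:

```r
# Bootstrap by resampling rows with replacement, base R only
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- 1 + 2 * d$x + rnorm(200)
boot.coefs <- replicate(500, {
  idx <- sample(nrow(d), replace = TRUE)   # resample observations
  coef(lm(y ~ x, data = d[idx, ]))["x"]
})
mean(boot.coefs)  # mean coef
sd(boot.coefs)    # bootstrap standard error
```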
Code for GMM (Exactly-Identified Example)
library(gmm)
data(Finance)
df.fin <- data.frame(rm=Finance[1:500,"rm"], rf=Finance[1:500,"rf"])
# gmm package author Pierre Chaussé recommends using formula approach for linear problems
summary(gmm(rm ~ rf, ~rf, data=df.fin))
# standard approach
g <- function(theta, x) {
  m.1 <- x[,"rm"] - theta[1] - theta[2]*x[,"rf"]
  m.z <- (x[,"rm"] - theta[1] - theta[2]*x[,"rf"])*x[,"rf"]
  f <- cbind(m.1, m.z)
  return(f)
}
summary(gmm(g, df.fin, t0=c(0,0), method = "BFGS", control=list(fnscale=1e-8)))
summary(gmm(g, df.fin, t0=c(0,0), method = "BFGS", control=list(fnscale=1e-8), kernel="Bartlett", bw=bwNeweyWest)) # use Newey-West '94 standard errors
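In the exactly-identified case the GMM estimator solves the sample moment conditions exactly, Z'(y − Xθ) = 0, so the numerical output can be checked against the closed form θ = (Z'X)⁻¹Z'y. A sketch with simulated data (here the instruments equal the regressors, so GMM collapses to OLS):

```r
# Exactly-identified GMM in closed form: theta = (Z'X)^{-1} Z'y
set.seed(1)
n  <- 500
rf <- rnorm(n)
rm <- 0.5 + 1.2 * rf + rnorm(n)
Z  <- cbind(1, rf)   # instruments: constant and rf
X  <- cbind(1, rf)   # regressors (same as instruments here)
theta <- solve(crossprod(Z, X), crossprod(Z, rm))
drop(theta)          # should equal coef(lm(rm ~ rf))
```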
Code to Replicate Fama-French 3-Factor Model + RMW, CMA, and Momentum
The code replicates the construction of the Mkt, SMB (3-factor version), HML, RMW, CMA, and UMD factors from scratch, using Ken French's methodology. I assume you can connect to COMPUSTAT and CRSP through WRDS. My results aren't perfect, though. From July 1926 through December 2017, I get 98%+ correlations with the official factors, but my SMB, HML, RMW, and UMD premiums are off by ~1 bp/month, and my CMA premium is off by 5.5 bps/month.
Data Collected from Ken French Site Used in Code (Davis Book Data, Fama-French Factor returns)
*** Dec 2020 Update ***
I received an email from Victoria Voigts, who identified improvements to better fit CMA. Unfortunately, since I'm no longer updating my code, her revision is not included in the original code. Here's her comment:
When merging your linkfile with the compustat annual file, you need to change the filter to
data.table: .[n==1 | (n>1 & datadate >= linkdt & (datadate <= linkenddt | is.na(linkenddt)))]
or in dplyr: filter(n==1 | (n>1 & datadate >= linkdt & (datadate <= linkenddt | is.na(linkenddt)))),
where n is the number of unique permno/gvkey matches in the linkfile:
data.table: .[, n := as.numeric(length(gvkey)), by='permno']
or in dplyr: group_by(permno) %>% mutate(n = length(gvkey)) %>% ungroup()
otherwise you’re losing a few observations that are crucial for the CMA replication, apparently.
***Mar 2021 Update***
I'm no longer doing research so have no plans to update my code. However, I will continue posting emails from people who have improvement suggestions. The following is from Daniele Ballinari:
I noticed a potential problem. You retrieve the exchange and share codes from CRSP.MSE. At least in my case, the data from this table can contain more than one exchange or share code for a given PERMNO and month (e.g. the stock changes its listing within the month), so the data frame crsp.mse can contain more than one row per PERMNO-Date pair. When you later merge this data frame with the other CRSP data, the merged data frame then also contains multiple observations for certain PERMNO-Date pairs, which becomes problematic when you use na.locf on the share and exchange codes.
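A hedged sketch of one way to guard against this in base R. The column names permno and date are assumed, and keeping the last record within the month is just one possible convention; pick whichever record you consider authoritative:

```r
# Toy crsp.mse with a duplicated permno-month (a within-month listing change)
crsp.mse <- data.frame(permno = c(1, 1, 1, 2),
                       date   = c("2020-01", "2020-01", "2020-02", "2020-01"),
                       exchcd = c(1, 2, 2, 3))
# Keep a single row per permno-month, here the last record within the month
crsp.mse <- crsp.mse[!duplicated(crsp.mse[, c("permno", "date")], fromLast = TRUE), ]
crsp.mse  # 3 rows: the permno 1 / 2020-01 duplicate resolves to exchcd = 2
```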
Other Miscellaneous Code
A file containing useful time-series and predictability-related functions:
Fit ARMA Model using forecast package
Fit VAR model using vars package (based on Bernhard Pfaff's code and book)
Kendall and Stambaugh bias adjustment (using Amihud-Hurvich '04)
Fast/simple way to compound returns
Augmented Diebold-Mariano Test to assess predictability
Encompassing Test to assess predictability
Please contact me if you find errors or have suggestions about the code I've posted. However, I do not review others' code or help with student assignments (check Stack Overflow or Cross Validated instead).