R Code

Section Contents
  • Why R is awesome for research
  • R code to estimate standard errors
  • R code to fit exactly-identified GMM model
  • R code to replicate Fama-French factors (size, value) plus momentum
  • Other miscellaneous R code (e.g. fit VAR model)

In Praise of R and RStudio

I have tried many statistical software programs (e.g. Matlab, STATA, SAS, SPSS) and have settled on R for my research needs. Specifically, RStudio allows me to complete all my research steps within a single program. It is also open source, free, and has a huge community of users who answer questions on StackOverflow. All the following research steps can be done:

  • Import Data: Read data files from other programs. Directly access data from WRDS (no need to use SAS). Directly download data from Quandl and other online databases.
  • Analyze Data: Run almost any statistical routine you'll ever need. Cutting edge techniques are usually first made available through R packages.
  • Plot Data and Create Tables: Generate all your figures and tables directly inside the program. R is renowned for its wide range of plotting capabilities. There's also packages that directly convert summary results or regression output into tables. Never again copy and paste. 
  • Write Text and Equations: RStudio supports RMarkdown, which is an easy way to write text and code that generates beautiful LaTeX output. You can write equations and even run code to embed specific numbers into your text. There's no need for MS Word, but you still have the option to generate word documents. Compile PDF files directly. 
  • Use Version Control: Track changes in different versions of your programs and allow for easy online collaboration. RStudio has version control capabilities built into it, so there's no need for a separate program. Supports both GIT and SVN.
  • Add References: Compile one bibliography and use it for all your papers (e.g. can use BibTeX and auto-generate it from Mendeley or copy and paste from Google Scholar). Use shorthand to cite papers within the text of your paper, and it will show up properly formatted. Auto-compile a References list at the end of the paper that includes all cited papers.
  • Make Presentations: Supports Beamer. Use the same code for equations, plots, and tables you've already written. Saves time and ensures all output is consistent between your paper and presentation. 
The final result allows me to click a single button, and a PDF of my paper or presentation can be generated by re-running all the analyses and re-generating all the figures and tables from scratch. RStudio does all of this, so there's no need to use any other program.



Code for Standard Errors
# R approaches to replicate Petersen '09
# Rmd file and edits kindly provided by Robert McDonald
library(foreign)
df.petersen <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")

# OLS
summary(lm(y~x, data=df.petersen))

# White 
library(sandwich)
library(lmtest)
model <- lm(y~x, data=df.petersen)
coeftest(model, vcovHC(model, type="HC0"))

# Clustered
library(lfe)
summary(felm(y ~ x, data=df.petersen)) # same as OLS
summary(felm(y ~ x, data=df.petersen), robust=TRUE) # same as White
summary(felm(y ~ x | 0 | 0 | firmid, data=df.petersen)) # cluster by firm only
summary(felm(y ~ x | 0 | 0 | year, data=df.petersen)) # cluster by year only
summary(felm(y ~ x | 0 | 0 | firmid + year, data=df.petersen)) # cluster by both year and firm

# Fama-Macbeth
library(plm)
summary(pmg(y ~ x, data=df.petersen, index=c("year","firmid"))) # note that N and T are reversed to take advantage of the Mean Groups estimator; must also avoid NA's

# Newey-West 
# Petersen data not time-series so not appropriate but can use as example anyway
library(sandwich)
library(lmtest)
model <- lm(y~x, data=df.petersen)
coeftest(model, NeweyWest(model, lag = 1, prewhite = FALSE)) # Newey-West '87 1 lag
    # 0 lag should be same as White 
coeftest(model, NeweyWest(model)) # Newey-West '94 auto-lag
floor(bwNeweyWest(model)) # provides lag parameter used by Newey-West '94
coeftest(model, NeweyWest(model, lag = floor(bwNeweyWest(model)), prewhite = TRUE)) # confirm same results

# Fixed Effects
library(lfe)
summary(felm(y ~ x | firmid + year | 0 | 0, data=df.petersen)) # firm + year fixed effects
summary(felm(y ~ x | firmid + year | 0 | firmid + year, data=df.petersen)) # fixed effects with clustering

GLS
library(nlme)
summary(gls(y~x, data=df.petersen))

# Bootstrapped
library(dplyr)
library(broom)
df.boot <- do(bootstrap(df.petersen, 1000), tidy(lm(y ~ x, .))) # 1000 sims
mean(df.boot[df.boot$term=="x",]$estimate) # mean coef
sd(df.boot[df.boot$term=="x",]$estimate) # standard error



Code for GMM (Exactly-Identified Example)
library(gmm)
data(Finance)
df.fin <- data.frame(rm=Finance[1:500,"rm"], rf=Finance[1:500,"rf"])

# gmm package author Pierre Chaussé recommends using formula approach for linear problems
summary(gmm(rm ~ rf, ~rf, data=df.fin))

# standard approach
g <- function(theta, x) {
  m.1 <- x[,"rm"] - theta[1] - theta[2]*x[,"rf"]
  m.z <- (x[,"rm"] - theta[1] - theta[2]*x[,"rf"])*x[,"rf"]
  f <- cbind(m.1, m.z)
  return(f)
}
summary(gmm(g, df.fin, t0=c(0,0), method = "BFGS", control=list(fnscale=1e-8)))
summary(gmm(g, df.fin, t0=c(0,0), method = "BFGS", control=list(fnscale=1e-8), kernel="Bartlett", bw=bwNeweyWest)) # use Newey-West '94 standard errors



Code to Replicate Fama-French 3-Factor Model (plus Momentum)
The code replicates the construction of Mkt, SMB, HML, and UMD factors from scratch, using Ken French's methodologyI assume you can connect to COMPUSTAT and CRSP through WRDS. My results aren't perfect though. From July '26 through Dec '16, I get 99% correlations but under-estimate the size and value premiums by ~1.5/bps per month.

If anyone has suggestions for improvement or can point to other replicating code, please let me know
(I already know about the SAS code from WRDS).

Data Collected from Ken French Site Used in Code (Davis Book Data, Fama-French 4-Factor returns)



Other Miscellaneous Code

  • Fit ARMA Model using forecast package
  • Fit VAR model using vars package (based on Bernhard Pfaff's code and book)
  • Kendall and Stambaugh bias adjustment (using Amihud-Hurvich '04)
  • Fast/simple way to compound returns
  • Augmented Diebold-Mariano Test to assess predictability
  • Encompassing Test to assess predictability

Please contact me if you find errors or have suggestions about the code I've posted. However, I do not review others' code or help with student assignments (check Stack Overflow or Cross Validated instead).