Section Contents- Why R is awesome for research
- R code to estimate standard errors
- R code to fit exactly-identified GMM model
- R code to replicate Fama-French factors (size, value) plus momentum
- Other miscellaneous R code (e.g. fit VAR model)
In Praise of R and RStudioI have tried many statistical software programs (e.g. Matlab, STATA, SAS, SPSS) and have settled on R for my research needs. Specifically, RStudio allows me to complete all my research steps within a single program. It is also open source, free, and has a huge community of users who answer questions on StackOverflow. All the following research steps can be done: **Import Data**: Read data files from other programs. Directly access data from WRDS (no need to use SAS). Directly download data from Quandl and other online databases.**Analyze Data**: Run almost any statistical routine you'll ever need. Cutting edge techniques are usually first made available through R packages.**Plot Data and Create Tables**: Generate all your figures and tables directly inside the program. R is renowned for its wide range of plotting capabilities. There's also packages that directly convert summary results or regression output into tables. Never again copy and paste.**Write Text and Equations**: RStudio supports RMarkdown, which is an easy way to write text and code that generates beautiful LaTeX output. You can write equations and even run code to embed specific numbers into your text. There's no need for MS Word, but you still have the option to generate word documents. Compile PDF files directly.**Use Version Control**: Track changes in different versions of your programs and allow for easy online collaboration. RStudio has version control capabilities built into it, so there's no need for a separate program. Supports both GIT and SVN.**Add References**: Compile one bibliography and use it for all your papers (e.g. can use BibTeX and auto-generate it from Mendeley or copy and paste from Google Scholar). Use shorthand to cite papers within the text of your paper, and it will show up properly formatted. Auto-compile a References list at the end of the paper that includes all cited papers.**Make Presentations**: Supports Beamer. Use the same code for equations, plots, and tables you've already written. Saves time and ensures all output is consistent between your paper and presentation.
The final result allows me to click a single button, and a PDF of my paper or presentation can be generated by re-running all the analyses and re-generating all the figures and tables from scratch. RStudio does all of this, so there's no need to use any other program. Code for Standard Errors# R approaches to replicate Petersen '09 # Rmd file and edits kindly provided by Robert McDonald library(foreign) df.petersen <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta") # OLSsummary(lm(y~x, data=df.petersen)) # White library(sandwich) library(lmtest) model <- lm(y~x, data=df.petersen) coeftest(model, vcovHC(model, type="HC0")) # Clusteredlibrary(lfe) summary(felm(y ~ x, data=df.petersen)) # same as OLS summary(felm(y ~ x, data=df.petersen), robust=TRUE) # same as White summary(felm(y ~ x | 0 | 0 | firmid, data=df.petersen)) # cluster by firm only summary(felm(y ~ x | 0 | 0 | year, data=df.petersen)) # cluster by year only summary(felm(y ~ x | 0 | 0 | firmid + year, data=df.petersen)) # cluster by both year and firm # Fama-Macbethlibrary(plm) summary(pmg(y ~ x, data=df.petersen, index=c("year","firmid"))) # note that N and T are reversed to take advantage of the Mean Groups estimator; must also avoid NA's # Newey-West # Petersen data not time-series so not appropriate but can use as example anyway library(sandwich) library(lmtest) model <- lm(y~x, data=df.petersen) coeftest(model, NeweyWest(model, lag = 1, prewhite = FALSE)) # Newey-West '87 1 lag # 0 lag should be same as White coeftest(model, NeweyWest(model)) # Newey-West '94 auto-lag floor(bwNeweyWest(model)) # provides lag parameter used by Newey-West '94 coeftest(model, NeweyWest(model, lag = floor(bwNeweyWest(model)), prewhite = TRUE)) # confirm same results # Fixed Effectslibrary(lfe) summary(felm(y ~ x | firmid + year | 0 | 0, data=df.petersen)) # firm + year fixed effects summary(felm(y ~ x | firmid + year | 0 | firmid + year, data=df.petersen)) # fixed effects with clustering # GLSlibrary(nlme) summary(gls(y~x, data=df.petersen)) # Bootstrappedlibrary(dplyr) library(broom) df.boot <- do(bootstrap(df.petersen, 1000), tidy(lm(y ~ x, .))) # 1000 sims mean(df.boot[df.boot$term=="x",]$estimate) # mean coef sd(df.boot[df.boot$term=="x",]$estimate) # standard error Code for GMM (Exactly-Identified Example)library(gmm) data(Finance) df.fin <- data.frame(rm=Finance[1:500,"rm"], rf=Finance[1:500,"rf"]) # gmm package author Pierre Chaussé recommends using formula approach for linear problems summary(gmm(rm ~ rf, ~rf, data=df.fin)) # standard approach g <- function(theta, x) { m.1 <- x[,"rm"] - theta[1] - theta[2]*x[,"rf"] m.z <- (x[,"rm"] - theta[1] - theta[2]*x[,"rf"])*x[,"rf"] f <- cbind(m.1, m.z) return(f) } summary(gmm(g, df.fin, t0=c(0,0), method = "BFGS", control=list(fnscale=1e-8))) summary(gmm(g, df.fin, t0=c(0,0), method = "BFGS", control=list(fnscale=1e-8), kernel="Bartlett", bw=bwNeweyWest)) # use Newey-West '94 standard errors Code to Replicate Fama-French 3-Factor Model (plus Momentum)The code replicates the construction of Mkt, SMB, HML, and UMD factors from scratch, using Ken French's methodology. I assume you can connect to COMPUSTAT and CRSP through WRDS. My results aren't perfect though. From July '26 through Dec '16, I get 99% correlations but under-estimate the size and value premiums by ~1.5/bps per month. If anyone has suggestions for improvement or can point to other replicating code, please let me know (I already know about the SAS code from WRDS). Fama-French Replication Support Functions (must execute first) Data Collected from Ken French Site Used in Code (Davis Book Data, Fama-French 4-Factor returns) Other Miscellaneous Code- Fit ARMA Model using forecast package
- Fit VAR model using vars package (based on Bernhard Pfaff's code and book)
- Kendall and Stambaugh bias adjustment (using Amihud-Hurvich '04)
- Fast/simple way to compound returns
- Augmented Diebold-Mariano Test to assess predictability
- Encompassing Test to assess predictability
Please contact me if you find errors or have suggestions about the code I've posted. However, I do not review others' code or help with student assignments (check Stack Overflow or Cross Validated instead). |