JieJayCao - Programs and Data

Programs and Data

Without hard working, efficient programming plays no role in success.

Tips for Programming and Data Work

1. SAS with SQL

2. Matlab

3. Stata

4. PERL

5. C

6. Other useful tools

7. Link financial databases

8. Use WRDS Unix server

1. SAS with SQL

With SQL procedure, SAS is really efficient in handling huge data sets such as NYSE TAQ database. With IML procedure, SAS can work on matrix operations, similar to Matlab or Gauss. With AUTOREG procedure, SAS can do fancy time-series regressions, similar to EViews. Compared with C or Matlab, efficiency might be a concern for SAS. I even conducted 12 million times EGARCH regressions using SAS on WRDS Unix server and it was finished much faster than expected. More importantly, SAS is almost a must-have tool for empirical work, particularly for empirical asset pricing. Here I have posted some SAS codes for research purpose.

- Winsorize: Have found outliers in your data? This SAS macro can easily winsorize or trim extreme observations.
- Correlation matrix: this SAS macro generates the time-series average of cross-sectional correlation matrix. Both Pearson and nonparametric Spearman correlations are reported.
- Single portfolio sorting: this SAS macro can help make a neat table in Excel for single portfolio sorting. The output includes raw return, CAPM alpha, Fama-French three-factor alpha and Carhart four-factor alpha. The Newey West t-stat is reported in bracket and the significance level for top-bottom difference is labeled in */**/***. The code works for daily/weekly/monthly frequency. See sample output 1.
- Double portfolio sorting: this SAS macro can help make a neat table in Excel for double portfolio sorting. The output includes raw return, CAPM alpha, Fama-French three-factor alpha and Carhart four-factor alpha. The Newey West t-stat is reported in bracket and the significance level for all top-bottom difference is labeled in */**/***. The code works for daily/weekly/monthly frequency. See sample output 2.
- Fama-MacBeth regressions: this SAS code conducts Fama-MacBeth regressions and reports the average coefficients as well as R-squares in a well-organized table format. The Newey West t-stat is reported in bracket and the significance level is labeled in */**/***. The time periods covered and the total observations involved are also listed. See sample output 3.
- This SAS example shows how to use the single portfolio sort macro and Fama-MacBeth regressions codes.

2. Matlab

Matlab was my favorite when I studied in the economics Ph.D. program. The most important thing I have learned using Matlab is how to make effective programming, which could be applied to any other languages. As an expert in matrix computation, Matlab provides freedom and full control over all desired statistical procedures or measures. It is also good at simulation and producing 3-dimentional figures. However, Matlab seems to have limitations on dealing with huge numeric / text data, where SAS / PERL stands out.

- Econometric Tools for Matlab (from James P. LeSage) is highly recommended. Though you may not find the exact function you want, you can make your own based on one of these open source codes.
- Mike Cliff's website lists a bunch of Matlab functions for finance research like finding Hansen-Jagannathan bounds and computing Newey-West adjusted standard errors.
- The formal mean-variance-spanning tests (for the shifts of efficient-frontiers) can be easily conducted and plotted in Matlab. It includes Huberman and Kandel (1987)'s likelihood ratio test. Ferson, Foerster and Keim (1993)'s GMM Wald test under conditional heteroskedasticity. Bekaert and Urias (1996)'s GMM Wald test under stochastic discount factor approach. Here is an example (Table and Figure) from my research project.
- In this computational finance course project, I wrote simple Matlab codes to simulate the growth paths of individual firms with transition shocks, and show how the cash holding-cash flow sensitivity (a proxy for financing constraints) changes over firm age, under frictionless capital market.

3. Stata

As a read-to-use software, Stata is small but very useful. It is an ideal tool to handle panel data and has been extensively used for research in corporate finance. There are many existing modules on statistical analysis and the priority is to figure out how to use them correctly. However, there is little flexibility for programming. Here are some tips on using Stata for empirical research.

- Mitchell Petersen's Programming Advice provides information and Stata codes on estimating standard errors of panel data. It includes Fama-MacBeth regressions, fixed effects, and bootstrapped standard errors, etc.
- Sometimes it is convenient to handle raw data in SAS and then perform statistical analysis in Stata. Stat/Transfer is a cute tool to switch the data types.
- Stata is easy to use but it is a little painful to save the outputs. OUTREG command can generate formulized results in a CSV file. See more details hereabout how to download and use it.
- For SAS fans to use Stata commands, here is a good start: Stata for the Struggling SAS Mind.

4. PERL

PERL is my best tool to parse text. It is fast and can read a huge amount of contents into buffer memory. PERL can be used to automatically collect data or download files from internet by parsing HTML source code. These could include numeric numbers, journal articles, finance news or statements, and SEC fillings.

- This is my first PERL code. It splits one large single file, with thousand of news from Wall Street Journal or Dow Jones News, into individual news files, and record relevant information such as news time, journals, tickers, and exchanges.
- In a research project, I made this simple PERL code to record the identification numbers of real-time auctions of Wii Game console on Ebay. Then I used another PERL code to retrieve all the bidding history information after these auctions ended.
- This tutorial shows how to download 10-K filings from SEC's EDGAR using PERL.

5. C

I learned C in late 1990s as my first programming language, by which I passed the Chinese National Computer Rank Examination (NCRE) Level-2. Though I like C's flexibility a lot, I rarely write C codes for economics / finance research since Matlab is already a good balance of flexibility and simplicity. However, understanding C is still quite useful. There are many famous numerical algorithms written by C and the ideas can be transferred to Matlab.

6. Others useful tools

- General inquirer: it is a computer-assisted approach for content analyses of textual data. The embedded dictionary can categorize the keywords into positive, negative, or neural. In finance research, the qualitative information of various finance news can be analyzed to forecast future earnings or returns. More details can be found in Prof. Paul Telock's 2008 JF paper.
- Readiris: it converts scanned documents from image to text information. Very useful for reading old documents like this example. Some manual training are needed to provide precise conversion.
- Scientific workplace: as an excellent alternative to Latex, it is easy to use and has strong computational capability. Here is my class notes of Time-Series made by scientific workplace.
- WinEdt: it is a classical editor for Latex and also works for PERL coding.

7. Link financial databases

This table lists the primary/secondary identifier for each database

General notes for linking databases:

- NCUSIP is the historical CUSIP and changes over time. CUSIP is the current NCUSIP. One NCUSIP is only valid for a specific period. A historical NCUSIP during a specific period will correspond to only one current CUSIP and PERMNO. [www.cusip.com]
- The NCUSIP in Thomson, I/B/E/S, ISSM, TAQ and Option-Metrics is labeled as 'CUSIP'.
- In Compustat, CNUM + first 2 digit of CIC is the CUSIP.
- The major matching variable across databases are NCUSIP and then Ticker.
- The SDC 6-digit NCUSIP can identify unique company (PERMCO), but not the unique common stock (PERMNO and 8-digit NCUSIP), if a company has issued multiple common stocks. The last two digits of a 8-digit NCUSIP are the issue code.
- The CUSIP-NCUSIP transition file builds a link between NCUSIP and CUSIP as well as PERMNO at a specified time interval. [Download the transition file here - updated till Dec 2019]
- For ISSM database, all NYSE and AMEX stocks from 1983 to 1992, and NASDAQ stocks after 1990 can be matched by NCUSIP. NASDAQ stocks before 1990 could be matched by SMBL, which at a given month & exchange corresponds to the Ticker in CRSP.
- For TAQ databse, stocks can be matched by the first 8 digits of TAQ's 12-digit NCUSIP.
- Mutual Fund Links (MFLINKS) connects CRSP mutual fund information to Thomson (S12) mutual fund holding data.
- Matching by company or fund name is difficult as the last resort. The SAS function 'SPEDIS" can determine the likelihood of two words matching.
- Extra efforts are needed for a precise matching. See this sample SAS Code to generate a link between I/B/E/S and CRSP using multiple identifiers. (Internet connection and access to both I/B/E/S and CRSP data at WRDS are required)

8. Use WRDS Unix server:

WRDS Unix server is very helpful for research. It not only lets you submit three SAS programs simultaneously, but also allows you to share data or files with other users. Here is a collection of my frequently used basic Unix commands on WRDS Unix server. [need WRDS account and SSH Secure Shell installed on local computer]