I provide here a series of statistical packages which I have written for Stata (which can also be installed from the Stata SSC archive by typing ssc install packagename in Stata) along with relevant econometric programs I have built in other computer languages. I also provide files to reproduce econometric analysis in all current research along with the datasets consulted (or links to pages to download data). The most recent versions of my on-going research code are available publicly at my github page, which is updated frequently (https://github.com/damiancclarke). Also, for general interest, I provide some information on software tools which I use regularly.
rwolf calculates Romano and Wolf's (Econometrica 2005; JASA 2005) stepdown
adjusted p-values to correct for multiple hypothesis testing. This
program follows the algorithm described in Romano and Wolf (Statistics and Probability Letters 2016), and provides a p-value corresponding to
each of a series of J dependent (outcome) variables when testing multiple
hypotheses against a single independent (or treatment) variable. The rwolf
algorithm constructs a null distribution for each of the J hypothesis
tests based on Studentized bootstrap replications of a subset of the
tested variables. Full details of the procedure are described in Romano
and Wolf (2016).
An example of this procedure is provided in Clarke and Muhlrad (2016).
Install from the command line in Stata by typing ssc install rwolf.
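As a rough illustration of the step-down logic, the following is a minimal Python sketch, not the rwolf source code. It assumes that the observed Studentized statistics and the bootstrap (null-centred) Studentized statistics have already been computed; the function name and argument layout are hypothetical.

```python
def romano_wolf_pvalues(t_obs, t_null):
    """Step-down adjusted p-values (illustrative sketch).

    t_obs  : list of S observed Studentized statistics, one per hypothesis.
    t_null : list of M bootstrap replications, each a list of S
             null-centred Studentized statistics.
    """
    n_hyp = len(t_obs)
    n_rep = len(t_null)
    # Work from the most significant hypothesis to the least significant.
    order = sorted(range(n_hyp), key=lambda j: abs(t_obs[j]), reverse=True)
    adjusted = [None] * n_hyp
    previous = 0.0
    for step, j in enumerate(order):
        # Only hypotheses not yet dealt with enter the max statistic.
        remaining = order[step:]
        exceed = sum(
            1
            for rep in t_null
            if max(abs(rep[k]) for k in remaining) >= abs(t_obs[j])
        )
        p_initial = (exceed + 1) / (n_rep + 1)
        # Enforce monotonicity: p-values cannot fall along the ordering.
        previous = max(previous, p_initial)
        adjusted[j] = previous
    return adjusted
```

The adjusted p-values are returned in the original order of the hypotheses, so they can be lined up directly with the unadjusted ones.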
plausexog implements Conley et al.'s (2012)
plausibly exogenous bound estimation in Stata. This allows for
statistical inference when a researcher believes that a potential
instrumental variable (IV) may be 'close to' but not precisely
exogenous. This package implements a number of methods described by
Conley et al., allowing for the relaxation of the traditional exclusion
restriction in IV methods.
This extends the original Conley et al. code (available on Christian Hansen's webpage)
to deal with graphing and a wide range of data scenarios in Stata. I am
grateful to Christian Hansen for providing very useful comments. A pdf
with a few examples is available here, and a typical output is included below:
Download ado and help file, or simply install directly from the command line in Stata by typing ssc install plausexog.
New version now available with faster graphical output and additional options.
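To give a flavour of one of the Conley et al. approaches, the union of confidence intervals, here is a minimal Python sketch. It is not the plausexog code: it assumes a single endogenous regressor, a single instrument, no controls and homoskedastic errors, and the function names, grid size and critical value are illustrative assumptions. For each candidate direct effect gamma of the instrument on the outcome, the outcome is adjusted, IV is re-estimated, and the resulting confidence intervals are combined.

```python
import math


def iv_slope(y, x, z):
    """Just-identified IV slope of y on x using z, with a homoskedastic SE."""
    n = len(y)
    y_bar, x_bar, z_bar = sum(y) / n, sum(x) / n, sum(z) / n
    yd = [v - y_bar for v in y]
    xd = [v - x_bar for v in x]
    zd = [v - z_bar for v in z]
    s_zx = sum(a * b for a, b in zip(zd, xd))
    s_zy = sum(a * b for a, b in zip(zd, yd))
    s_zz = sum(a * a for a in zd)
    beta = s_zy / s_zx
    resid = [a - beta * b for a, b in zip(yd, xd)]
    sigma2 = sum(e * e for e in resid) / (n - 2)
    return beta, math.sqrt(sigma2 * s_zz) / abs(s_zx)


def uci_bounds(y, x, z, gamma_min, gamma_max, steps=21, crit=1.96):
    """Union of confidence intervals over a grid of gamma values.

    gamma is the assumed direct effect of the instrument z on y;
    steps must be at least 2.
    """
    lower, upper = math.inf, -math.inf
    for i in range(steps):
        gamma = gamma_min + (gamma_max - gamma_min) * i / (steps - 1)
        # Remove the assumed direct effect of z before estimating by IV.
        y_adj = [yi - gamma * zi for yi, zi in zip(y, z)]
        beta, se = iv_slope(y_adj, x, z)
        lower = min(lower, beta - crit * se)
        upper = max(upper, beta + crit * se)
    return lower, upper
```

Setting gamma_min and gamma_max both to zero recovers the usual IV confidence interval, while widening the range can only widen the bounds.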
worldstat is a module which allows for the current state of world development to be visualised in a computationally simple way. worldstat presents both the geographic and temporal variation in a wide range of statistics which represent the state of national development. While worldstat includes a number of "in-built" statistics such as GDP, maternal mortality and years of schooling, it is extremely flexible, and can (thanks to the World Bank's module wbopendata) easily incorporate over 5,000 other indicators housed in World Bank Open Databases. Program output in Stata:
Download ado and help file, or simply install directly from the command line in Stata by typing ssc install worldstat.
mergemany is an extension to the command merge, providing a flexible way for many 'using' datasets to be merged into one final dataset. Merges can be performed based upon a user-defined list of files, by using the numerical regularity of file names, or by including all datasets of a given type stored in a single directory. mergemany also allows the user to import and merge an arbitrary number of .dta or non-.dta files in a single step.
Download ado and help file, or simply install directly from the command line in Stata by typing ssc install mergemany.
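For readers working outside Stata, the same idea of folding many files into one dataset on a shared key can be sketched in a few lines of Python. This is only an analogy to what mergemany does, not a port of it; the function name and the assumption of CSV inputs with a common key column are mine.

```python
import csv


def merge_many(sources, key):
    """One-to-one merge of several CSV sources on a shared key column.

    Each source is an open text file, or any iterable of CSV lines
    whose first line is a header containing `key`.
    """
    merged = {}
    for source in sources:
        for row in csv.DictReader(source):
            # Rows with the same key are combined; later sources
            # overwrite earlier ones where column names clash.
            merged.setdefault(row[key], {}).update(row)
    # Return one record per key, sorted by key (as strings).
    return [merged[k] for k in sorted(merged)]
```

To merge every CSV file in a directory, the list of sources could be built with glob, for example from sorted(glob.glob("data/*.csv")), where the directory name is hypothetical.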
erate (written with Pavel Luengas Sierra) is a module which allows exchange rate conversion between any two currency pairs by consulting Google's currency conversion tool to give up-to-date (up-to-minute) rates. Exchange rates are automatically saved as an rlist and are readily accessible for future calculations. Alternatively, the user can store results as variables for use in data manipulation involving exchange calculations.
genspec is an algorithm for general-to-specific model prediction in Stata. It is designed to search a large number of explanatory variables, and from these explanatory variables select the 'best' model based upon their relevance and power in explaining the dependent variable of interest. genspec implements a series of tests and search paths as outlined in the (growing) econometric literature on general-to-specific modelling. For further details see "General to Specific Modelling in Stata" in the Stata Journal, or an ungated copy on the research section of this site.
Download ado and help file, or simply install the latest version directly from the command line in Stata by typing ssc install genspec.
arrowplot creates graphs showing inter- and intra-group variation by overlaying arrows for intra-group (regression) trends on a scatter plot. This is similar to the well known Stevenson-Wolfers happiness graphs. Program output is below (click for full-screen image): The ado and help file are available here, or (shortly) from the command line in Stata via ssc install arrowplot.
ping is a simple program to determine whether Stata is able to connect to the internet. It is intended for use in scripts or programs where the user wishes to determine if internet connectivity is available prior to running a command (such as webuse, ssc, and other similar functions). The user simply needs to type 'ping' at the command line to run, and is returned r(ping)="Yes" in an rlist if connected.
Download ado here.
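The same kind of connectivity check is easy to sketch in Python. The version below is an illustration rather than a description of how the Stata ping program works internally: it simply attempts a TCP connection, and the default host and port (a public DNS server on port 53) are assumptions chosen for the example.

```python
import socket


def ping(host="8.8.8.8", port=53, timeout=3.0):
    """Return "Yes" if a TCP connection to host:port succeeds, else "No"."""
    try:
        # create_connection raises OSError on refusal, timeout or no route.
        with socket.create_connection((host, port), timeout=timeout):
            return "Yes"
    except OSError:
        return "No"
```

A script can then guard any download step with a test such as ping() == "Yes" before going online.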
Regression code written in C
The package regression.c allows the user to run a multivariate regression using C, a low-level computer language. The benefit of running regressions directly in C is that, being compiled code, the computation may run considerably faster than interpreted code such as that used in most higher-level statistical languages. Whilst stable, this is a work in progress and should be used with care. Further details will be included shortly in a working paper on parallel computing on this site.
Monte Carlo Simulations in Octave/MATLAB and in Stata
The following programs present a simple Monte Carlo simulation of two-stage least squares (IV) and ordinary least squares estimates in conditions where the exogeneity of the additive error terms does not hold (in OLS and IV estimates). These simulations allow the user to determine the degree to which an instrumental variable may be correlated with the unobserved second-stage error term, and hence the degree to which IV estimates of the true parameter will be biased. Program output for Octave or MATLAB is below (click for full size), and program files are available as Endog_MC.
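The logic of such a simulation can be sketched in plain Python. This is a minimal illustration under assumed parameter values, not a translation of Endog_MC: rho is the loading of the structural error on the regressor (endogeneity), gamma is a direct effect of the instrument on the outcome (a violation of the exclusion restriction), and all names and defaults are hypothetical.

```python
import random


def cov(a, b):
    """Sample covariance (dividing by n) of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / len(a)


def simulate(n_reps=200, n_obs=500, beta=1.0, rho=0.5, gamma=0.0, seed=1):
    """Return the mean OLS and mean IV estimate of beta across replications."""
    rng = random.Random(seed)
    ols, iv = [], []
    for _ in range(n_reps):
        z = [rng.gauss(0, 1) for _ in range(n_obs)]  # instrument
        u = [rng.gauss(0, 1) for _ in range(n_obs)]  # structural error
        v = [rng.gauss(0, 1) for _ in range(n_obs)]  # first-stage noise
        # x is endogenous whenever rho != 0.
        x = [zi + rho * ui + vi for zi, ui, vi in zip(z, u, v)]
        # z is an invalid instrument whenever gamma != 0.
        y = [beta * xi + gamma * zi + ui for xi, zi, ui in zip(x, z, u)]
        ols.append(cov(x, y) / cov(x, x))
        iv.append(cov(z, y) / cov(z, x))
    return sum(ols) / n_reps, sum(iv) / n_reps
```

In this design the IV estimate centres on beta + gamma, so comparing simulate(gamma=0.0) with simulate(gamma=0.3) shows directly how a small violation of exogeneity carries over into bias, while OLS is biased whenever rho is non-zero.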
The Demographic and Health Surveys
My work with Sonia Bhalotra on twin endogeneity and that focusing on maternal mortality uses cross-country data provided by the Demographic and Health Surveys. These are a set of surveys collected in over 80 developing countries over 20+ years. The cross-country dataset used must be constructed from approximately 1600 individual survey databases which are freely available to download (by application) at measuredhs.com. Here I provide files which automate this process of downloading, unzipping, appending and merging the DHS country data. This process requires two programs: a Python script DHS_Import.py and a Stata file DHS_multicountry.do, along with a text file DHS_Countries.txt which should be saved in the folder where the programs are saved. These programs are flexible and should run on any operating system provided access to the internet is available and Python is installed. For full details and instructions see the pdf document attached here.
For the latest version (which also downloads GIS data -- see below for coverage), download from here.
United States National Vital Statistics System Fetal Death Files
The United States National Center for Health Statistics (NCHS) makes available data for all births and fetal deaths occurring in the USA each year. For births, this data is available for 1968-2012 (early years are a 50% sample), and for fetal deaths for 1982-2012. This data is stored as fixed-width text files (originally electronic tape). In order to read these files into Stata, a dictionary file is required, which allows for crude text registers to be converted to Stata's dta format. The NBER makes available dictionary files for birth data (here); however, as far as I know, no similar resource was available for the fetal death files. I have written dictionary files to read fetal death data into Stata. The original fixed-width text files of public use microdata can be downloaded from the USA CDC website. To convert this microdata to a Stata format (which can then be exported to csv for use in other languages), the following files should be downloaded: The README file contains instructions on how to run these files. It is reasonably simple, and just requires downloading the microdata, and then changing one local in the fetlNVCS.do file based on the location of the data on the user's machine. Upon running the do file in Stata, a full set of dta files will be produced (one per year), each containing all variables and observations in the familiar Stata format. The most up-to-date version of these files is also available on GitHub.
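For anyone unfamiliar with what a dictionary file does, the idea is simply to map column positions in each fixed-width record to named variables. The Python sketch below shows the principle; the three-field layout is invented purely for illustration and is not the actual layout of the NCHS fetal death files.

```python
def parse_fixed_width(line, layout):
    """Slice one fixed-width record into named fields.

    layout is a list of (name, start, end) tuples using 1-based,
    inclusive column positions, as they usually appear in a codebook.
    """
    return {name: line[start - 1:end].strip() for name, start, end in layout}


def read_records(lines, layout):
    """Yield one dict per non-empty record in an iterable of lines."""
    for line in lines:
        if line.strip():
            yield parse_fixed_width(line.rstrip("\n"), layout)


# Hypothetical layout, for illustration only (NOT the real NCHS layout).
EXAMPLE_LAYOUT = [("year", 1, 4), ("state", 5, 6), ("weight", 7, 10)]
```

Passing an open text file to read_records gives one dictionary per record, which can then be written out with the csv module if a csv copy is wanted.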
National Health Interview Survey
The National Health Interview Survey (NHIS) has been run in the USA since 1957 (more details here). Data is made available online at the CDC website in fixed-width text file format. These can be read into Stata format using dictionary files, which are made available for certain years online at the CDC website or at NBER. However, no centralised files are available for all years (since 1997 when the survey changed). Here I make available (thanks also to the work of Yu-Kuan Chen) full files to download all data automatically, and to read these all into Stata with two or three commands. All that is required to download and import all data to Stata format is Python (if you wish to download data automatically), and Stata to read crude dat files to Stata's dta format.
Full details are available in the README file in the folder below explaining comprehensively how to run these programs on Windows, Mac, or Unix operating systems. Apart from this, all that is required is the Python file NHISprocess.py in the zipped folder. If you just wish to download the dictionary files for Stata (1997-2013), they are available here. Files are also available on my github account, and are distributed for free use and alteration for any reason under the GNU General Public License.
The 'GPU' refers to the Graphics Processing Unit included in computational systems like laptop and desktop computers. These have been produced to render high-quality graphics, for example to run video games and other graphical programs. However, these also offer significant advantages to users interested in running high-performance scientific computations via parallel processing. Essentially, GPUs consist of a large number of cores which function like (a highly simplified version of) the central processing unit (CPU) of a computer. This is particularly useful for parallel computing. Rather than parallelising jobs over the relatively small number of CPUs available to a computer, these can be run in parallel over the many more cores available on the GPU (off the shelf modern GPUs now contain > 1,000 cores). In some situations this can offer significant speed advantages over traditional computing.
GPU computation has been simplified to run in many languages including R, Python and MATLAB. There are also entire platforms dedicated to GPU computing, such as CUDA and OpenCL, which can act as extensions to C and C++ (among other things).
GPU Computing using Ubuntu with Optimus
The ease of installation of the necessary routines for GPU computing depends upon the operating system in question. When installing this on Unix based systems running Optimus, significant challenges arise given the difficulty of integrating NVIDIA's proprietary drivers with a Unix OS. However, the Bumblebee Project offers a significant improvement in this area. Below I provide a document which outlines how I set up and run CUDA programs under Ubuntu 12.04 with Optimus. I then provide an example of running MATLAB's specific GPU functions included in their parallel toolbox.
Installing and Running CUDA (and MATLAB's GPU functions) on Ubuntu 12.04 (pdf)
A discussion on Marginal Revolution and related links demonstrates that economists use technology in a broad spectrum of ways. Clearly each person's "tech ecosystem" depends upon their individual needs, the analysis and programs they commonly run, and, in large part, personal preference. I provide here a discussion of my tech ecosystem, not because I think that it should necessarily be adopted or that it is the 'best' way to construct a computational system for economic analysis, but rather to shed some light on one possible setup which generally works reasonably efficiently.
My principal hardware is my laptop, a Samsung 550P5C. This laptop is generally sufficient for me to undertake the majority of microeconometric analysis, and offers considerable performance in parallel computing given that it has a quad-core Intel i7 processor (eight virtual cores) plus an Nvidia GPU with 96 cores. I am not a user of tablets or mobile phones, as I find that my laptop generally gives me sufficient functionality and connectivity.
In terms of software, I run Ubuntu 14.04 as an operating system. For statistical analysis I use Octave (a free and open source alternative that is largely compatible with MATLAB), Stata, and R. These statistical languages offer a range of similar tools, although each has its own strengths and weaknesses: Stata is excellent with traditional estimators such as OLS, IV, and many more, and is the program of choice of most of my colleagues; Octave and MATLAB are best for solving formal economic models in which maximisation or other mathematically modelable decisions occur; and R is very strong on visualisation, as well as on integration with other languages to (for example) present data online. I am currently experimenting with Julia as a powerful and quick alternative to these languages. I typeset documents using LaTeX. For all my programming I use the text editor emacs.
When programming outside of statistical programs I generally use Python for its simplicity, but I also work with the lower-level language C. Given its transversal nature, C is particularly useful for linking with other programs such as Stata (via Stata plugins), the GPU (via OpenCL or CUDA), MATLAB (via .mex files) and many others. I use the program git for version control and link to github.com, an online repository for code. For physical back-up of documents I use the absolutely excellent program rsync to very quickly send things to a number of external drives. I use open source software where possible given that these systems offer the greatest legal benefit to the largest possible groups of users. Of those programs listed above the notable exception is Stata, which I use given its widespread adoption by economists, and the important network effects which this implies.
If you have any questions about these programs, or any tips on software which I might like or find useful, I would be happy to hear from you.
With thanks to xkcd. For extra credit, type
in your python interpreter.