Computation




I provide here a series of statistical packages which I have written for Stata (these can also be installed from the Stata SSC archive by typing  ssc install packagename  in Stata), along with relevant econometric programs I have written in other languages.  I also provide files to reproduce the econometric analysis in all of my current research, along with the datasets used (or links to pages from which the data can be downloaded).  Also, for general interest, I provide some information on software tools which I use regularly.



Stata Packages

    worldstat

           New version now available with faster graphical output and additional options.

worldstat is a module which allows for the current state of world development to be visualised in a computationally simple way. worldstat presents both the geographic and temporal variation in a wide range of statistics which represent the state of national development. While worldstat includes a number of "in-built" statistics such as GDP, maternal mortality and years of schooling, it is extremely flexible, and can (thanks to the World Bank's module wbopendata) easily incorporate over 5,000 other indicators housed in World Bank Open Databases. Program output in Stata:

An example output from worldstat.

Download ado and help file, or simply install directly from the command line in Stata by typing   ssc install worldstat.


    mergemany

mergemany is an extension to the command merge, providing a flexible way for many 'using' datasets to be merged into one final dataset.  Merges can be performed based upon a user-defined list of files, by using the numerical regularity of file names, or by including all datasets of a given type stored in a single directory.  mergemany also allows the user to import and merge an arbitrary number of .dta or non-.dta files in a single step.
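
The idea of folding many 'using' datasets into one final dataset can be sketched outside Stata. The Python fragment below is only an illustration under stated assumptions, not mergemany's implementation: the function names merge_two and merge_many, the representation of each dataset as a list of dictionaries, and the key "id" are all hypothetical.

```python
from functools import reduce


def merge_two(master, using, key):
    """One-to-one merge of two datasets (lists of dicts) on `key`.

    Observations found in either dataset are kept. Where a variable
    exists in both, the master value is retained.
    """
    merged = {row[key]: dict(row) for row in master}
    for row in using:
        target = merged.setdefault(row[key], {})
        for name, value in row.items():
            target.setdefault(name, value)  # master values win on conflict
    return [merged[k] for k in sorted(merged)]


def merge_many(datasets, key):
    """Fold a list of datasets into one by merging them left to right."""
    return reduce(lambda master, using: merge_two(master, using, key), datasets)


# Hypothetical example data: three small "files" sharing the key "id".
gdp = [{"id": 1, "gdp": 10.0}, {"id": 2, "gdp": 20.0}]
school = [{"id": 1, "school": 8}, {"id": 3, "school": 12}]
mmr = [{"id": 2, "mmr": 150}]

combined = merge_many([gdp, school, mmr], key="id")
```

Here the first dataset plays the role of the master and each later one is merged in turn, so the number of 'using' datasets is arbitrary.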

Download ado and help file, or simply install directly from the command line in Stata by typing   ssc install mergemany.


    erate

erate (written with Pavel Luengas Sierra) is a module which allows exchange rate conversion between any two currencies by consulting Google's currency conversion tool to give an up-to-date (up-to-the-minute) rate.  Exchange rates are automatically saved as an rlist and are readily accessible for future calculations.  Alternatively, the user can store results as variables for use in data manipulation involving exchange calculations.

Download ado and help file, or simply install directly from the command line in Stata by typing   ssc install erate.

    gets

gets is an algorithm for general-to-specific model prediction in Stata.  It is designed to search a large number of explanatory variables, and from these select the 'best' model based upon their relevance and power in explaining the dependent variable of interest.  gets implements a series of tests and search paths as outlined in the (growing) econometric literature on general-to-specific modelling.  For further details see "General to Specific Modelling in Stata" in the research section of this site.

Download ado and help file, or simply install directly from the command line in Stata by typing   ssc install gets.

    ping

ping is a simple program to determine whether Stata is able to connect to the internet.  It is intended for use in scripts or programs where the user wishes to determine if internet connectivity is available prior to running a command (such as webuse, ssc, and other similar functions).  To run it, the user simply types 'ping' at the command line; if Stata is connected, r(ping)="Yes" is returned in an rlist.
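
For readers who want the same kind of check in a script outside Stata, here is a minimal Python sketch. It is not how ping is implemented. The function name can_connect and the default host and port (a public DNS server, 8.8.8.8 on TCP port 53) are assumptions chosen for illustration.

```python
import socket


def can_connect(host="8.8.8.8", port=53, timeout=3.0):
    """Return True if a TCP connection to (host, port) opens within `timeout`.

    The default host and port are illustrative assumptions; any reliably
    reachable server could be substituted.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

A script can then guard any step that needs the internet with a test such as `if can_connect(): ...` before attempting a download.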

Download ado here.



Other Econometric Code

    Regression code written in C

The package regression.c allows the user to run a multivariate regression using C, a low-level computer language.  The benefit of running regressions directly in C is that, being compiled code, the computation may run considerably faster than the interpreted code used in most higher-level statistical languages.  Whilst stable, this is a work in progress and should be used with care.  Further details will be included shortly in a working paper on parallel computing on this site.
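
To make the underlying computation concrete, the following is a minimal sketch of a multivariate OLS estimator, solving the normal equations (X'X)b = X'y by Gaussian elimination. It is written in Python for readability and is not the code in regression.c. The function name ols and the example data are hypothetical.

```python
def ols(X, y):
    """Return OLS coefficients solving (X'X) b = X'y.

    X is a list of rows (include a column of ones for an intercept),
    y is a list of outcomes of the same length.
    """
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular: regressors are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            for c in range(col, k + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    beta = [0.0] * k
    for i in reversed(range(k)):
        rhs = A[i][k] - sum(A[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = rhs / A[i][i]
    return beta


# Hypothetical data generated exactly from y = 1 + 2*x1 - 3*x2.
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0],
     [1.0, 1.0, 1.0], [1.0, 2.0, 1.0]]
y = [1.0 + 2.0 * x1 - 3.0 * x2 for _, x1, x2 in X]
beta = ols(X, y)
```

The nested loops are the kind of work that compiled code speeds up most, since an interpreter pays a cost on every iteration.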

    Monte Carlo Simulations in Octave/MATLAB and in Stata

The following programs present a simple Monte Carlo simulation of two-stage least squares (IV) and ordinary least squares estimates in conditions where the exogeneity of the additive error terms does not hold (in OLS and IV estimates).  These simulations allow the user to determine the degree to which an instrumental variable may be correlated with the unobserved second-stage error term, and hence the degree to which IV estimates of the true parameter will be biased.  Program output for Octave or MATLAB is below (click for full size), and program files are available as Endog_MC.




             Endog_MC.m        Endog_MC.do
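
As a rough illustration of the experiment described above, here is a minimal Python sketch of such a simulation with a single regressor and a single instrument. It is not a translation of Endog_MC. The data-generating process, the parameter names (beta, gamma) and all default values are assumptions chosen for illustration, with gamma governing how strongly the instrument is correlated with the structural error.

```python
import random
import statistics


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


def simulate(beta=1.0, gamma=0.0, n=500, reps=200, seed=1):
    """Average OLS and IV slope estimates over `reps` simulated samples.

    Model (all names and parameter values are illustrative assumptions):
        z = w + gamma * u      instrument; gamma != 0 makes it invalid
        x = z + u + v          endogenous regressor (shares u with the error)
        y = beta * x + u       outcome with structural error u
    """
    rng = random.Random(seed)
    ols, iv = [], []
    for _ in range(reps):
        u = [rng.gauss(0, 1) for _ in range(n)]
        w = [rng.gauss(0, 1) for _ in range(n)]
        v = [rng.gauss(0, 1) for _ in range(n)]
        z = [wi + gamma * ui for wi, ui in zip(w, u)]
        x = [zi + ui + vi for zi, ui, vi in zip(z, u, v)]
        y = [beta * xi + ui for xi, ui in zip(x, u)]
        ols.append(cov(x, y) / cov(x, x))  # OLS slope
        iv.append(cov(z, y) / cov(z, x))   # IV (just-identified 2SLS) slope
    return statistics.fmean(ols), statistics.fmean(iv)


ols_valid, iv_valid = simulate(gamma=0.0)      # valid instrument
ols_invalid, iv_invalid = simulate(gamma=0.5)  # instrument correlated with u
```

Under this design the large-sample bias of IV is cov(z,u)/cov(z,x), which is zero when gamma = 0 and grows with gamma, while OLS is biased by cov(x,u)/var(x) in both cases.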






 


Data Generation

    The Demographic and Health Surveys

My work with Sonia Bhalotra on twin endogeneity and that focusing on maternal mortality uses cross-country data provided by the Demographic and Health Surveys.  These are a set of surveys collected in over 80 developing countries over 20+ years.  The cross-country dataset used must be constructed from approximately 1600 individual survey databases which are freely available to download (by application) at measuredhs.com.  Here I provide files which automate this process of downloading, unzipping, appending and merging the DHS country data.  This process requires two programs: a Python script DHS_Import.py and a Stata file DHS_multicountry.do, along with a text file DHS_Countries.txt which should be saved in the same folder as the programs.  These programs are flexible and should run on any operating system provided that access to the internet is available and Python is installed.  For full details and instructions see the PDF document attached here.

For the latest version (which also downloads GIS data -- see below for coverage), download from here.






GPU Computation

The 'GPU' refers to the Graphics Processing Unit included in computational systems like laptop and desktop computers.  These have been produced to render high-quality graphics, for example to run video games and other graphical programs.  However, they also offer significant advantages to users interested in running high-performance scientific computations via parallel processing.  Essentially, GPUs consist of a large number of cores which function like (a highly simplified version of) the central processing unit (CPU) of a computer.  This is particularly useful for parallel computing.  Rather than parallelising jobs over the relatively small number of CPU cores available to a computer, these can be run in parallel over the many more cores available on the GPU (off-the-shelf modern GPUs now contain > 1,000 cores).  In some situations this can offer significant speed advantages over traditional computing.

GPU computation has been simplified to run in many languages including R, Python and MATLAB.  There are also entire platforms dedicated to GPU computing, such as CUDA and OpenCL, which can act as extensions to C and C++ (among other things).

GPU Computing using Ubuntu with Optimus

The ease of installation of the necessary routines for GPU computing depends upon the operating system in question.  When installing these on Unix-based systems running Optimus, significant challenges arise given the difficulty of integrating NVIDIA's proprietary drivers with a Unix OS.  However, the Bumblebee Project offers a significant improvement in this area.  Below I provide a document which outlines how I set up and run CUDA programs under Ubuntu 12.04 with Optimus.  I then provide an example of running MATLAB's specific GPU functions included in their parallel toolbox.

Installing and Running CUDA (and MATLAB's GPU functions) on Ubuntu 12.04 (pdf)




Tech Ecosystem 


A discussion on Marginal Revolution and related links demonstrates that economists use technology in a broad spectrum of ways.   Clearly each person's "tech ecosystem" depends upon their individual needs, the analysis and programs they commonly run, and in large part, comes down to personal preference.  I provide here a discussion of my tech ecosystem, not because I think that this is necessarily a way which should be adopted or the 'best' way to construct a computational system for economic analysis, but rather to shed some light on one possible setup which generally works reasonably efficiently.

My principal hardware is my laptop, a Samsung 550P5C.   This laptop is generally sufficient for me to undertake the majority of microeconometric analysis, and offers considerable performance in parallel computing given that it has an Intel i7 processor with four cores (eight virtual cores) plus an NVIDIA GPU with 96 cores.  I am not a user of tablets or mobile phones with internet connectivity, as I find that my laptop and a normal phone allow me sufficient functionality and connectivity for my needs.

In terms of software, I run Ubuntu 12.04 as an operating system.  For statistical analysis I use Octave (a free and open source alternative to MATLAB), Stata, and R.  These statistical languages offer a range of similar tools, although each has its own strengths and weaknesses: Stata is excellent with traditional estimators such as OLS, IV, and many more, and is the program of choice of most of my colleagues; Octave and MATLAB are best for solving formal economic models in which maximisation or other mathematically modelable decisions occur; and R is very strong on visualisation, and integrates well with other languages to (for example) present data online.  I am currently experimenting with Julia as a powerful and quick alternative to these languages.

I typeset documents using LaTeX; my preferred LaTeX GUI is Kile.  For general programming I use the text editor Emacs.  When programming outside of statistical programs I generally use Python for its simplicity, but I also work with the lower-level language C.  Given its transversal nature, C is particularly useful for linking with other programs such as Stata (via Stata plugins), the GPU (via OpenCL or CUDA), MATLAB (via .mex files) and many others.  I use Git for version control and link to bitbucket.org, an online repository for code.

For electronic back-up of documents I use a number of systems, due to each system's restriction on maximum storage space.  These are Ubuntu One (5 GB base), Dropbox (2 GB base) and Wuala (5 GB base).  I use open source software where possible, given that these systems offer the greatest legal benefit to the largest possible groups of users.  Of the programs listed above the notable exception is Stata, which I use given its widespread use by economists and the important network effects which this implies.

If you have any questions about these programs, or any tips of software which I might like or find useful, I would be happy to hear from you.


With thanks to xkcd.  For extra credit, type  
import antigravity 
in your Python interpreter.