Cecile Murray, U.S. Census Bureau

Title: Profiling statistical programming language performance for large-scale data tasks

Abstract:

Large-scale data processing and analysis is not a new challenge for the U.S. Census Bureau, but the number of statistical programming languages and tools available for such work has expanded dramatically in recent years. Before beginning their analyses, researchers often execute standard data management steps, such as loading and merging datasets, but they can implement these steps in very different ways and in different programming languages. To help develop a set of recommended practices for Census Bureau researchers, we evaluate how statistical programming languages perform on a common data management task within the Census Bureau’s high-performance computing cluster. Specifically, we develop Python, SAS, Stata, and R scripts that merge the person, household, and geographic records from the full-count 1990 Census microdata files. We then use the merged data to perform basic analyses, such as counting the number of individuals per household and calculating the average household size for every county in the U.S. We compare the language implementations on runtime and memory usage for each task. We also explore the effect of parallelizing the scripts to leverage available cluster resources.
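
As an illustration only, the benchmark task described above (merge person, household, and geographic records, count individuals per household, average household size by county, and record runtime and memory) might look like the following minimal Python sketch. It is not the scripts evaluated in this work. The record layouts, the field names (person_id, household_id, geo_id, county), the toy data, and the helper names merge_and_summarize and profile are all hypothetical, and the sketch uses only the standard library rather than whichever data libraries the actual scripts rely on.

```python
import time
import tracemalloc
from collections import defaultdict

# Hypothetical toy records. The real full-count 1990 files and their
# field names are not described in the abstract.
persons = [
    {"person_id": 1, "household_id": "H1"},
    {"person_id": 2, "household_id": "H1"},
    {"person_id": 3, "household_id": "H2"},
    {"person_id": 4, "household_id": "H3"},
    {"person_id": 5, "household_id": "H3"},
    {"person_id": 6, "household_id": "H3"},
]
households = [
    {"household_id": "H1", "geo_id": "G1"},
    {"household_id": "H2", "geo_id": "G1"},
    {"household_id": "H3", "geo_id": "G2"},
]
geographies = [
    {"geo_id": "G1", "county": "001"},
    {"geo_id": "G2", "county": "003"},
]


def merge_and_summarize(persons, households, geographies):
    """Join the three record types and compute the two summaries."""
    # Lookup tables stand in for the merge keys (assumed one-to-one).
    household_to_geo = {h["household_id"]: h["geo_id"] for h in households}
    geo_to_county = {g["geo_id"]: g["county"] for g in geographies}

    # Count the number of individuals in each household.
    household_sizes = defaultdict(int)
    for person in persons:
        household_sizes[person["household_id"]] += 1

    # Group household sizes by county. Households with no person
    # records never appear here, so they do not affect the average.
    sizes_by_county = defaultdict(list)
    for household_id, size in household_sizes.items():
        county = geo_to_county[household_to_geo[household_id]]
        sizes_by_county[county].append(size)

    avg_size_by_county = {
        county: sum(sizes) / len(sizes)
        for county, sizes in sizes_by_county.items()
    }
    return dict(household_sizes), avg_size_by_county


def profile(func, *args):
    """Return (result, elapsed seconds, peak traced bytes) for one call."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


(household_sizes, avg_size_by_county), elapsed_s, peak_bytes = profile(
    merge_and_summarize, persons, households, geographies
)
print(household_sizes)
print(avg_size_by_county)
print(f"runtime: {elapsed_s:.6f} s, peak memory: {peak_bytes} bytes")
```

Note that tracemalloc reports only memory allocated through the Python allocator, so it would understate the footprint of libraries that allocate outside it. A cluster-level comparison across Python, SAS, Stata, and R would more plausibly rely on the job scheduler's accounting, which this sketch does not attempt.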