
new (for me) trick for working with big(ish) data in R

posted Sep 23, 2012, 3:45 PM by Tim Riffe   [ updated Sep 20, 2016, 4:29 AM ]
EDIT: updated for the 2013 mortality data, and edited to account for changes in sqldf! And again because the GitHub gist gadgets keep changing.

I wouldn't qualify any of the data I work with as 'big data', but it could easily get there. Even now, some files, like the NCHS fixed-width files for death and birth microdata (here), or many-Gb IPUMSi files, can outstrip the memory on my ageing laptop. I've usually used the read.fwf() function for these, then ended up saving the resulting data.frame as an R binary file for easier reading later. That's still probably a decent way to go, but it is in any case memory intensive. It's also unnecessary if all you're going to need is a custom table. The trick is to use SQL from R, which I learned from JD Long's answer to this question on StackOverflow. It took some fiddling on my part to figure out how to make it work with fixed-width files. Here goes:

My example data are 2013 deaths from the US NCHS. Inside the zipped file there's a fixed-width plain-text file that's about 1.2 Gb as such, but it gets much bigger when read into R as a data.frame. read.fwf() on this file could take me 1.5 hrs or more, though I doubt the process would even finish, because I'd hit my 4Gb RAM ceiling. Let's say all I want is a cross table of some variables in there, and I'm stubborn about getting it from R:

Let's say I want deaths by age, sex, month, and the 39-cause recode. Look in the manual to find the start positions of these variables and their widths. We'll use the R package sqldf, which externalizes the data-reading and table-making to SQL and simply returns the result to R as a data.frame. SQL apparently does its stuff out of memory, so you're no longer held back by system RAM, and it's much faster:
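Since the embedded gist doesn't always render, here's a minimal, self-contained sketch of the trick on a tiny synthetic file. The Cause position (start 160, width 2) is the one described below; the Sex and Month positions are placeholders for the demo, so take the real ones from the NCHS record layout before pointing this at the actual file:

```r
library(sqldf)

# toy stand-in for the NCHS fixed-width file: one line = one death record.
# Sex and Month positions here are made up for the demo!
mkline <- function(sex, month, cause) {
  line <- strrep(" ", 161)
  substr(line, 69, 69)   <- sex
  substr(line, 65, 66)   <- month
  substr(line, 160, 161) <- cause
  line
}
tmp <- tempfile()
writeLines(c(mkline("M", "01", "21"),
             mkline("F", "01", "21"),
             mkline("M", "02", "05")), tmp)

# the trick: pick a sep that never occurs in the file, so each whole line
# lands in a single column named V1, then carve it up with SQL's substr()
f <- file(tmp)
Deaths <- sqldf(
  "select substr(V1, 69, 1)  Sex,
          substr(V1, 65, 2)  Month,
          substr(V1, 160, 2) Cause,
          count(V1)          Deaths
   from f
   group by Sex, Month, Cause",
  file.format = list(header = FALSE, sep = ";"))
Deaths
```

For the real thing, replace tmp with the path to the unzipped NCHS file and adjust the positions per the documentation.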

Now this is something I can work with! [Many thanks to Gabor Grothendieck for the improvements, and to Enrique Acosta for reporting that my old code was broken!]

Gist # 52a926c424e5023478c7

Let's break down what's happening in the sqldf() call, as it looks rather foreign to monolingual R users, such as myself. As you can see, we pass the SQL commands as a single text string. 
  • This string starts with select
  • The next elements, separated by commas, are the columns to include in the result.
    • The first, 'substr(V1, 160, 2) Cause', selects a variable 2 characters wide starting at position 160 in the flat file, and gives it the name Cause.
    • That's the form the next few columns take too, until you get to Deaths, which is wrapped in a count() function. This is the part that actually does the tabulation. As far as I can tell it doesn't really matter which variable you put in there, since the values themselves aren't used: it just counts cases, and in this file one line is one case, so it's enough to count lines of something. So I just put in the substr() statement from one of the other variables. There's likely a better SQL-er way to do it, but this works for me for now.
  • The from f part tells R the name of the file connection being used.
  • group by is the table specification. Here it's really redundant, because we don't have column names to work with, but these are the same file positions as the grouping factors that end up in the columns.
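On the count() point above: standard SQL does have a cleaner way, count(*), which counts rows per group directly, no dummy column required. A tiny illustration on an in-memory data.frame (toy data, not the mortality file):

```r
library(sqldf)

# toy data: one row per death record
DF <- data.frame(Sex   = c("M", "F", "M"),
                 Cause = c("21", "21", "05"))

# count(*) counts rows in each group -- no need to repeat a substr() expression
tab <- sqldf("select Sex, count(*) Deaths from DF group by Sex")
tab
```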
This all looks much cleaner and more parsimonious if you happen to be working with files that use a separator and have a column header!
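For that nicer case, sqldf ships a convenience wrapper, read.csv.sql(), where the file is referred to as `file` inside the query. A small sketch on a toy csv:

```r
library(sqldf)

# a toy delimited file with a header row
tmp <- tempfile(fileext = ".csv")
writeLines(c("Sex,Month,Cause",
             "M,01,21",
             "F,01,21",
             "M,02,05"), tmp)

# with real column names the query reads almost like plain English
tab <- read.csv.sql(tmp,
  sql = "select Sex, Cause, count(*) Deaths from file group by Sex, Cause")
tab
```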

NOTE: the Age column would require further parsing to be useful, as the first digit refers to the time unit (hours, days, months, years), and the next 3 digits are the count. Refer to the data documentation to sort through it!
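A sketch of what that parsing might look like. The unit codes and divisors below are illustrative placeholders, not the real NCHS codes, so check the documentation before using anything like this:

```r
# split the 4-character Age field into a unit digit and a 3-digit count,
# then convert to (approximate) years. PLACEHOLDER unit codes!
age_to_years <- function(age) {
  unit  <- substr(age, 1, 1)
  count <- as.numeric(substr(age, 2, 4))
  # assumed mapping: 1 = years, 2 = months, 4 = days, 5 = hours
  per_year <- c("1" = 1, "2" = 12, "4" = 365.25, "5" = 8766)
  unname(count / per_year[unit])
}
age_to_years(c("1035", "2006"))   # 35 years; 6 months = 0.5 years
```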