Code and Programming Blog

This space is maintained for blog-style entries about coding challenges and experiences. I will do my best to keep this page updated with timely snapshots of work related to coding, data engineering and analysis, and other related topics.

I also maintain a number of relevant pages/profiles on existing platforms (Kaggle, etc.), for which you can find links below:

Kaggle profile.

GitHub profile.

A novice's practical guide to text analysis in (mostly) R

Standing Summary

I am excited to finally be embarking on a useful and applied case of text analysis, mostly using the R programming language. I've tinkered around the edges of text analysis for a while, playing with sandbox datasets and trying out different data commands, but there is just no replacement for a good, old-fashioned project to make you dive into the weeds.

I'm going to use this space to document that process, and it's going to be pretty comprehensive. As in, I'm going to start by documenting how I was finally able to install the topicmodels package on my machine after quite a long time of head-banging (the lame kind, not the metal kind).

Working with (pretty) big data

Introduction

Data is great! It allows empirical social scientists like me to say things about the world. More precisely, it allows us to build and test models that themselves test some theory or theories.

Increasingly, the datasets that social scientists use to test theories are becoming very large indeed--so large, in fact, that they require some creative thinking about how to manage and analyze them. I'd like to share a bit about a recent experience working with one of these large datasets: the California voter file, a database of voters registered in California, including each voter's voting history and basic characteristics about them (and their home/community). (NB: I should note right off the bat that, despite professional data analysis experience, my work has mostly used data that was already set up in an existing environment, e.g., an on-prem Oracle DB or AWS. Kudos to y'all data engineers out there who do this stuff in your sleep.)

The Challenge

My main task was to ingest, (lightly) process, and then share a dataset of roughly 23 million rows and nearly 1,000 variables. This data was being delivered to another data source that was going to append additional data and then share back the resulting (final) dataset for analysis. The caveat is that this appending must be done anonymously such that the source's data cannot be tied back to the identifying characteristics/variables on the original dataset.

The key problem is two-fold: (1) the original voter file actually comes in two distinct, tab-delimited files and (2) one of these files greatly exceeds my computer's working memory despite some hardware enhancements made for past projects. So, I can't simply throw a join at the two files and then export as a single .CSV or something.

Attempts Were Made

Folks familiar with this brand of problem will also be familiar with all sorts of different approaches; indeed, between various cloud-based storage tools like Google Drive and Dropbox, AWS, and so on, the tools on offer today are numerous and robust. Here are the approaches I tried.

Loading in the Data

A nice fact about my data challenge here is that I didn't actually need every single variable in the dataset, though I did need most of them. However, despite a decent-enough codebook--something that isn't always available, which underscores the generalizability of the approach I'm about to describe--I didn't want to rely on a human-coded document that conceivably has a timestamp and, thus, could be outdated. My solution was to leverage `fread()` (from the data.table package) in R to grab all the columns available in the data and essentially (1) verify and (2) parse out the actual, existing variables. I thus read in about 1,000 rows from each of the two voter file .TAB files (one on voting history and one on demographics), grabbed all the columns from both, and then organized these columns into categories.

Here is what that looks like in code:
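(In rough, sketch form--the file names, column patterns, and category names below are placeholders rather than the actual voter file schema.)

```r
# Minimal sketch: file names, column patterns, and category names are placeholders
library(data.table)

# Read a small sample of each .TAB file just to recover the column names
vote_hist_sample <- fread("voter_history.tab", sep = "\t", nrows = 1000)
demo_sample      <- fread("voter_demographics.tab", sep = "\t", nrows = 1000)

vote_hist_cols <- names(vote_hist_sample)
demo_cols      <- names(demo_sample)

# Organize the columns into named categories (the patterns are illustrative only)
col_categories <- list(
  id_vars      = c("voter_id"),
  demo_vars    = grep("^(age|gender|party|zip)", demo_cols, value = TRUE),
  primary_vars = grep("primary", vote_hist_cols, value = TRUE),
  pre2010_vars = grep("200[0-9]", vote_hist_cols, value = TRUE)
)
```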

Why categories? Each category effectively serves as a subset of data that I am always able to pull up, access, and run analysis on. Having these categories of variables, I can now read in 100,000 rows of data--or 1 million, or 10 million, or all 23 million rows--but subset (via `fread`'s `select` argument) down to just the variables I actually need. For example, if I'm only interested in doing some analysis on elections, or instead just want to focus on primary elections, I can load in 23 million rows of the pre-2010 election columns or the primary election columns. In fact, this gives you the flexibility not only to run analyses on subsets of data, but also to export data as .CSV and other file types for preprocessing and analysis elsewhere (more on this in a bit).

Here is an example of that:
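(Same placeholder assumptions as above, reusing the named categories from the previous snippet: every row, but only the ID column plus a basic set of demographics.)

```r
# Sketch: read all rows, but only the columns in the selected categories
# (assumes data.table is loaded and col_categories exists from the snippet above)
csac_voting_part_vars_ing_test <- fread(
  "voter_demographics.tab",
  sep    = "\t",
  select = c(col_categories$id_vars, col_categories$demo_vars)
)
```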

At this point, I basically have a temporary dataframe (that I'm calling `csac_voting_part_vars_ing_test` in this case) that contains all the rows for some set of columns (a unique ID in the voter file and some basic set of demographics in this case). Great! Now comes the hairy part. I ended up trying a few different branches from here, so let me take them in order. 

Getting the data back out

Cranking out .CSVs in R

The first was to just spit out .CSVs from these temporary dataframes. That is, read in sets (or categories) of variables one at a time, output them as .CSV to some location, and then stitch them together somehow (more on this later). This approach worked surprisingly well from a reliability and scale standpoint. That is, for every category or set of variables I fed into R, I got back a neat (albeit colossal, and costly-to-produce) .CSV on a hard drive. (The code for this step is basically the same as the above, but see the endnotes if you want more detail).
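For completeness, the write step is essentially a single `fwrite()` call (the output path here is made up):

```r
# Sketch: write one category's worth of columns back out as a .CSV
fwrite(csac_voting_part_vars_ing_test, "output/demo_vars.csv")
```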

It was nice not to have to worry about anything breaking--and it was nice to be on fairly familiar footing in an R environment--but at the end of the day the process for each .CSV took way too long. To produce .CSVs containing even 15-20 columns of the 23 million-row dataset--ranging in file size from about 4 to about 14 GB--I'd need about 30-60 minutes per dataframe, if not longer. Not ideal.

Shifting gears to Python + Parquet

Doing some head-scratching and digging, I got the idea to try out Python. In particular, I was curious whether (1) Python would be faster at both the reading and the writing and (2) it could give me more leverage with different file types. Roughly speaking, I think the answer to both of these has been 'yes.' However, I ended up running into various encoding-related and other technical issues that were definitely much more annoying to troubleshoot than the `fread` hiccups. After some finagling, I was able to get the following Python code to read in the aforementioned .TAB files (still leveraging sets of columns as before) and spit out Parquet files, which I plan to read in and combine using standard Python/pandas tools (more on this later).

Here is the code for that:
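(Again in sketch form, with placeholder file names and column lists; the `encoding_errors` argument requires pandas 1.3+, and writing Parquet requires pyarrow or fastparquet to be installed.)

```python
# Sketch of the Python + Parquet step; file names and column lists are placeholders
import pandas as pd

# Each named category points at its source file and column list (placeholders)
categories = {
    "demo_vars":    ("voter_demographics.tab", ["voter_id", "age", "gender", "party"]),
    "primary_vars": ("voter_history.tab",      ["voter_id", "primary_2016", "primary_2018"]),
}

for category, (path, cols) in categories.items():
    df = pd.read_csv(
        path,
        sep="\t",
        usecols=cols,
        encoding="utf-8",
        encoding_errors="replace",  # pandas >= 1.3; one way to soften encoding errors
        low_memory=False,
    )
    # Writing Parquet relies on pyarrow (or fastparquet) being installed
    df.to_parquet(f"output/{category}.parquet", index=False)
```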

The Python approach was definitely not without its issues. Different runs of the same Python code would either run smoothly or, seemingly randomly--but probably based on some combination of the data and other stuff I had running in the background--break down with an error, almost always related to UTF encoding. It was a real bummer because Parquet files are definitely the way to go with this kind of thing in general, so I'm hoping to learn more about working with pandas + Parquet in particular. 

Vanilla Python approach

I didn't end up trying this, but there was probably a Parquet-less version of the above that could have worked well. Will definitely have to spend more time down this avenue in the future.
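For what it's worth, here is a rough, untried sketch of what that might look like: stream the .TAB file in chunks with pandas and append each chunk to a plain .CSV (file name and columns are again placeholders).

```python
# Untried sketch: chunked .TAB -> .CSV without Parquet; names are placeholders
import pandas as pd

reader = pd.read_csv(
    "voter_demographics.tab",
    sep="\t",
    usecols=["voter_id", "age", "party"],  # placeholder columns
    chunksize=500_000,                     # rows per chunk
)

for i, chunk in enumerate(reader):
    # Write the header only for the first chunk, then append
    chunk.to_csv("output/demo_vars.csv",
                 mode="w" if i == 0 else "a",
                 header=(i == 0),
                 index=False)
```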

The Ultimate Solution

All told, I ended up going with what had crossed my mind at the very beginning, but which apparently needed to go through the above failures to be crystallized. I decided to just iterate across sets of rows using the `skip` and `nrows` parameters in `fread`, grabbing a set number of rows from each of the two voter file datasets (voting history and demographics), joining on the voter file unique ID, and then outputting a .CSV containing these joined observations. It's definitely not a super performant solution--the code takes waaaay too long (I'll be working on ways to speed it up for future versions/projects, and am curious to hear thoughts)--and I don't know if it's elegant (though it is simple), but it definitely works.

Here is the function I wrote up to do the above:
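(Sketched out with placeholder paths, a placeholder `voter_id` column name, and placeholder column lists; joining within a row chunk assumes the two files list voters in the same order, so that the same row range covers the same voters.)

```r
# Sketch of the chunked read-join-write function; paths, the ID column name,
# and the column lists are placeholders rather than the real schema.
# (Assumes data.table is loaded; history_cols and demo_cols should both
# include the unique ID column.)
read_join_write_chunk <- function(skip_rows, n_rows,
                                  history_file, demo_file,
                                  history_cols, demo_cols,
                                  out_dir = "output/") {
  # Column names only appear on the header row, so recover them once and
  # translate the wanted names into column positions (kept in file order)
  history_names <- names(fread(history_file, sep = "\t", nrows = 0))
  demo_names    <- names(fread(demo_file, sep = "\t", nrows = 0))
  hist_idx <- sort(match(history_cols, history_names))
  demo_idx <- sort(match(demo_cols, demo_names))

  # Grab the same slice of rows from each file (skip the header row plus the
  # rows already processed), keeping only the wanted columns
  history_chunk <- fread(history_file, sep = "\t", header = FALSE,
                         skip = skip_rows + 1, nrows = n_rows,
                         select = hist_idx)
  setnames(history_chunk, history_names[hist_idx])

  demo_chunk <- fread(demo_file, sep = "\t", header = FALSE,
                      skip = skip_rows + 1, nrows = n_rows,
                      select = demo_idx)
  setnames(demo_chunk, demo_names[demo_idx])

  # Join the two pieces on the voter file's unique ID and write out the chunk
  joined <- merge(history_chunk, demo_chunk, by = "voter_id")
  fwrite(joined, paste0(out_dir, "voter_chunk_", skip_rows, ".csv"))
}
```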


You'll notice in the code above that I still got to use my named lists of columns, which was nice. And here is the code that I ran to actually generate the .CSVs (caveat emptor: I used a for loop, so you may want to avert your eyes or skip ahead*):
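(The chunk size, initial offset, and total row count below are made-up round numbers, and the column sets reuse the placeholder categories from earlier.)

```r
# Sketch of the loop; chunk size, initial offset, and total row count are
# placeholders (the real numbers produced just over 40 .CSVs)
chunk_size  <- 500000
first_chunk <- 500000    # rows already handled when initializing the output
total_rows  <- 23000000  # approximate size of the voter file

for (skip_rows in seq(first_chunk, total_rows - 1, by = chunk_size)) {
  read_join_write_chunk(
    skip_rows    = skip_rows,
    n_rows       = chunk_size,
    history_file = "voter_history.tab",
    demo_file    = "voter_demographics.tab",
    history_cols = c(col_categories$id_vars, col_categories$primary_vars),
    demo_cols    = c(col_categories$id_vars, col_categories$demo_vars),
    out_dir      = "output/"
  )
}
```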

*I should also note that this loop generates the n - x row dataset, where n is the number of rows in the voter file and x is the number of rows in an initial dataset I generated 'the old fashioned way' to initialize the object that was eventually output.

The end result is just over 40 individual .CSVs in a shared cloud environment, which are generated over the course of, well, many, many hours. Definitely room for improvement.

Conclusion

All in all, this was an interesting but fairly time-intensive process. Starting from a place of rough familiarity with some of this stuff did not pay off as much as I thought it would. I don't think I saved much time over just reading into R and spitting out .CSVs, though it was cool to learn about different file structures and think about how I might improve this in the future. Overall, I'm definitely still looking for a better solution for this kind of thing.

Hope this is helpful to folks starting off in a similar place on their big-ish data journey!