Julia's Blog - Applying Data Science to Research in Public Policy at Columbia

Applying Data Science to Research in Public Policy

June 26 - August 11 // Columbia University // Julia Guo '18

7/3/2017

Hey guys! I'm Julia Guo, a rising senior, and I'm spending 7 weeks of my summer at Columbia University's Statistics department, where I'm conducting data science research on the effectiveness of New York City's MWBE (Minority and Women-Owned Business Enterprise) certification program, under the mentorship of Jon Auerbach, a Statistics PhD student who has done a lot of research work for the city. Our clients are a couple of NYC departments - Small Business Services (SBS), Mayor's Office of Contract Services (MOCS), and Design & Construction (DDC). DDC runs Town+Gown, which is the research partnership between universities and NYC.

The MWBE program works by providing certified businesses with resources to help them grow, and better access to city contracts. It was created after the city was sued for discrimination - about 95% of the businesses the city contracts with are owned by white men, and historically the share going to everyone else was even lower, since the Italian mob and other groups had a lot of political clout back in the day. Read more about the program here. The program seems to be doing well and is often praised in the media and on subway advertisements. The city wants to know in more detail how well it's doing, though.

Jon and I suspect that the certification program may be less effective than it seems, and that confounding variables are actually the cause of the program's apparent success. Basically, we think that businesses that are already "good" are selecting into the program and skewing the outcomes. Quantifying the program's degree of success is really important since it's funded by NYC taxpayer money! (+ obviously because cities should strive to reduce discrimination as much as possible)

The internship basically entails a lot of me learning statistics and computer science (that pairing is data science in a nutshell) and working with big data to build statistical models, use machine learning to make inferences/predictions, draw conclusions, and write a paper. The data that we're using consists of info on all the businesses in New York State, all registered NYC MWBEs, and everyone who has ever contracted with the city - it's truly big data. At the end we're going to be presenting our work to the city, and hopefully we can influence them to enact policies/programs that better help minorities!

I'm really excited for this opportunity because it's interdisciplinary and gives me the chance to use STEM/CS to effect direct social change through policy, which is something that I'm passionate about and want to do in the future. From my experience, these types of non-lab, interdisciplinary STEM internships are somewhat rare among high schoolers, but opportunities are out there and very worthwhile, so I encourage you to seek them out if they sound interesting to you. :)

Over the next few weeks, I'll be blogging about all the happenings that take place during this internship. I hope you enjoy!

6/26/2017

This was my first day on the job! I was introduced to some of the PhD students and professors in the Stats department. They all come from really diverse backgrounds - Shanghai, Madrid, Mexico City, Seattle, Poughkeepsie, Brooklyn... - and seem like cool people. I spent the morning reading up on some very dense literature related to the MWBE project, and I also began reading through some statistics textbooks.

In the afternoon, I began learning R, which is the programming language that we are using. As a point of reference, R fills a similar niche to Python - these two are the open-source languages most commonly used in data science, along with Julia (go figure), a newer language. I've had some previous experience with R, but not much, so I'm eager to learn more! I really like it here in the department and I enjoyed this first day greatly. I'm excited for the coming weeks!

The Columbia School of Social Work, which the Stats department is in for some reason.

Low Library.

The sign at 116th Street Station. I'm always impressed by subway tilework.

6/27/2017

This morning I went downtown to City Hall for a meeting with the NYC Dept. of Education. Jon and some other PhD students had recently won a contest to most accurately project numbers of NYC kindergarteners in future years, for the purpose of rezoning school districts, so they went to present and discuss their data and models with the city. NYC's chief demographer, Joseph Salvo, was there, and it was really cool to get to meet him (I hope to meet more city figures like Bill de Blasio as well!!). Though I only acted as an observer during the meeting, I found it very interesting to listen to them discuss their methods and plans of action. I was also amused by how politics seemed to come into play at every moment, even for something small like this.

A cool dog I spotted by City Hall.

After we went back to the office, I read and studied more R and stats. I also began to look at the MWBE datasets and some preliminary code that Jon had written up, and started to experiment with testing hypotheses and manipulating the data.

So much data.......... this barely scratches the surface, as you can see by the small scrollbar

(+ I haven't imported ~60% of our files lol)

To my annoyance, I found that all our datasets were very messy - there were missing observations everywhere, a lot of typos, and extra whitespace (e.g. we had some businesses labeled as "Hispanic" and others labeled as " Hispanic ", "ispanic", etc.). I had to look up a lot of functions on StackOverflow (lol) to deal with these problems. For instance, gsub("x", "y", DATA) replaces every occurrence of the string "x" with the string "y" in the character vector DATA (e.g. a column of labels).
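To give a flavor, a few base R lines like these can knock out a lot of the whitespace and typo issues at once (the column and dataset names here are made up for illustration - this isn't our actual cleaning script):

    # Hypothetical example: clean up an "ethnicity" column in a data frame
    # called businesses (names invented for illustration).
    businesses$ethnicity <- trimws(businesses$ethnicity)             # " Hispanic " -> "Hispanic"
    businesses$ethnicity <- gsub("^ispanic$", "Hispanic",
                                 businesses$ethnicity)               # fix a known typo
    businesses$ethnicity[businesses$ethnicity == ""] <- NA           # treat blanks as missing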

6/28/2017 & 6/29/2017

Not very eventful, but I studied and read more, and did more with the data after I finished cleaning it up a bit. On a tangent - for fun, I've been trying a different place to eat at every day of the week. I've found that everything is delicious in NYC, especially food truck stuff! There's one Asian food truck that sits outside of the main gates to Columbia every day, Uncle Luoyang. I ate some really good shish kabobs from it on 6/29. I really recommend it :) All the restaurants on Broadway and Amsterdam Ave (the north-south streets bordering campus) - Sweetgreen, Junzi Kitchen, etc. - are delicious too!

6/30/2017

I've learned a lot of interesting functions in R, and with them I've been able to manipulate our MWBE data, test hypotheses, and make a lot of cool graphs! Here's an example:

Pretty!! (Partially censored because data is sensitive and not public.)

I think it's really cool that with R, I'm able to type a few lines of code and take hundreds of thousands, or even millions of rows of data and quickly turn them into neat, elegant, and informative displays! I'm excited to do more complex things (such as modeling and inference) with it as I continue to learn.

I've been reading this book, The Seven Pillars of Statistical Wisdom by Stephen M. Stigler. It has a lot of interesting stories about the history of statistics. For instance, there was a mathematician named John Arbuthnot who took baptism data and noticed that there were consistently more males than females each year. He came to the conclusion that this could only be possible because of "Divine Providence" (aka God). However, showing that the pattern isn't due to chance doesn't prove his explanation - he jumped straight to a cause without considering the alternatives. The excess is more likely just the naturally higher ratio of male to female births, not divine intervention. There are a lot of other anecdotes like this in the book, and I found them very entertaining - you should read it!

One week is over, and I'm heading home for the weekend.

7/3/2017 - 7/7/2017

Can't believe it's already July. I'm back at work and continuing to study and code. I investigated the data in more detail, and I found that when you separate businesses into different groups based on certain characteristics, you see a lot of interesting variation that's covered up when you simply plot curves based on averages. We see that certain groups of people are doing much better than others. Next week I'm meeting with some city people, and Jon and I will be discussing my findings with them.
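In code terms, "separating into groups" just means grouping before you summarize instead of averaging everything together. A tiny hypothetical sketch (with made-up column names):

    library(dplyr)   # part of the tidyverse

    # The overall average hides the differences; grouping reveals them.
    businesses %>%
      group_by(group) %>%                                   # e.g. some business characteristic
      summarize(success_rate = mean(outcome, na.rm = TRUE),
                n = n())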

The conference room where I learn R in the afternoons.

The view from the Statistics department.

7/10/2017

Just realized that I haven't written a lot about the actual stats/coding concepts that I've been learning (whoops)! To sum up, I've basically just been studying basic stats - random sampling, probability, p-values, chi-square (AP Bio!), stuff like that - and learning data manipulation and visualization in R from Hadley Wickham's book (he's the guy behind ggplot2 and the tidyverse, and chief scientist at RStudio, which makes the most popular IDE for R). I will try to talk more about this stuff this week, so stay tuned - it's pretty interesting!

I'm reading through this large pile of math books currently sitting on my desk, + two online books. It's a lot of reading.

7/11/2017

Time for coding lesson #1! I'll be explaining data visualization in R.

Here is another plot that I have made during this internship. I will be using it to explain some of the code I've done so far.

The very first thing you do when you start an R script is load packages (and install them first, if you haven't already).

Packages in R are just like packages or libraries in other languages (Java, etc.). They contain the functions and classes you need for certain tasks, and you can mostly treat them as black boxes - you don't have to look at the code that goes into them in order to use them. The tidyverse (really a collection of packages) is arguably the most important one in R. It allows you to:

    • "tidy up" messy datasets

    • perform operations on datasets to convert them into better formats

    • do data visualization

    • and so on.

It makes life a lot easier, since you would have to type many, many lines of code to get your data in a nice form without it.
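In code, that first step is just a couple of lines (you only need to install once per machine; loading happens at the start of every session):

    install.packages("tidyverse")   # one-time install
    library(tidyverse)              # loads ggplot2, dplyr, readr, tidyr, and friends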

Data visualization is much easier than everything that comes before it (aka getting the data into an actually graphable format), so it seems like the most logical place to start.

This is the code that went into the plot you see above:
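It looks roughly like this (I've swapped in placeholder variable names, since the real ones aren't public):

    ggplot(data = everything,
           aes(x = some_variable, y = success_rate, color = group)) +
      geom_smooth() +
      ylim(0, 1)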

The ggplot() function comes from ggplot2, which is part of the tidyverse. It's what is used to create plots. You can break this code down very easily:

    • The first line establishes that we are creating a plot using the dataset named "everything"

    • The second line establishes aesthetic aspects of the graph: x = some variable, y = some dependent variable, color = some other variable. Basically, aes() defines what is used for the x axis, y axis, etc.

      • As you can see, there is a lot of variation between different colored groups, which is why "color = " is used. If you didn't care about looking at the data at such a micro scale, you could just omit it.

    • The third line establishes that we are creating a best-fit smooth curve. There are a lot of different "geom_" functions, which all correspond to different types of graphs. For example, there exist geom_histogram, geom_point, etc.

    • The last line sets the y-axis bounds between 0 and 1.

As you can see, data visualization is relatively simple. This one graph was created in only 4 lines! However, data manipulation is very annoying, and I will be explaining some functions that are used for it in coding lesson #2 next week.

**Note 8/3: I've had to make graphs in Option II Java assignments, and I think they're a great example of why R is useful. A simple graph with only 4 data points took 84 lines of Java code (vs. 4 lines of R code for a graph with hundreds of thousands of data points).

7/12/2017-7/14/2017

On Thursday we traveled down to City Hall again for a meeting with NYC DDC. We presented all the plots that I've made so far, and our client was very impressed (she thought I was a PhD student lol)!

So far I have mostly just been plotting curves based on a single independent/x variable, but this obviously is not the most scientifically sound method. There are so many factors that influence success that simply plotting one at a time leaves a lot of room for inaccuracy - it doesn't give you the whole picture. Ideally, we would smush all the confounding variables together into a model (think of the model as a multivariate equation with different coefficients for x, y, z, etc.).

So, next week we are going to work on creating logistic curves, doing propensity score matching, stochastic blockmodeling, cohort studies, disparity studies, etc. I just threw a lot of jargon at you, but basically it's just more complicated stuff than what we've been doing so far. It'll help us better understand the causality in all of this. Some quick definitions:

  • Logistic curve: Consider a certain variable, and assign values of "0" and "1" to observations. 0 = no "treatment", 1 = "treatment" (e.g. treatment could be MWBE certification). Plot a logistic curve (S shape), with the 0-1 variable as your y-axis. The x-axis and general shape of the curve are based upon other selected, influential variables.

  • Logistic model: Same concept as logistic curves, but more complex and informative - think a bunch of logistic curves smooshed together. You can use these in data science/machine learning to take a training dataset and use it to do inference/make predictions, as well as account for confounding variables. This is what we are going to be doing. (There's a rough sketch of what fitting one looks like in R just below this list.)

  • Propensity score matching: Give all observations scores based on your model's predictions, and match observations with the same score across the treatment/non-treatment groups. The goal is to reduce the effect of confounding variables (comparing apples to apples instead of apples to oranges) and get closer to establishing causality.

  • Stochastic blockmodeling: Accounting for network confounding (when a certain network of people disproportionately benefits from the treatment and skews your data) by creating clusters/groups using a formula. Apparently, this method is really new and popular in the statistics field right now.

  • Cohort studies: Tracking a group that was "born" in a certain year and following it through time (e.g. looking at all businesses certified as MWBEs in 2011 and seeing if there is a dramatic change before and after certification)

  • Disparity studies: Quantifying the existence of disparity/inequality in a population

I'll probably explain more about these next week.
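To make the logistic model idea a bit more concrete, here's roughly what fitting a simple one looks like in R with the built-in glm() function (the variable names are made up, and our real models use more predictors than this):

    # Hypothetical logistic model: probability of winning a city contract as a
    # function of certification status plus a couple of business characteristics.
    model <- glm(won_contract ~ certified + revenue + business_age,
                 family = binomial(link = "logit"),
                 data = businesses)
    summary(model)                         # coefficients and standard errors
    predict(model, type = "response")      # predicted probabilities between 0 and 1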

7/17/2017

Back at work and beginning to dive into more complex statistics!

Here is a video about propensity score matching, which is the main method we are using in our causality study.

PSM procedure
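The matching step itself can be surprisingly short in code. Here's a rough sketch in R using the MatchIt package (just an illustration with made-up column names - not necessarily the exact package or code we'll end up using):

    library(MatchIt)

    # Match each certified (treated) business to a similar uncertified one,
    # based on propensity scores estimated from a few covariates.
    m.out <- matchit(certified ~ revenue + business_age + industry,
                     data = businesses, method = "nearest")
    summary(m.out)                # check covariate balance before vs. after matching
    matched <- match.data(m.out)  # the matched sample, used to compare outcomes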

7/21/2017

This week, I focused a lot on tidying my code (it was poorly documented and ~600 lines long). I removed everything that I didn't need, condensed some functions into better forms, and organized the script so that I could run it straight through without errors. It took a very long time because I kind of forgot why I wrote some of the code (this is why you should include comments in your code, kids), and a bunch of the datasets were horribly large - some were multiple GB, so each one took about 20 minutes just to load and process.

I then sent a copy over to Jon so we could discuss together how to modify our datasets to limit their size, the amount of missing data, and inaccuracy in our analysis. After working for a while, we turned those 2 GB files into 200 MB ones :) Everything looks good now - I think we're finally done with data manipulation!

Afterwards, I accomplished many of the things I listed previously - I plotted cohort graphs, made a bunch of logistic models, and wrote code to execute the stochastic blockmodel method.

I think the models/plots are really informative about the effects of the MWBE program and the issues it has.

Here is an example of one of our single-variable models, plotted (same color = equivalent variable). As you can see, in most groups there is a clear decreasing trend as you go along the x-axis. There is also a decent amount of variation between groups.

In addition to these, we also made a bunch of multilevel models (models based on multiple variables), which accounted for more variation. Sadly we can't plot those, because they would be 4- or 5-dimensional, which is basically impossible to visualize. They really corroborate our hypotheses (good and bad) about the program, which is great!

7/25/2017

Today and yesterday we made more/better models. Tomorrow, we will be comparing the residuals (error) and fit (accuracy) of these and then picking the best one, which we will use for our final PSM analysis.

Fun fact - to generate our multilevel models, we used Stan, which is a language for Bayesian inference that was actually invented by a team of people here at Columbia! That's convenient, since we can just go upstairs and ask the development team if we ever run into bugs/problems haha :).
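As a rough illustration, here's what fitting a multilevel logistic model through Stan can look like from R, via the rstanarm package (made-up variable names again - not our actual model):

    library(rstanarm)

    # Varying-intercept logistic model: each industry gets its own intercept,
    # which soaks up some of the between-group variation.
    fit <- stan_glmer(won_contract ~ certified + revenue + (1 | industry),
                      family = binomial(link = "logit"),
                      data = businesses)
    summary(fit)
    # loo(fit) gives a cross-validated measure of fit, handy for comparing candidate models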

Now all that's left is propensity score matching, which will be pretty quick - then it's time to write the research paper and present to the city!

7/26/2017

Refined our models more and did propensity score matching today!! Results look good.

*Apologies for super vague wording throughout this blog about our research. Can't give too much away about our findings, as they are sensitive.