Byte 6: Working with Big Data

Post date: Mar 23, 2014 2:00:26 AM

Description: Your final product will visualize and reason about public Vital Birth Statistics

Overview

To circumvent some of the difficulties (and costs!) of setting up a large data set, and because the course is primarily focused on analytics, we will be using Google's BigQuery for this assignment. You may find this article on getting started with Google BigQuery useful. We'll make use of the public Vital Birth Statistics data set that BigQuery provides (taken from the CDC under the following restrictions). A range of interesting questions can be explored in that data set. There is also a great tutorial on creating a graph of some data from this set that may be helpful to read. Our work will focus on visualizing some basic statistics and developing a classifier. We will be 'closing the loop' in this assignment, meaning that once the classifier is trained, we will let people use it within Google Appspot. Learning goals for this assignment include:

  • Working with a massive data set
  • Becoming familiar with BigQuery
  • Applying machine learning algorithms that can function with very large data
  • Thinking about how to close the loop and actually use a classifier we produce

The Tutorial

Before we can begin using BigQuery, you will need to create a [yourname]-byte6 application in Google Appspot using the same methods we've used in the other bytes, and turn on the BigQuery API in the Google Developers Console:

You will also need to install the Google Cloud SDK on your local machine, and add the following libraries to your appspot application: 'apiclient', 'httplib2', 'oauth2client' and 'uritemplate'. The simplest way to do this is to copy those directories over from [yourname]-byte3. Your app.yaml file should reference the following handlers and libraries (so that we can visualize results and so on):

handlers:
- url: /favicon\.ico
  static_files: favicon.ico
  upload: favicon\.ico
- url: /stylesheets
  static_dir: stylesheets
- url: /d3
  static_dir: d3
- url: /data
  static_dir: data
  application_readable: true
- url: .*
  script: main.app
libraries:
- name: webapp2
  version: "2.5.2"
- name: MySQLdb
  version: "latest"
- name: jinja2
  version: latest
- name: numpy
  version: latest
- name: django
  version: latest

Accessing the data set at the command line

To circumvent some of the difficulties (and costs!) of setting up a large data set in BigQuery (which limits how much data you can upload per day, and charges for data sets), we'll be using the public Vital Birth Statistics data set that BigQuery provides (taken from the CDC under the following restrictions). We can see that a range of interesting questions can be explored.

A quick way to try out different queries of interest is to use bq shell --project_id=XXXXX at the command line. Once you install the Google Cloud SDK you should have access to the program bq. In my case, I have to start a new shell in Terminal and then run two scripts Google provides inside the installation directory for the Cloud SDK ('completion.bash.inc' and 'path.bash.inc'). Once I run those scripts, I can start the shell at the prompt and see the following:

$ bq shell --project_id=jmankoff-byte6
Welcome to BigQuery! (Type help for more information.)
jmankoff-byte6> 

Now I can enter a query such as:

jmankoff-byte6>SELECT state, count(*) FROM [publicdata:samples.natality] GROUP by state

and the response looks something like the following table (elided for brevity). Note the large entry under NULL -- we never get away from data set problems, do we!

+-------+----------+
| state |   f0_    |
+-------+----------+
| AL    |  1957392 |
| AK    |   324183 |
| AZ    |  1837630 |
...
| WI    |  2294354 |
| WY    |   218735 |
| NULL  | 16998008 |
+-------+----------+

Creating a Cloud Storage Area of Your Own

If you wish to make use of Google's cloud database on your own (something that was not required in Byte 4, but is required if you want to use further storage in future projects, including this one), you will need to enable billing with Google. Simply click on the Cloud Storage Browser in the Google Developers Console

... and you will see a message asking you to enable billing.

You will need to click 'enable billing' and enter your credit card information. If you are concerned about costs, know that all the data stored in Byte 4's mobile data database has only cost me $30 so far this semester. The details of Google's pricing for Cloud SQL and for BigQuery differ, but since we are using a public BigQuery data set, we only need to worry about Cloud SQL pricing in this Byte.

Once you enable billing, you can create a new instance where we can store some data.

Let's call it 'birthstats'.

Accessing Data from Within Google Appspot

The command prompt is a useful place to experiment, but what we really want to do is access the same data from within Google Appspot, which means we will need to make use of OAuth 2.0. For this, we first need to know certain information about the application you have created. Start by clicking on [yourname]-byte6 in the Google Developer Console.

To get the credentials for OAuth2 click on "APIs & auth" and then 'Credentials'. From there you can click 'download JSON' to download a client secrets file.

You should also double check under permissions that the following permissions are enabled:

Note that the top email starts with a number. This is your 'project number' (you can also see it if you click on the 'Overview' tab). Make sure to record it in 'main.py' under BigQuery API settings:

# BigQuery API Settings
_PROJECT_NUMBER        = 'XXXXXXXXXXXX'

Next we will need to specify the correct credentials and set up the authorization. Since we are only using a public data set, we want the readonly credentials mentioned on Google's BigQuery Authentication Documentation:

# AppAssertionCredentials comes from oauth2client.appengine; build comes from apiclient.discovery
credentials = AppAssertionCredentials(scope='https://www.googleapis.com/auth/bigquery.readonly')
http        = credentials.authorize(httplib2.Http())  # requires import httplib2
service     = build("bigquery", "v2", http=http)

Now, assuming we have things set up for jinja2, in get(self) we can make a query and send the results to the 'index.html' template:

query     = {'query':'SELECT state, count(*) FROM [{0}] GROUP by state;'.format(_DATABASE_NAME), 'timeoutMs':10000}
# service is the oauth2 setup that we created above
jobRunner = service.jobs()
# project number is the project number you should have defined in your app
reply     = jobRunner.query(projectId=_PROJECT_NUMBER,body=query).execute()
# similar to the google SQL work we did in byte4, the stuff we care about 
# is in rows
rows = reply[u'rows']
states = []
for row in rows:
    name = row[u'f'][0][u'v']
    num = row[u'f'][1][u'v']
    # make sure we have a string in every case
    if name == None: name = u'None'
    state = {'state': unicode.encode(name), 'total': int(num)}
    states = states + [state]
context = {"states": states}
# and render the response
self.render_response('index.html', context)

Finally, we can display the results using jinja2:

<!DOCTYPE html>
<html>
<head>
  <link href="stylesheets/d3.css" rel="stylesheet" type="text/css">
  <meta charset="UTF-8">
  <title>Data Pipeline Project Byte 6 -- Big Data</title>
</head>
 <body>
   <h1>Data Pipeline Project Byte 6</h1>
   <h2>Query Results</h2>
   <ul>
   {% for state in states %}
      <li>{{state['state']}} : {{state['total']}}</li>
   {% endfor %}
   </ul>
 </body>
</html>

An excerpt from the results:

Debugging Hint

Unfortunately, the oauth2 queries will only work on the Appspot server. To avoid having to do all of our testing after uploading to the server each time, we can capture the query result from the server log and save it in 'data/states.json' (provided with your source code). Then we can modify our application to check whether or not we are running on the server and either run the original query or just load the 'states.json' file depending on our status. Eventually when we go interactive this will not work, but it will help tremendously at this stage. I have created a function run_query(query_string, filename), which decides whether to run the query or load a file depending on where the code is running:

 if (os.getenv('SERVER_SOFTWARE') and
     os.getenv('SERVER_SOFTWARE').startswith('Google App Engine/')):
     query = {'query':query_string, 'timeoutMs':10000}
     jobRunner = service.jobs()
     return jobRunner.query(projectId=_PROJECT_NUMBER, body=query).execute()
 else:
     # open the data stored in a file called filename
     try:
        fp = open(filename)
        return simplejson.load(fp)
     except IOError:
        logging.info("failed to load %s" % filename)
        return None
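
A usage sketch (hypothetical, but it mirrors the query used earlier; 'data/states.json' is the cached copy described above):

reply = self.run_query('SELECT state, count(*) FROM [{0}] GROUP by state;'.format(_DATABASE_NAME),
                       'data/states.json')
if reply is not None:
    rows = reply[u'rows']  # parsed exactly as before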

I had some trouble creating 'states.json' by hand so eventually I pasted it into a variable using python at the command line (e.g. data = <paste>) and then typed:

>>> import json
>>> fp = open("data/states.json", 'w+')
>>> json.dump(data,fp)

Some additional debugging ideas: I find it useful to debug my python JSON parsing at a command line prompt locally (by grabbing the JSON produced by queries from the server log after uploading my application for the first time). Equally useful is to debug your SQL queries using the shell mentioned above. This will save you many iterations (and uploads when working on the server) and make it much easier to identify and fix problems. For example, I copied the query results into a 'data' variable at my python prompt and then went through the following sequence of debugging exercises:

>>> rows = data[u'rows']
>>> row = rows[0]
>>> row
{u'f': [{u'v': u'AL'}, {u'v': u'1957392'}]}
>>> state = row[u'f'][0][u'v']
>>> num = row[u'f'][1][u'v']
>>> "state %s num %s " % state, num
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: not enough arguments for format string
>>> "state %s num %s " % state num
  File "<stdin>", line 1
    "state %s num %s " % state num
                                 ^
SyntaxError: invalid syntax
>>> "state %s " % state 
u'state AL '
>>> unicode.decode(state)
u'AL'
>>> unicode.encode(state)
'AL'
>>> num = row[u'f'][1][u'v']
>>> int(num)
1957392

This last bit came after my web page at [yourname]-byte6.appspot.com returned an error. By looking at my log printout of each state and number in the server log, I traced the problem to the very last state returned and realized that the empty state values were returning some sort of special value instead of a string:

>>> if state = None: state = u'None'
  File "<stdin>", line 1
    if state = None: state = u'None'
              ^
SyntaxError: invalid syntax
>>> if state == None: state = u'None'
... 
>>> state
u'AL'
>>> state = None
>>> if state == None: state = u'None'
... 
>>> state
u'None'

Homework Part 1: Visualizing your data

At this point, you have all of the information you need in order to query and visualize historical data about births in the U.S. since the 1970s. Rather than walk you through things you already know how to do, I will simply encourage you to play around and discover some interesting new things. You can read up more on the BigQuery API for Python, and explore the sample queries to get some ideas for what to do with the data. You don't absolutely have to stick with the birth data (other public data sets are available); however, the second half of this assignment is designed around using machine learning on the same data, so you may find it valuable to do so.
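
For example, once run_query is in place, a query along these lines (a sketch only; it assumes the natality table's year and weight_pounds columns and reuses the _DATABASE_NAME global from earlier) would give you average birth weight per year to visualize:

# one possible exploration: average birth weight by year
query = ('SELECT year, AVG(weight_pounds) FROM [{0}] '
         'WHERE weight_pounds IS NOT NULL GROUP BY year ORDER BY year;').format(_DATABASE_NAME)
# 'data/weights.json' is a hypothetical cache file for local testing
reply = self.run_query(query, 'data/weights.json')
if reply is not None:
    rows = reply[u'rows']  # same row/field structure as the states query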

Machine Learning with Big Data

For the machine learning part of this assignment, you will attempt to predict the health of a baby being born. The health of babies is measured using an Apgar test, which is rated from 0 (unresponsive and not breathing) to 10 (perfect). We will use 7 as a cutoff for 'unhealthy'. In order to make your life much, much easier, I have implemented a number of helper functions for the machine learning work you need to do. The key ones you should look at in the code, all well commented, include the following (a rough sketch of how they fit together appears after this list):

  • get_data (fetches a data set you can use for optimization, testing, or training and returns it as two arrays: the feature array and the labels. get_data is also used to get a test set. We separate the test set and the training set by assigning odd rows from the database to one and even rows to the other, and we use a random number as the index to ensure that we get different rows each time we run it.)
  • make_query_string (Helper function that makes a string based on the features and predict value that can retrieve data for get_data. If you want to use non-numeric features, you will have to modify this function).
  • run_query (you already heard about this)
  • apgar_1min_test (labels an apgar value as 'healthy' (1) or 'unhealthy' (0))
  • loss_func (this is the standard loss function discussed in class. You won't need to modify it)
  • gradient_descent (this is the standard logistic regression gradient descent function discussed in class. You shouldn't need to modify it)
  • stochastic_gradient_descent (this is the function that collects examples of X and y labels and calls gradient_descent on them repeatedly to find an approximate minimum. If you want to experiment with the regularization parameter, you might consider making a similar function which pulls down a single optimization set and then tries a bunch of different lam values on it. However, you should not need to directly modify this function, except possibly to change the batch size to something smaller so things run quickly.)
  • apply_model (this function applies a machine learning model to a test data set that is passed in to it. it calculates the accuracy and returns it).
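
A rough sketch of how these helpers fit together (the argument lists here are inferred from the excerpts below and may not match the provided code exactly, so treat this as orientation rather than copy-paste material):

# train a logistic regression model on batches drawn from BigQuery ...
model, weights = self.stochastic_gradient_descent()
# ... fetch a held-out test set (test_string is a placeholder for whatever
# selection string the provided helpers expect) ...
X_test, y_test = self.get_data(features, predict, prediction_test, test_string)
# ... and measure how well the model does on it
accuracy = self.apply_model(model, X_test, y_test)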

You may also want to make note of these global variables:

  • _MAX_ROWS is the maximum number of rows that will be drawn from the database in any given query
  • features is the set of features that will be drawn from the database
  • _APGAR_CUTOFF is the cutoff used for the apgar value
  • _TIMEOUT controls how long it is ok to wait for things to finish without causing an error

The gradient_descent algorithm normally loops repeatedly over the same data set. This is an appropriate way to train logistic regression using gradient descent. However, a better approach for big data is to select a different random subset each time we loop over the data (i.e. to train over small batches). This does require additional queries, but it has the nice property that we do not have to load the entire data set into memory in order to complete the learning process. It also works nicely in situations where parallelization is possible. This is a variant of stochastic gradient descent, which is supported by mainstream toolkits (including scikit-learn). A great tutorial on stochastic gradient descent, with Python code that you can use, can be found on deeplearning.net.
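
For reference only (it is not used in this Byte, since our App Engine code relies on the provided helpers rather than scikit-learn), the same minibatch idea looks roughly like this in scikit-learn, where batches is a hypothetical iterable of (X, y) minibatches:

from sklearn.linear_model import SGDClassifier

# logistic regression trained by stochastic gradient descent, one minibatch at a time
clf = SGDClassifier(loss='log', penalty='l2')
for X_batch, y_batch in batches:
    clf.partial_fit(X_batch, y_batch, classes=[0, 1])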

Note that the heart of logistic regression is a simple loop in which we repeatedly recalculate the weights (up to a maximum number of times) and check whether we have reached a stop condition. The following is sample code for stochastic gradient descent, where we use one example to train our model in each iteration. You can also try batch gradient descent, training your model on all the examples you have retrieved in each iteration.

for i in xrange(iterations):
  for r in xrange(_MAX_ROWS):
    # run the next iteration of gradient descent
    weights = old_weights + step*(self.loss_func(X[r], old_weights) - y_labels[r])*X[r]
    # regularization step
    weights = weights - lam*2.*step*old_weights

Note that a more efficient implementation of this could take advantage of numpy's matrix manipulation abilities. That is left as an (optional) exercise for the student in this instance and throughout the sample code.
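
For the curious, here is a sketch of what that vectorized inner update could look like, assuming loss_func(X[r], w) computes the logistic of the dot product and keeping the same sign convention as the loop above (an illustration, not the code shipped with the Byte):

import numpy as np

logistic = lambda s: 1.0 / (1.0 + np.exp(-s))

# one pass over the whole batch at once
preds   = logistic(np.dot(X, weights))                     # predicted values for every row
weights = weights + step * np.dot(X.T, preds - y_labels)   # the per-row update, summed over rows
weights = weights - lam * 2. * step * weights              # regularization, applied once per pass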

In stochastic gradient descent we add an outer loop around gradient_descent:

for i in xrange(batch):
  # collect some good sample data
  bad_births, bad_births_labels = self.get_data(features, predict, prediction_test, zeros_string)
  good_births, good_births_labels = self.get_data(features, predict, prediction_test, ones_string)
  y_labels = np.concatenate([bad_births_labels, good_births_labels])
  X = np.concatenate([bad_births, good_births])
  # and run gradient descent on it
  weights = self.gradient_descent(weights, X, y_labels)

# return a model at the end
model = lambda x: self.loss_func(x, weights) < .5
return model, weights

Note the importance of the loss function both in the gradient descent algorithm and in building our final model. As discussed in class, it is fairly straightforward:

logistic = lambda s: 1.0 / (1.0 + numpy.exp(-s))

We can use this to calculate the accuracy in apply_model. Or we can simply make a single prediction, if we want to put this online for others to use.
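
A minimal sketch of that single-prediction case (it assumes weights came back from stochastic_gradient_descent and x is one numeric feature vector in the same order as the features global):

import numpy as np

def predict_one(x, weights):
    p = 1.0 / (1.0 + np.exp(-np.dot(x, weights)))  # the logistic function above
    return p < .5                                  # the same test the model lambda uses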

Your job is to use this code to further explore how predictable a baby's health is (you can switch to a different data set if you want, of course; it is probably best to use another one freely available on BigQuery). My accuracy is around 88% so far on the data I've played with (with a more extensive set of features than I've given you). Specifically, you should do the following:

  1. Implement a zeror learner which returns a model that can be used to calculate zeror accuracy using apply_model, for comparison to the regression model
  2. Extend apply_model to also calculate precision and recall
  3. Show the results on your web page (you may want to consider how to explain the weights in terms of which features are important). Extra kudos if you check for a significant difference between zeror and your regression model and report that too.

In some ways the specifics of the implementation don't matter -- we are simply fitting to data and then predicting using what we learned. In other ways they matter quite a bit, however. Let's look at our data and how we connect it to this algorithm.

Handin Expectations

The handin expectations for this assignment are based around your working byte6:

1) Your byte should show some exploration of the data set (beyond the table already included with the assignment), and show the results of the zeror and regression learners for accuracy, precision and recall.

2) Your byte should show the regression results as described above

Going Further (non required source of fun ideas)

There is lots more fun stuff you can do with this Byte. For example, you could improve the speed by using better matrix math and/or adding ajax support for showing incremental results as the stochastic gradient descent algorithm is running (you'll want to significantly decrease _MAX_ROWS and increase the batch size in that case). Or you could implement a user interface that lets someone set parameters themselves (as if they were a pregnant mom) and see the prediction.

You could also create an optimization set and use it to try some things that improve the accuracy further (depending on your coding skill, you could play with a different data set, play with different values for the regularization parameter lam, try adding different features, add support for non-numeric features, and so on).

Google BigQuery provides access to several public data sets. One is GitHub's commits (including the comments). This blog post looks at the emotions in those comments.

There are lots of other data sets one could upload to a service like BigQuery. One example data set that could be a lot of fun to work with is described in this paper on Meme tracking in the global news cycle. The meme dataset can be downloaded from SNAP, and a description of the structure is there as well. It's quite simple: each file includes the URL of the document (labeled P), the time of the post (labeled T), one or more key phrases (labeled Q for quote), and zero or more hyperlinks to other documents in the data set (labeled L).

Before the data can be loaded into BigQuery, it needs to be formatted according to the BigQuery guidelines. One key requirement is that it be denormalized, which is luckily already a property of this data set. However, it is not in JSON format, so the first step is to write a quick script to convert it.
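
A sketch of what such a converter might look like, assuming the P/T/Q/L lines described above are tab-separated and records are separated by blank lines (BigQuery can load newline-delimited JSON):

import json
import sys

def convert(infile, outfile):
    # write one JSON object per line (newline-delimited JSON) for BigQuery
    record = {'quotes': [], 'links': []}
    with open(infile) as fin, open(outfile, 'w') as fout:
        for line in fin:
            line = line.rstrip('\n')
            if not line:  # a blank line ends the current record
                if 'url' in record:
                    fout.write(json.dumps(record) + '\n')
                record = {'quotes': [], 'links': []}
                continue
            label, _, value = line.partition('\t')
            if label == 'P':
                record['url'] = value
            elif label == 'T':
                record['time'] = value
            elif label == 'Q':
                record['quotes'].append(value)
            elif label == 'L':
                record['links'].append(value)

if __name__ == '__main__':
    convert(sys.argv[1], sys.argv[2])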

How could you improve on the user experience? What visualizations might you add? How could you improve on the speed?

If you want to use Google's Prediction API, you can do quite a bit even without scikit-learn (including online learning of regression). To use it, however, you have to move subsets of the data from BigQuery storage into regular cloud storage.