Byte 2: Exploring Data

  • Description: Your final product will be an exploration of a data set of your choice ([yourname]-explore.appspot.com)
  • Source Code: See https://github.com/jmankoff/data, in Assignments/jmankoff-explore.
  • Hand In: Canvas (see questions at end of assignment)

Overview

In this project, you will create a new application that explores static data available online through Google Fusion Tables (tutorials, examples, and public tables). This assignment has the following learning goals:

  • Accessing data
  • Displaying information using Jinja (useful for later Bytes and Projects)
  • Using SQL to query Google Fusion Tables
  • Using new data formats including CSV and JSON
  • Reading and writing from files
  • Conducting data clean up and exploratory data analysis

Detailed Instructions for Explore Data Byte

Although we created a visualization in the previous Byte, we do not yet have programmatic access to the data. Our next step is to download data from Fusion Tables into Python. To do so, we will make use of the Google Cloud Console. There is excellent getting started documentation for the api_client_library that you should look over.

Register your Application and Download the API Keys

Go to the Google Cloud Console and make sure you are in the correct project (the one for this Byte). Then find the API Manager area and turn on the Fusion Tables API (hint: you may have to search for it in the Google Cloud Console). You will also need to create 'Credentials' for accessing the data from a web server (an 'API key'). I named mine 'jmankoff-explore'. Leave the box for IP addresses blank and then press OK. This will generate a new key. (Ideally we would restrict the IP addresses; to do so you would need to know the IP addresses of your local computer and of Google's appspot servers. That is left as an exercise for the reader.)

Be sure to save the resulting key! You can just copy it off their website and into your code. It should be something like: XXxxXxXxNXXXxXxxNxNNxxXxxXNxN_XNxXXXNX

We'll store the key in our code.

API_KEY = 'XXxxXxXxNXXXxXxxNxNNxxXxxXNxN_XNxXXXNX'

Using Discovery to Handle API Access

Since we already have the API key, all we really have to do is make sure that we use it every time we request data. Since we are using one of Google's own APIs (instead of a third party's), we can leverage much more powerful libraries than last time around to access data. In particular, we will follow the description in the getting started introduction for building a 'service object'.

Before we start, let's make sure we've included the correct set of libraries in our application. To move forward, we will need to use a number of libraries taken from several different locations: Google-installed libraries can be found in webapp2_extras (documentation for webapp2_extras) or in google.appengine.api (documentation for google.appengine). Additional Google-supported APIs are at https://developers.google.com/api-client-library/python/apis/. Additionally, we sometimes have to "manually" install libraries, including some provided by Google. You should familiarize yourself with these libraries, as you may find yourself needing them at times over the course of the semester. For Fusion Table access, you will need:

  • json will be used to decode certain response objects (from SQL queries of Google Fusion Tables). JSON is an open, text-based standard for data, similar to XML in its uses.
  • jinja2 (as with the previous Byte)
  • urllib (a standard library for working with urls)
  • build (from apiclient.discovery). To get this: first, download the Google API Python Client, unzip it, and copy the subdirectory 'googleapiclient' into the [yourname]-explore project directory. Build depends on some other Python libraries. You will need to (see also the appengine_config.py sketch just after this list):
    • download six and move the file 'six.py' into the lib directory of your application
    • download httplib2 and move the directory 'httplib2' (found inside the directory 'python2') into the lib directory of your Python application
    • download uritemplate and move the directory 'uritemplate' into the lib directory of your Python application
    • download oauth2client and move the directory 'oauth2client' into the lib directory of your Python application
    • add the following to the libraries portion of your app.yaml file (along with webapp2 and jinja2)
      • - name: ssl
      • version: latest
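Depending on how your starter project is set up, App Engine may also need to be told to look in the lib directory for the packages you just copied there. If your copy of jmankoff-explore does not already contain an appengine_config.py, a minimal sketch looks like this (an assumption about your project layout, not part of the original starter code):

# appengine_config.py -- tell the App Engine Python runtime to also
# search the local 'lib' directory for third-party packages.
from google.appengine.ext import vendor

vendor.add('lib')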

Now, set up your application (main.py) with the correct set of libraries:

# Imports
import os
import jinja2
import webapp2
import logging
import json
import urllib

# This is used for constructing URLs to Google's APIs
from googleapiclient.discovery import build

Now we can prepare to connect to fusion tables:

# This uses discovery to create an object that can talk to the 
# fusion tables API using the developer key
service = build('fusiontables', 'v1', developerKey=API_KEY)

At this point in the tutorial we need to have in mind specific things that we want to do with the Fusion Tables API. We will need to know the 'address' of the table we created. To find this, open your Fusion Table's File menu and click on About This Table. The ID at the bottom is the key you will need to access the data programmatically (it will be something like: NxxxNXxXxxNxXXXXNXxXXXxXxxxNxXxNxXxxXxxX).

Store it in a variable:

TABLE_ID = 'NxxxNXxXxxNxXXXXNXxXXXxXxxxNxXxNxXxxXxxX'

Now you can begin to explore the information you can retrieve from the table. The API structure is provided as nested method calls in the service library (as documented in the tutorial on building and calling a service). You'll want to look at the documented set of API calls available as part of the APIs explorer to find out what specific method (API) calls you can make. For example,

request = service.column().list(tableId=TABLE_ID) 

will get a list of column names for TABLE_ID.
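To actually retrieve the list, execute the request. Here is a rough sketch (the 'items' key below is an assumption based on the Fusion Tables columnList response format; check the APIs explorer output for your table):

# Execute the request built above; the result is a Python dictionary
# parsed from the JSON response.
result = request.execute()

# Assuming the standard columnList format, each entry under 'items'
# describes one column (name, type, columnId).
column_names = [col['name'] for col in result.get('items', [])]
logging.info(column_names)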

It can be a bit confusing to tell which API calls your service actually supports. First, remember that the parameters are case sensitive (so tableID, for example, will fail). Second, other parts of the documentation may mislead you about what is supported. For example, the documentation on "Using the API" makes it look as though you should specify service.tables(tableID=TABLE_ID), but this will not work correctly (or even compile).

Using the data

Now that we have the data, let's explore it. First, be aware that the data is in JSON format, which Python supports well using a dictionary structure. We'll want to get a feel for the data using a few different tools:

  • Summarizing the raw information in different ways using descriptive statistics
  • Graphing the data
  • Asking specific questions for which we expect certain answers, or that quantify our intuition
  • Re-organizing the data into different groupings that might be more informative

While there are other tools we will develop as the course goes on, these are an excellent starting place.

First, make sure you understand the structure of the data. Each row is a different animal that arrived at the pound. What do the columns mean? What range of values do they have? Are those values categorical? Numerical? Text? Take the time to look at each column. You can use Google Fusion Tables to create a chart summarizing each column, one by one. When you do, what discoveries do you make? You will probably quickly find that the Fusion Tables interface leaves much to be desired. The table itself is too large to get a sense of by scrolling through it. So what alternatives do we have?

We will download the data into Python to explore it. Let's go ahead and create a query that will get this information, using SQL, a special-purpose language for querying relational databases. There is a Fusion Tables version of SQL, and we'll be using the SELECT command. The format for a query is:

https://www.googleapis.com/fusiontables/v1/query?sql=query
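To make that format concrete, here is a rough sketch of issuing such a query as a raw HTTP request with urllib. This is for illustration only and assumes the standard 'sql' and 'key' query parameters; in the rest of this Byte we use the service object instead:

# Illustrative only: build the query URL by hand and parse the JSON reply.
query = "SELECT * FROM " + TABLE_ID + " LIMIT 2"
url = ('https://www.googleapis.com/fusiontables/v1/query?' +
       urllib.urlencode({'sql': query, 'key': API_KEY}))
response = json.loads(urllib.urlopen(url).read())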

Let's download some sample data for the first two dogs in the table (note that your column names will likely be different):

def get_all_data(self):
    query = "SELECT * FROM " + TABLE_ID + " WHERE  AnimalType = 'DOG' LIMIT 2"
    response = service.query().sql(sql=query).execute()
    logging.info(response['columns'])
    logging.info(response['rows'])
        
    return response

And look at the results in the log:

INFO     2014-01-07 19:31:41,439 main.py:88] [u'Outcome', u'AnimalID', u'AnimalType', u'IntakeDate', u'IntakeYear', u'IntakeMonth', u'Name', u'Breed', u'Age', u'Sex', u'SpayNeuter', u'Size', u'Color', u'IntakeType', u'OutcomeSubtype', u'ZipFound', u'Latitude', u'Longitude', u'ZipPlaced', u'OutcomeDate', u'OutcomeYear', u'OutcomeMonth', u'icon type', u'IconName']
INFO     2014-01-07 19:31:41,439 main.py:89] [[u'', u'A346562', u'DOG', u'2011-04-16T00:00:00Z', u'2011', u'April', u'DIXIE', u'', u'Older than 1 year', u'MALE', u'true', u'MED', u'BLACK', u'Owner Surrender', u'', u'40218', 38.1938, -85.6573, u'NaN', u'2011-04-16T00:00:00Z', u'2011', u'April', u'NaN', u''], [u'', u'A364651', u'DOG', u'2011-03-11T00:00:00Z', u'2011', u'March', u'KAYLEE', u'YORKSHIRE TERR - POODLE TOY', u'Older than 1 year', u'FEMALE', u'true', u'TOY', u'BLACK', u'Impound', u'', u'40202', 38.2529, -85.7514, u'40218', u'2011-04-01T00:00:00Z', u'2011', u'April', u'NaN', u'']]

All of those 'u' annotations mean that we are dealing with Unicode strings (with the exception of the latitude and longitude data). Some of these strings are legitimate (the column names can all remain strings; the AnimalID and AnimalType should be strings; and so on). Let's start with some exploratory data analysis; keep in mind, though, that some values would be easier to manipulate if they were converted to other formats, such as the dates (see the small parsing sketch below).
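As a rough illustration of that kind of conversion (the example value is copied from the log output above, and we assume the date columns keep this ISO format), a date string can be parsed into a Python datetime like this:

from datetime import datetime

# Hypothetical example: parse one IntakeDate value copied from the log
# output above; the format string matches 'YYYY-MM-DDTHH:MM:SSZ'.
intake_str = u'2011-04-16T00:00:00Z'
intake_date = datetime.strptime(intake_str, '%Y-%m-%dT%H:%M:%SZ')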

Showing the data on the web

We'd like to be able to see what has been loaded from Fusion Tables, and the log file is not entirely satisfactory. So we're going to also make sure we can see it on the web. One of the most powerful aspects of Jinja is its ability to display dynamic information provided by Python. We can pass one or more variables to Jinja by placing them in context in the appropriate handler (handler for index.html in main.py in this case):

return template.render(headers=cols, content=rows)
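For context, here is a rough sketch of what that handler might look like, assuming a JINJA_ENVIRONMENT set up as in the previous Byte; the class name, helper name, and template path are illustrative and may differ in your project:

class MainHandler(webapp2.RequestHandler):
    def get(self):
        self.response.write(self.render_table())

    def render_table(self):
        # Ask Fusion Tables for a couple of rows, then pull out the
        # pieces the template needs.
        query = "SELECT * FROM " + TABLE_ID + " WHERE AnimalType = 'DOG' LIMIT 2"
        response = service.query().sql(sql=query).execute()
        cols = response['columns']
        rows = response['rows']

        # Render index.html, passing the headers and rows as context.
        template = JINJA_ENVIRONMENT.get_template('templates/index.html')
        return template.render(headers=cols, content=rows)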

Next, update the 'index.html' file to show the information. Since we are using Bootstrap, you should consult the Bootstrap table reference for details on how the table is represented in HTML. Note the use of {% ... %}; this indicates logic that should be executed (in this case a for loop). The contents of {{ ... }} are replaced with their value.

<div class="table-responsive">
  <table class="table table-striped table-hover">
    <thead class="thead-inverse">
      <tr>
        {% for item in headers %}
        <th>{{ item }}</th>
        {% endfor %}
      </tr>
    </thead>
    <tbody>
      {% for row in content %}
      <tr>
        {% for datum in row %}
        <td>{{ datum }}</td>
        {% endfor %}
      </tr>
      {% endfor %}
    </tbody>
  </table>
</div>

The resulting output is a striped HTML table with a header row and one row for each animal returned by the query.

The table that we just created is useful for getting a sense of your data, but it can be unwieldy to work with. In addition, it is not the best way to visualize your data. Instead, it is often better to perform data clean up outside of the application and then use clean data in your application.

Choose your adventure: conducting data clean up and exploratory data analysis

The remainder of this assignment is concerned with characterizing the data, but you must first choose how you want to clean up and explore your data. Below you will find two options. The first option (Beginner to Intermediate level) uses a GUI application to load, clean up, and explore your data. The other option (Expert level) uses a Python Notebook and a number of Python libraries. Choose your adventure based on your expertise. Note that you should choose only one option!

Exploratory Data Analysis in Wrangler (Beginner-Intermediate Level)

In this adventure you will use Wrangler to wrangle your data. The Wrangler page has a video tutorial to help you get started. Make sure you watch and follow this tutorial to get familiar with the tool.

The first step is to download the data from your Fusion Table. You can do this by opening your table in the web browser, and then clicking on the "Download..." button in the File menu. Select CSV and save the file to your local machine. Then load the data into Wrangler and clean up your data using the Wrangler UI.

Once you are done cleaning up the data, export it by clicking on the "Export" button and selecting Comma Separated Values. Then load the exported file into a new Fusion Table (using the Fusion Tables UI and selecting "New Table" from the File menu).

Then use the same visualization technique as in Byte 1 to display your new Fusion Table data in your appspot application.

Exploratory Data Analysis in Python (Expert Level)

In this adventure you will use a Google Data Labs Notebook to clean up and explore the data. To do this, we will create a Python Notebook just for ourselves (not for use in your Google appspot application). We will also want to download the data locally so that we can write it to a file and then experiment easily and iteratively with it. Next, we will want to summarize the data. Finally, we may want to visualize aspects of it.

To create a Python Notebook, we will use Google Data Labs (Google's version of Jupyter), which connects to your Google Cloud Platform projects. Google enables you to run the notebook both locally and in the cloud (mostly for large jobs), but the cloud version may incur charges. For this Byte we will stick with the local version. Please follow this tutorial to install and configure your local Google Data Labs.

Once you install and run your local Google Data Labs instance (HINT: remember to connect it to your Google Cloud Project for this byte), you can connect to it by going to http://localhost:8081/. Note that all of the documentation and tutorials on how to use Google Data Labs will be included with the installation and accessible as notebooks. You can access these resources by going to http://localhost:8081/tree/datalab/docs

Once you are familiar with the basics of Python Notebooks, go to the notebooks directory and start a new notebook (you could name it "[your name]-exploration").

It is relatively easy to download the Fusion Table data in JSON format and store it in a file. This will be the first part of our exploratory analysis, done in our new notebook. Note that there are some small changes in what we import. You may have to install some of these libraries if your Python cannot see them by default; typically "pip install [library]" will work.

import csv
import httplib2
from apiclient.discovery import build
import urllib
import json
# This API key is provided by google as described in the tutorial
API_KEY = '...'
# This is the table id for the fusion table
TABLE_ID = '...'
try:
    fp = open("data.json")
    response = json.load(fp)
except IOError:
    service = build('fusiontables', 'v1', developerKey=API_KEY)
    query = "SELECT * FROM " + TABLE_ID + " WHERE AnimalType = 'DOG' LIMIT 10"
    response = service.query().sql(sql=query).execute()
    fp = open("data.json", "w+")
    json.dump(response, fp)

The try/except here checks whether the data has already been downloaded by trying to open the file 'data.json'. If it has, the code simply loads the data out of the file. If not, it connects to Google Fusion Tables and downloads it, just as we do in our regular Google appspot code. At the end of all of this, response contains the parsed JSON as a dictionary, exactly as in our regular Google appspot code. Once you are sure this is working, delete 'data.json', remove the words "LIMIT 10" from your code, and you can download the entire data set. If you print len(response['rows']) you can find out how many rows are in the data set.
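Before moving to heavier tools, you can already compute simple descriptive statistics straight from the parsed response. A small sketch (the 'IntakeType' column name is taken from the log output above; substitute whatever column interests you):

from collections import Counter

# Find the position of a column of interest, then count its values.
intake_index = response['columns'].index('IntakeType')
intake_counts = Counter(row[intake_index] for row in response['rows'])
print intake_counts.most_common(5)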

Now that you have a way to download the data, you will want to use Python to help you explore it. However, JSON data is notoriously difficult to view, so we will transfer the data into Pandas (a Python library made specifically for data analysis) and create a Pandas data frame. First, import Pandas and related libraries into your notebook (Pandas is installed by default); we also import Seaborn here, since we will use it for plotting shortly:

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

Then load the JSON data into a data frame:

data_df = pd.DataFrame(response[u'rows'], columns = response[u'columns'])

and display a few rows to make sure everything works:

data_df.head()

Pandas allows you to quickly access your data. Here is an example of how to access two columns from the original data frame:

lat_lon_df = data_df[['Latitude', 'Longitude']]
lat_lon_df.head()

We can also quickly clean up the data. For example, we can convert Latitude and Longitude to numeric values and remove empty values (all in one line). The code below applies the to_numeric() function to each column in the data frame, coercing anything non-numeric to NaN, and then drops all rows containing NaN values:

lat_lon_df = lat_lon_df.apply(lambda x: pd.to_numeric(x, errors='coerce')).dropna()

Now that we have the data in a data frame, we can use Seaborn (a Python library for statistical data visualization) to visualize and explore it. Seaborn also comes installed with Google Data Labs. For example, use Seaborn to display a pairplot, which visualizes Latitude and Longitude as both individual histograms and scatter plots:

sns.pairplot(lat_lon_df)

Although you won't be able to use exactly the same code for your data set, this walkthrough should give you an idea of some things you can do in your own analysis. The full notebook can be found as 'explore.ipynb' in the GitHub repository for assignment 2 and is carefully commented. You can also find ideas for different Seaborn plots in this gallery.

Once you have the plots, save them as images and include them in your appspot application.
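As a minimal sketch of that last step (the filename here is arbitrary; sns.pairplot returns a grid object with a savefig method):

# Save the pairplot as a static image that the appspot application can serve.
grid = sns.pairplot(lat_lon_df)
grid.savefig('lat_lon_pairplot.png')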

Questions you should be able to answer when you hand this in

When you are done, you should be able to answer these questions and provide evidence to support your answers. You should do this in your appspot application in the 'about.html' and 'quality.html' files. You may use hand drawn sketches, generated plots, and live interactive graphics to illustrate your answers.

  • Is the data complete? If not, what is missing and what concerns does it raise?
  • Is the data coherent?
    • Does it make sense relative to itself?
    • Does it match your expectations?
    • Is the data distributed across its possible values in a way that makes sense?
  • Is the data correct?
    • What aspects of this data can you verify?
    • What did you find?
  • Is the data accountable?
    • Where did it come from?
    • Who has access to it?