Byte 2 v1
- Description: Your final product will be an exploration of an animal adoption data set ([yourname]-byte2.appspot.com)
- Due date: 1/30 (8pm)
- Hand In: Blackboard (see questions at end of assignment)
Overview
In this project, you will create a new application that collects data from Louisville Animal Metro Services. This assignment has the following learning goals:
- Using static data from Google Fusion Tables, a tool from Google Research (tutorials)
- Learning to use a key to access an API
- Using SQL to query Google Fusion Tables
- Using new data formats including csv and JSON
- Reading and writing from files
- Conducting exploratory data analysis
Detailed Instructions for Byte 2
Using Google Fusion Tables
We will use a table of animal outcomes from Louisville Animal Metro Services that includes data from January 2011 forward. This data has already been imported as a publicly available Google Fusion Table. Because it is not your own data, even if you log in, you will not be able to access all of the features of Google Fusion Tables. For this reason, you should make a copy (under the file menu) before proceeding. Click "View Copy" and proceed with the new table for the remainder of this assignment.
Exploring your data in Google Fusion Tables
Start by following the tutorial for making a map (note that we can skip the first half of the tutorial since the data is "imported" as soon as you make your copy). When I was done, I had the following display on my map:
Now let's suppose we want to mark animals that were euthanized with a different marker than those that were not. You can do this by creating a table that associates outcome type with marker types, as described in this tutorial. After following the tutorial, my map looked like this (you can find lots of interesting icon types at this change placemark icon tutorial).
Maps are only the beginning -- you can also explore the data using other types of charts:
Showing your visualizations in your google application
The visualizations you create can be embedded in the [yourname]-byte2 application. You will need to configure the fusion table to be publicly accessible, following the tutorial on embedding (mine is accessible only to those with the correct url). You will also need to set up your byte2 app to use jinja2 and templates (similar to byte1). Once you have that set up, you can paste the iframe code provided with your new map or chart into your index.html file. When I was done mine looked like this (the artifact at left is just other stuff behind my web browser). The source fusion table is publicly available here.
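If your byte2 app is not yet set up for templating, here is a minimal sketch of a webapp2 handler that renders index.html (where the pasted iframe code lives) using jinja2. This mirrors the byte1 setup; the handler name is illustrative.

import webapp2
from webapp2_extras import jinja2

class MainHandler(webapp2.RequestHandler):
    @webapp2.cached_property
    def jinja2(self):
        # a jinja2 renderer, cached in the app registry
        return jinja2.get_jinja2(app=self.app)

    def get(self):
        # renders templates/index.html, which contains the pasted iframe code
        self.response.write(self.jinja2.render_template('index.html'))

app = webapp2.WSGIApplication([('/', MainHandler)], debug=True)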
Debugging as you go
Partway through this assignment, I discovered that we cannot have column names with special characters in them in Fusion Tables when we access them programmatically. For this reason, rename 'Estimated Age' to 'Age', 'Outcome Type' to 'Outcome', and 'Animal Type' to 'AnimalType' in the Google Fusion Tables interface (the code later in this assignment assumes these names). This may seem like a strange requirement, but many bugs arise for these sorts of reasons, and that's why debugging systematically so that you check all of your assumptions (and discover the assumptions of others) is so important.
The work we are doing is especially complex because it involves multiple moving parts. It is helpful to know about several options for tracing what is happening:
- When you are exploring solutions to a problem of this sort, we highly recommend calling logging.info(...) frequently to double-check that things are proceeding as you expect (see the sketch after this list).
- Don't forget to view the log itself, which contains not only your printouts but also any exceptions and other errors that arise.
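For example, a trivial sketch of a checkpoint (the message and variable are placeholders):

import logging

rows = []  # placeholder for data you just fetched
logging.info('fetched %d rows', len(rows))  # this line shows up in the app engine log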
Collecting data from fusion tables
Although we have now created a visualization, we do not yet have programmatic access to the data. Our next step is to download data from fusion tables into Python. To do so, we will make use of the Google API Client Library. There is great getting started documentation for the api_client_library that you should look over.
Register your Application and download the OAuth keys
Go to the Google API Console and click on the [yourname]-byte2 project:
Then click on "APIs & auth" and find the fusion table API in the list of APIs and turn it "On". Next, you need to select "Credentials" and click "Create a New Key" under "Public API Access":
We will click the "Server Key" button and leave the box for IP addresses blank and then press OK. This will generate a new key (ideally we would restrict the IP address, and to do so you would need to know the IP addresses for your local computer and for google's appspot servers. That is left as an exercise to the reader).
Google makes it really easy for us to get this API key -- we can just copy it off their website and into our code. Open up the "Server Key" menu and copy over the API key (not shown below), which should be something like: XXxxXxXxNXXXxXxxNxNNxxXxxXNxN_XNxXXXNX
We'll store the key in our code.
API_KEY = 'XXxxXxXxNXXXxXxxNxNNxxXxxXNxN_XNxXXXNX'
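The key works like a password, so if you plan to put your code in a public repository you may prefer not to hard-code it. One option (a sketch, assuming a local file named api_key.txt that you keep out of version control; the filename is hypothetical):

# read the key from a local file that is not checked into version control
with open('api_key.txt') as key_file:
    API_KEY = key_file.read().strip()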
Using Discovery to Handle API Access
Since we already have the API key, all we really have to do is make sure that we use it every time we request data. Since we are using one of Google's own APIs (instead of a third party's) we can leverage much more powerful libraries than last time around to access data. In particular, we will follow the description in the getting started introduction for building a 'service object'.
Before we start, let's make sure we've included the correct set of libraries in our application. We will need libraries from several different locations: Google-installed libraries can be found at webapp2_extras (documentation for webapp2_extras) or at google.appengine.api (documentation for google.appengine), and additional Google-supported APIs are listed at https://developers.google.com/api-client-library/python/apis/. We also sometimes have to "manually" install libraries, including some provided by Google, similarly to how we installed feedparser in the last project. You should familiarize yourself with these libraries, as you may find yourself needing them at times over the course of the semester. For Fusion Table access, we will be using:
- json (from webapp2_extras), which will be used to decode certain response objects (from SQL queries of google fusion tables). JSON is an open, text-based standard for data, similar to XML in its uses.
- jinja2 (from webapp2_extras), which we also need to import
- urllib (a standard library for working with urls)
- build (from apiclient.discovery). To get this: first, download the Google API Python Client GAE, unzip it, and copy the subdirectories into the [yourname]-byte2 project directory.
Now, set up your application with the correct set of libraries:
# standard imports you should have already been using
import webapp2
import logging
from webapp2_extras import jinja2
import urllib
# this library is for decoding json responses
from webapp2_extras import json
# this is used for constructing URLs to google's APIS
from apiclient.discovery import build
Now we can prepare to connect to fusion tables:
# This uses discovery to create an object that can talk to the
# fusion tables API using the developer key
service = build('fusiontables', 'v1', developerKey=API_KEY)
At this point in the tutorial we need to have in mind specific things that we want to do with the fusiontables API. We will need to know the 'address' of the table we created. To find this, open your fusion table's File menu and click on About This Table. The ID at the bottom is the key you will need to access the data programmatically (it will be something like NxxxNXxXxxNxXXXXNXxXXXxXxxxNxXxNxXxxXxxX).
Store it in a variable:
TABLE_ID = 'NxxxNXxXxxNxXXXXNXxXXXxXxxxNxXxNxXxxXxxX'
Now you can begin to explore the information you can retrieve from the table. The API structure is provided as nested method calls in the service library (as documented in the tutorial on building and calling a service). You'll want to look at the documented set of API calls available as part of the APIs explorer to find out what specific method (API) calls you can make. For example,
request = service.column().list(tableId=TABLE_ID)
will get a list of column names for TABLE_ID
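Note that the list call only builds a request object; you still need to execute it to contact the API. A minimal sketch (assuming the documented columnList response shape, with an 'items' array of column descriptions):

# build the request, then execute it to actually contact the API
request = service.column().list(tableId=TABLE_ID)
response = request.execute()
# print the name of each column in the table
for column in response['items']:
    logging.info(column['name'])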
It can be a bit confusing to tell what the correct API calls supported by your service will be. First, it is important to remember that the parameters are case sensitive (so tableID, for example, will fail where tableId works). Second, if you look at other parts of the documentation, you may be confused about what is supported. For example, in the documentation on "Using the API" it looks as though you would need to specify service.tables(tableID=TABLE_ID), but this will not work correctly (or even run).
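For reference, a sketch of the pattern that does work with the discovery-based client (in the v1 discovery document the collection is the singular table(), and tableId ends in a lowercase d; the get call fetches a table's metadata):

# fetch the table's metadata (name, column definitions, and so on)
table_info = service.table().get(tableId=TABLE_ID).execute()
logging.info(table_info['name'])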
Exploring the data
Now that we have the data, let's explore it. First, be aware that the data is in JSON format. This format is well supported by python using a dictionary structure. We'll want to get a feel for the data using a few different tools:
- Summarizing the raw information in different ways using descriptive statistics
- Graphing the data
- Asking specific questions that we expect certain answers to, in order to quantify our intuition
- Re-organizing the data into different groupings that might be more informative
While there are other tools we will develop as the course goes on, these are an excellent starting place.
First, make sure you understand the structure of the data. Each row is a different animal that arrived at the pound. What do the columns mean? What range of values do they have? Are those values categorical? Numerical? Text? Take the time to look at each column. You can use Google Fusion Tables to create a chart summarizing each column one by one. When you do, what discoveries do you make? You will probably quickly find that the fusion tables interface leaves much to be desired. The table itself is too large to get a sense of by scrolling through it. So what alternatives do we have?
We will download the data into Python to explore it. Let's go ahead and create a query that will get this information, using SQL, a special-purpose language for querying relational databases. Fusion Tables supports its own version of SQL, and we'll be using the SELECT command. The format for a query is:
https://www.googleapis.com/fusiontables/v1/query?sql=query
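To make that format concrete, here is a sketch of issuing the query by hand, escaping the SQL into the url with urllib and fetching it with httplib2 (the service object we built above does the equivalent for us, so this is illustration only):

import json
import urllib
import httplib2

query = "SELECT * FROM " + TABLE_ID + " LIMIT 2"
url = ('https://www.googleapis.com/fusiontables/v1/query?' +
       urllib.urlencode({'sql': query, 'key': API_KEY}))
http = httplib2.Http()
headers, content = http.request(url)  # returns (response headers, response body)
response = json.loads(content)        # the same dictionary the service object returns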
Let's download some sample data (the first two dogs in the table)
def get_all_data(self):
    query = "SELECT * FROM " + TABLE_ID + " WHERE AnimalType = 'DOG' LIMIT 2"
    response = service.query().sql(sql=query).execute()
    logging.info(response['columns'])
    logging.info(response['rows'])
    return response
And look at the results in the log:
INFO 2014-01-07 19:31:41,439 main.py:88] [u'Outcome', u'AnimalID', u'AnimalType', u'IntakeDate', u'IntakeYear', u'IntakeMonth', u'Name', u'Breed', u'Age', u'Sex', u'SpayNeuter', u'Size', u'Color', u'IntakeType', u'OutcomeSubtype', u'ZipFound', u'Latitude', u'Longitude', u'ZipPlaced', u'OutcomeDate', u'OutcomeYear', u'OutcomeMonth', u'icon type', u'IconName']
INFO 2014-01-07 19:31:41,439 main.py:89] [[u'', u'A346562', u'DOG', u'2011-04-16T00:00:00Z', u'2011', u'April', u'DIXIE', u'', u'Older than 1 year', u'MALE', u'true', u'MED', u'BLACK', u'Owner Surrender', u'', u'40218', 38.1938, -85.6573, u'NaN', u'2011-04-16T00:00:00Z', u'2011', u'April', u'NaN', u''], [u'', u'A364651', u'DOG', u'2011-03-11T00:00:00Z', u'2011', u'March', u'KAYLEE', u'YORKSHIRE TERR - POODLE TOY', u'Older than 1 year', u'FEMALE', u'true', u'TOY', u'BLACK', u'Impound', u'', u'40202', 38.2529, -85.7514, u'40218', u'2011-04-01T00:00:00Z', u'2011', u'April', u'NaN', u'']]
All of those 'u' annotations mean that we are dealing with unicode strings (with the exception of the latitude and longitude data, which are numbers). Some of these strings are legitimate (the column names can all remain strings; the AnimalID and AnimalType should be strings; and so on). However, others would be easier to manipulate if they were in other formats (such as the dates). Let's start, however, with some exploratory data analysis.
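For example, the date columns are easier to work with once parsed into datetime objects. A minimal sketch (the format string matches the timestamps in the log output above):

from datetime import datetime

# the date columns look like u'2011-04-16T00:00:00Z'
intake = datetime.strptime('2011-04-16T00:00:00Z', '%Y-%m-%dT%H:%M:%SZ')
print intake.year, intake.month, intake.day   # prints: 2011 4 16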
Exploratory Data Analysis in Python
The remainder of this assignment is concerned with characterizing the data. To do all of this, we will create a python file just for us (not for use in your google appspot application). We will want to download the data locally and write it to a file, so that we can experiment with it easily and iteratively. Next, we will want to summarize the data. Finally, we may want to visualize aspects of it.
Downloading the Data
It is relatively easy to download JSON data and store it in a file. This will be the first part of our new exploratory analysis file, which we could store in '[yourname]-explore.py' and execute at the command line by calling "python [yourname]-explore.py". Note that there are some small changes in what we import (at least on my local machine, webapp2 is not visible to command line python). You may have to install some of these libraries if your python cannot see them by default; typically "pip install [library]" will work.
import csv
import httplib2
from apiclient.discovery import build
import urllib
import json
# This API key is provided by google as described in the tutorial
API_KEY = '...'
# This is the table id for the fusion table
TABLE_ID = '...'
try:
    fp = open("data.json")
    response = json.load(fp)
except IOError:
    service = build('fusiontables', 'v1', developerKey=API_KEY)
    query = "SELECT * FROM " + TABLE_ID + " WHERE AnimalType = 'DOG' LIMIT 10"
    response = service.query().sql(sql=query).execute()
    fp = open("data.json", "w+")
    json.dump(response, fp)
The try/except here checks whether the data is already downloaded by trying to open the file 'data.json'. If the data is downloaded, it simply loads it out of the file. If not, it connects to google fusion tables and downloads it, just as we do in our regular Google appspot code. At the end of all of this, response contains the parsed json as a dictionary, exactly as in our regular Google appspot code. Once you are sure this is working, go ahead and delete 'data.json', remove the words "LIMIT 10" from your code, and you can download the entire data set.
If you print len(response['rows']) you can find out how many rows are in the data set (remember, add this to the end of your explore.py code and run python explore.py to see the output).
Let's start, however, by simply printing out the number of occurrences of each unique value for each column. Note that we have removed the "LIMIT" from the query for this exercise. In addition, through careful iteration (looking at the output and then modifying the code repeatedly) I came up with an ignore list of columns I will not include in this demonstration analysis. Some are empty or unique for every row. Others are just left out to simplify this analysis.
summary = {}                   # this will be our summary of the data
columns = response['columns']  # the names of all columns
rows = response['rows']        # the actual data

# we'll ignore some columns
ignore = [u'Outcome', u'AnimalID', u'AnimalType', u'Name', u'IconName',
          u'IntakeDate', u'OutcomeDate', u'Latitude', u'Longitude', u'Breed']

for i in range(3, len(columns)):  # loop through each column (the first three are all ignored anyway)
    if columns[i] in ignore: continue
    answers = {}  # will store unique values for this column
    for row in rows:
        key = row[i]
        # convert any string values to ascii, and any empty strings
        # to a string called 'EMPTY' we can use as a key
        if type(key) is unicode: key = row[i].encode('ascii', 'ignore')
        if key == '': key = 'EMPTY'
        try:  # increase the count if the key already exists
            answers[key] = answers[key] + 1
        except KeyError:  # or set it to 1 if it does not exist
            answers[key] = 1
    summary[columns[i]] = answers  # store the result in summary

print summary
The result is as follows (leaving out the zip code values for found and placed):
{u'IntakeType': {'Foster Return': 228, 'Reported Lost by Public': 144, 'Stray': 5501, 'Disposal Request by Public': 28, 'Out Patient Spay or Neuter': 107, 'Owner Surrender': 3192, 'Reported Found by Public': 12, 'Impound': 423}, u'Color': {'BLUE SMOKE': 1, 'GOLD': 82, 'UNKNOWN': 142, 'YELLOW': 150, 'BLUE MERLE': 31, 'BLACK': 2925, 'BLUE TICK': 7, 'BLONDE': 4, 'RED': 328, 'BROWN': 1766, 'RED MERLE': 11, 'BEIGE': 1, 'BR BRINDLE': 459, 'CHOCOLATE': 153, 'BLUE CREAM': 3, 'EMPTY': 2, 'BL BRINDLE': 113, 'ORANGE': 1, 'BRINDLE': 2, 'TAN': 852, 'BUFF': 34, 'FAWN': 131, 'GRAY': 209, 'APRICOT': 17, 'TRICOLOR': 440, 'RUDDY': 1, 'SABLE': 19, 'WHITE': 1425, 'SILVER': 39, 'BRN TIGER': 1, 'BLUE': 146, 'BLK SMOKE': 2, 'MERLE': 1, 'LIVER': 4, 'BRN MERLE': 16, 'CREAM': 117}, u'Age': {'Older than 7 years': 1250, 'Infant - Younger than 6 months': 1143, 'Older than 1 year': 3794, 'Youth - Younger than 1 year': 915, 'EMPTY': 2533}, u'OutcomeYear': {'2011': 5547, '2012': 4088}, u'Sex': {'UNKNOWN': 82, 'MALE': 5199, 'FEMALE': 4354}, u'SpayNeuter': {'false': 3931, 'true': 5704}, u'OutcomeMonth': {'February': 745, 'October': 495, 'March': 1039, 'August': 1048, 'May': 972, 'January': 667, 'June': 982, 'September': 889, 'April': 998, 'December': 454, 'November': 386, 'July': 960}, u'IntakeYear': {'2011': 5772, '2012': 3863}, u'OutcomeSubtype': {'TV Other': 26, 'Ky Humane Society': 664, 'Radio Other': 2, 'Petsmart Outerloop': 2, 'TV Metro': 1, 'TV Whas': 23, 'Foster': 73, 'Barkstown': 1, 'Enroute to Shelter': 22, 'Internet': 81, 'Animal Control': 7, 'Rescue Group': 1535, 'EMPTY': 2147, 'TV Wlky': 12, 'Requested': 864, 'Petsmart Hurstborne': 95, 'Web PF': 318, 'Exchange': 16, 'Pregnant': 6, 'Time and Space': 443, 'Web Metro': 34, 'Friend': 273, 'At Vet': 5, 'Facebook': 16, 'In Surgery': 59, 'Heartworm': 1, 'WEB': 1, 'Breed': 135, 'Refferal': 77, 'Web Pet Harbor': 2, 'Medical': 720, 'In Kennel': 16, 'Petsmart West': 3, 'In field': 119, 'Behavior': 691, 'Walk In': 733, 'Promotion': 213, '3rd Party': 39, 'at Foster': 29, 'Newsletter': 6, 'Event': 30, 'Rescue Waggin': 95}, u'IntakeMonth': {'February': 824, 'October': 329, 'March': 1038, 'August': 1056, 'May': 1003, 'January': 844, 'June': 968, 'September': 720, 'April': 978, 'December': 416, 'July': 1026, 'November': 433}, u'Size': {'MED': 3297, 'TOY': 277, 'LARGE': 2472, 'X-LRG': 216, 'SMALL': 1929, 'PUPPY': 1425, 'EMPTY': 19}}
This is a little difficult to read, but if we take something specific like age, we can see pretty clearly that there are a lot of empty values which we will have to figure out how to deal with (2,533 of the 9,635 dogs have no age estimate), and that about half of the animals with estimated ages (3,794 of 7,102) are older than 1 year and younger than 7 years.
u'Age': {'Older than 7 years': 1250, 'Infant - Younger than 6 months': 1143, 'Older than 1 year': 3794, 'Youth - Younger than 1 year': 915, 'EMPTY': 2533}
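One quick way to make the raw summary easier to scan is to pretty-print it with the standard json module (a small sketch; indent and sort_keys are standard json.dumps parameters):

import json

# indent puts each key/value on its own line; sort_keys orders the columns alphabetically
print json.dumps(summary, indent=2, sort_keys=True)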
This format is still fairly difficult to deal with, so I have created a sample script which will write some of these summary statistics out as csv files. It can be found as 'explore.py' in the github repository for assignment 2 and is carefully commented. In addition to making csv files and printing out statistics, 'explore.py' uses pyplot to graph some of the numeric data (if you close the first plot window the next will appear). For example, here is a scatterplot of the range of latitude and longitude values in the data:
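If you want to try rolling your own before reading explore.py, here is a condensed sketch of the same two ideas. This is not the actual explore.py: the output filename age.csv is illustrative, and the code assumes the summary, columns, and rows variables defined above.

import csv
import matplotlib.pyplot as plt

# write the per-value counts for one column out as a csv file
def write_csv(filename, answers):
    fp = open(filename, 'wb')            # 'wb' avoids blank rows on windows
    writer = csv.writer(fp)
    writer.writerow(['value', 'count'])  # header row
    for value, count in answers.items():
        writer.writerow([value, count])
    fp.close()

write_csv('age.csv', summary[u'Age'])

# scatterplot latitude against longitude, skipping rows without numeric coordinates
lat = columns.index(u'Latitude')
lon = columns.index(u'Longitude')
points = [(row[lon], row[lat]) for row in rows
          if type(row[lat]) is float and type(row[lon]) is float]
plt.scatter([p[0] for p in points], [p[1] for p in points])
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()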
You will need to explore the resulting csv files and potentially modify the python code directly to complete this assignment. We recommend you at least read through the python code and its comments, which provide some hints for further analysis and use some techniques you may want to apply later as well.
Questions you should be able to answer when you hand this in
When you are done, you should be able to answer questions such as the following (some of these questions will be asked at hand-in time, and you will need to provide evidence to support your answers; you can see the questions on Blackboard):
- Is the data complete? If not, what is missing and what concerns does it raise?
- Is the data coherent?
  - Does it make sense relative to itself?
  - Does it match your expectations?
  - Is the data distributed across its possible values in a way that makes sense?
- Is the data correct?
  - What aspects of this data can you verify?
  - What did you find?
- Is the data accountable?
  - Where did it come from?
  - Who has access to it?