In this project, you will create a new application that displays data retrieved from Twitter. You will need to use OAuth to access this information. For this assignment we are not yet going to deal with language (a common use of Twitter data); instead, we would like to explore the numerical side of data analysis. To do so, we will count tweets that match a search string. This assignment has the following learning goals:
Byte 5 will build on the concepts used in Bytes 1-4. You should start by setting up a second Google application, called [yourname]-byte5.appspot.com.
We will be using Twitter's search API, a RESTful API. We are going to use Twitter's application-only authentication, which is much simpler than application + user authentication and gives us access to search tweets. To gain access to the Twitter API, you first need an authorization (a special password just for this application) from Twitter; you will need a Twitter account to obtain one. There are several types of authorization supported by Twitter. For our purposes, registering an application at dev.twitter.com will work well.
Go ahead and create your application; call it [yourname]-byte2. When you create this application, you will be asked to provide a URL for it. You can use 'http://[yourname]-byte2.appspot.com' (without the quotes, of course). For further details on how to set up applications, see the instructions provided by Twitter (you can always go back and change these settings). When you are done creating the application, you should have a pair of tokens called CONSUMER_KEY and CONSUMER_SECRET. Be sure to save both.
There are a number of Python libraries for Twitter. However, a learning goal of this project is for you to be able to access any API, not just Twitter's, so we will use lower-level libraries and do the authentication ourselves. To move forward, we will need libraries from several different locations: Google-installed libraries can be found in webapp2_extras (documentation for webapp2_extras) and google.appengine.api (documentation for google.appengine). Additional Google-supported APIs are at https://developers.google.com/api-client-library/python/apis/. Additionally, we sometimes have to "manually" install libraries, including some provided by Google, similarly to how we installed feedparser in the last project. You should familiarize yourself with these libraries, as you may find yourself needing them at times over the course of the semester. For Twitter access, we will be using the httplib2 and urllib libraries.
When you are done setting all of this up, the header of your main.py file should look something like this:
import base64
import webapp2
import logging
from webapp2_extras import jinja2
from webapp2_extras import json
import httplib2
import urllib
The work we are about to embark on can be somewhat complex and involves multiple moving parts, so it is helpful to know about several options for tracing what is happening. For example, call
logging.info(...)
frequently to double check that things are proceeding as you expect.

For our implementation, we will follow the documentation on Twitter for application-only authentication, which lays out the process for requesting authorization very clearly. To keep things simple, we will store the key information we need to access Twitter in four global variables. Note that this is not a very secure way to do things; loading them from a "secrets" file, for example, would be an improvement.
# this is the URL we will use to get our authorization token (stage one of the authorization)
TWITTER_REQUEST_TOKEN_URL = 'https://api.twitter.com/oauth2/token'
# this is the search URL we will use at the end
TWITTER_STREAM_API_PATH = 'https://api.twitter.com/1.1/search/tweets.json'
# this is the set of keys we need to get our authorization token
TWITTER_CONSUMER_KEY = 'HxmQOAxoU9fnmrLOx7rbw'
TWITTER_CONSUMER_SECRET = '9awHN0NEsFuVcwZHaLqKIjPi9r6FIN2D52rcgw8'
In addition, we need to create an http client which is also going to be global so that we can send 'POST' and 'GET' requests:
# this is how we will send requests to twitter to get our authorization token
http = httplib2.Http()
The next step is to define one function for each step of the OAuth process. Note that right now we are doing no error checking, but it would be easy to add by looking at the "response" portion of each HTTP request. The first function will be called:
def oauth_step1_get_authorization_token(self):
It is a straightforward setup of the headers described in the Twitter tutorial for application-only authorization, including specific text for 'Content-Type', 'Accept-encoding', and 'Authorization'. The most complex part, shown below, is setting up the authorization header correctly:
credentials = "{0}:{1}".format(TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET)
credentials = base64.b64encode(credentials)
headers['Authorization'] = "Basic {0}".format(credentials)
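As a sanity check, this credential construction can be verified offline. The sketch below uses dummy values (not real Twitter keys) and is written to run under Python 2 or 3, which is why it encodes and decodes explicitly:

```python
import base64

def make_basic_auth_value(key, secret):
    # join key and secret with ':' and base64-encode the result,
    # as the application-only auth flow requires
    credentials = "{0}:{1}".format(key, secret)
    encoded = base64.b64encode(credentials.encode('utf-8')).decode('ascii')
    return "Basic {0}".format(encoded)

# dummy credentials, NOT real Twitter keys
auth_value = make_basic_auth_value('dummy_key', 'dummy_secret')
```

Decoding the base64 portion of `auth_value` should recover exactly 'dummy_key:dummy_secret'.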
Once all the headers are set up, we simply call
resp, content = http.request(token_request_url,
'POST', headers=headers, body=content)
The returned content will be text in serialized JSON format. We need to decode it and retrieve the access token. You may want to read up on how Python supports JSON for storing and accessing content once it is decoded.
tokens = json.decode(content)
return tokens['access_token']
Note that the access token (tokens['access_token']) in this type of OAuth does not change. You could store it in a file and retrieve it, and unless you specifically ask Twitter to revoke the token, it will keep working.
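The decoding step can be tried offline using the standard library's json module (webapp2_extras' json wraps the same functionality). The payload below is made up, with the shape described in Twitter's application-only auth documentation; the token value is fake:

```python
import json

# a made-up response with the shape Twitter's application-only
# auth endpoint returns; this token is not real
sample_response = '{"token_type":"bearer","access_token":"AAAAfakeTOKENstring"}'

# decode the serialized JSON and pull out the bearer token
tokens = json.loads(sample_response)
access_token = tokens['access_token']
```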
The second function will be called 'oauth_step2_make_api_request'. It takes as input an authorization token and a set of search terms. Once again we need to set headers following the Twitter tutorial for application-only authorization, including 'Accept-encoding' and the 'Authorization' header (this time using the bearer token):
headers['Authorization'] = "Bearer {0}".format(token)
However, we also need to construct a search URL using parameters supported by Twitter. For example, we can specify that we only want 20 results using count=20. This also requires encoding the terms in a form that will work in the URL, taking care of special characters such as spaces using urllib.quote():
terms = urllib.quote(terms)
url = "{0}?q={1}&result_type=recent&count=20".format(
TWITTER_STREAM_API_PATH, terms)
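The quoting and URL construction can be checked offline. The search term below is hypothetical, and the try/except import makes the sketch run under Python 2 (urllib.quote) or Python 3 (urllib.parse.quote):

```python
try:
    from urllib import quote  # Python 2
except ImportError:
    from urllib.parse import quote  # Python 3

TWITTER_STREAM_API_PATH = 'https://api.twitter.com/1.1/search/tweets.json'

# a hypothetical multi-word search term; quote() percent-encodes the space
terms = quote('christmas tree')
url = "{0}?q={1}&result_type=recent&count=20".format(TWITTER_STREAM_API_PATH, terms)
```

The space in 'christmas tree' becomes %20, so the query string stays a valid URL.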
Finally, we run the search and decode the results. Note that this is a GET request against the search URL we just constructed, and it needs no body:
resp, content = http.request(url, 'GET', headers=headers)
tokens = json.decode(content)
Now that we have the JSON-decoded list of tweets retrieved by our search, it's time to display them. tokens contains two things: a list of tweets (stored under 'statuses') and metadata about the search ('search_metadata'). First, however, we need to understand how JSON stores tweets. Tweets include a surprisingly complex set of information, made harder to understand by the presence of deprecated (old, unsupported) metadata. Luckily, Twitter provides comprehensive documentation. For our application, we will want the 'text' attribute and the 'user'->'name' attribute.
However, before we proceed, it's important to understand that the results use Unicode, a standard for representing international character sets. Reading a good tutorial on Unicode in Python is a sensible place to start, as this topic can be very confusing. Unfortunately, Python 2.7 does not use Unicode by default in its strings, so we need to deal with the conversion explicitly (Python provides a Unicode HOWTO). Jinja2, however, does expect Unicode, so we will need to do some conversions along the way.
# We need a place to store the tweets so we can
# pass them to jinja
tweets = []
# Next, we need to iterate through the tokens in
# the status portion of our results:
for tweet in tokens['statuses']:
# next we extract the text for the tweet and unicode it
text = unicode(tweet['text'])
# and unicode the user's name
name = unicode(tweet['user']['name'])
    # finally we append the [text, name] pair to the tweets list
    tweets.append([text, name])
# and add the result to our context
context = {"search": terms, "tweets":tweets}
# so that we can render the tweets
self.render_response('index.html', context)
Because we are dealing with textual data here, cleaning and examining the data is not easily handled using statistical techniques and graphs. Instead we need to start by looking at the content of the text itself: the characters on the screen and so on. Below are some results for a search for "xmas". There are two problems illustrated below (see if you can figure out what might be problematic before reading further).
The first tweet shown below is not in English. This may not be a problem for everyone, but for me, Twitter results I cannot read are not all that useful. The second one has squares in it (just after "...christmas."); those are probably characters that are not being displayed correctly. Let's work through how to address each of these.
Identifying the language of text could be done in several ways. However, in the case of Twitter we are lucky: every tweet comes with a 'lang' indicator (as described in Twitter's tweet documentation). We simply need to call
lang = tweet['lang']
and then check
if lang == 'en':
to decide whether to display a tweet. After adding this into 'main.py' in the post function and double checking that it works correctly, foreign-language tweets seem to have disappeared from my results.
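The filtering step can be sketched on its own with a minimal sample of decoded tweets (only the fields we use, with made-up values):

```python
# a minimal sample of decoded tweets, with only the fields we use here
# (all values are made up)
statuses = [
    {'text': u'Merry Xmas everyone!', 'lang': 'en'},
    {'text': u'Joyeux Noel !', 'lang': 'fr'},
    {'text': u'Who else loves xmas?', 'lang': 'en'},
]

# keep only the tweets Twitter has tagged as English
english_tweets = [tweet for tweet in statuses if tweet['lang'] == 'en']
```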
To identify the source of the problem with the "boxes" mixed in with text, I turned to the log. I searched for "...christmas" in the log and found the original text as follows:
u'...christmas. \U0001f385\U0001f384 #christmas #xmas #family #instagood #like @ Home http://t.co/WPu6Sn12t5'
I did some detective work and double checked the first code point at http://unicodelookup.com (I had to convert it to hexadecimal and look up 0x0001f385). Sure enough, it was listed as undefined. The next question is: what does it encode, if not standard Unicode? Amazingly, I found the answer further down in my log file:
It looks like an emoticon encoding. Further research suggested that the problem was not my code but rather the font support in the browser I was using. Sure enough: boxes in Chrome, nothing in Firefox, and icons in Safari (and searching for the same tweet in Twitter's own interface I saw the same results).
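In fact, code points in this range were added to Unicode (in version 6.0) as emoji, which is why an older lookup table shows them as undefined. Under Python 3, or a wide build of Python 2.7, the standard library's unicodedata module can identify the code point from the log directly:

```python
import unicodedata

# the first mystery code point from the log excerpt above
mystery = u'\U0001f385'

# ask the standard library for the character's official Unicode name
name = unicodedata.name(mystery)
```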
For this assignment, we are not yet going to start dealing with language directly. Instead we would like to explore the numerical side of data analysis. Let's start by printing the total number of tweets, and then organize them by day, week, and month.
First we need to get the date of each tweet, which can be found in tweet['created_at']. Time is provided in UTC format, as in 'Wed Aug 27 13:08:45 +0000 2008'. Parsing the date and converting it into local time required a fair amount of googling on my part, because Python's standard date and time libraries do not handle timezones very intuitively. In the end, after reading extensively, and thanks to a Stack Overflow comment pointing at work done by Jeff Miller in blackbirdpy, I learned that the best tool for parsing the date is in the email.utils library. I tell this story by way of noting that we all have to search for solutions sometimes! Here are the imports:
# libraries needed for parsing and using dates
import datetime
import email.utils
And here is the helper function that I wrote based on blackbirdpy (define it at module level in main.py, since it is called as a plain function below).
# copied and modified from blackbirdpy (https://github.com/jmillerinc/blackbirdpy)
def timestamp_string_to_datetime(text):
    """Convert a string timestamp of the form 'Wed Jun 09 18:31:55 +0000 2010'
    into a Python datetime object in local time."""
    tm_array = email.utils.parsedate_tz(text)
    # build a UTC datetime, subtracting the timestamp's own UTC offset
    tweet_created_datetime = (datetime.datetime(*tm_array[:6]) -
                              datetime.timedelta(seconds=tm_array[-1]))
    # shift from UTC into this machine's local time
    tweet_local_datetime = (tweet_created_datetime +
                            (datetime.datetime.now() - datetime.datetime.utcnow()))
    return tweet_local_datetime
We can now extract the date of the tweet, in local time, by calling:
# the date of the tweet
date = timestamp_string_to_datetime(tweet["created_at"])
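As mentioned above, we want to organize tweets by day (and similarly by week or month). A minimal sketch of per-day counting with collections.Counter, using hypothetical 'created_at' strings and keeping everything in UTC for simplicity (the local-time shift from the helper above is skipped here):

```python
import email.utils
import datetime
from collections import Counter

# hypothetical 'created_at' strings in Twitter's UTC timestamp format
created_ats = [
    'Wed Aug 27 13:08:45 +0000 2008',
    'Wed Aug 27 21:30:00 +0000 2008',
    'Thu Aug 28 09:15:12 +0000 2008',
]

def to_utc_datetime(text):
    # parse the timestamp and subtract its UTC offset (zero for +0000)
    tm = email.utils.parsedate_tz(text)
    return datetime.datetime(*tm[:6]) - datetime.timedelta(seconds=tm[-1])

# count tweets per calendar day
tweets_per_day = Counter(to_utc_datetime(t).date() for t in created_ats)
```

Grouping by week or month works the same way; just replace .date() with, say, .isocalendar()[:2] for (year, week) or (d.year, d.month) for months.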
We may want to handle retweets differently from original tweets. We can find them by checking for tweet['retweeted_status'], which will exist only if the tweet is a retweet:
not_original = 'retweeted_status' in tweet
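This check can be sketched with a pair of made-up decoded tweets, one original and one retweet:

```python
# sample decoded tweets; the second contains 'retweeted_status', so it is a
# retweet (all values are made up)
statuses = [
    {'text': u'check out my holiday photos'},
    {'text': u'RT @friend: check out my holiday photos',
     'retweeted_status': {'text': u'check out my holiday photos'}},
]

# separate originals from retweets by the presence of the key
originals = [t for t in statuses if 'retweeted_status' not in t]
retweets = [t for t in statuses if 'retweeted_status' in t]
```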
A working application based on this tutorial will be sufficient to get full credit for this byte. An example can be found at http://jmankoff-byte2.appspot.com/. You can get extra credit if you impress the grader; below are some ideas.