Byte 6: Working With Text

  • Description: Build an application that can download twitter data. Make sure the data is "clean" and in the right language. ...
  • Due date: See Blackboard
  • Grading: More details to be provided.

Overview

In this project, you will create a new application that displays data retrieved from Twitter. You will need to use OAuth to access this information. For this assignment, we are not yet going to start dealing with language (a common use of twitter data). Instead we would like to explore the numerical side of data analysis. For this we will be counting tweets that match a search string. This assignment has the following learning goals:

  • Calculating basic statistics
  • Data cleaning
  • ...

Detailed instructions for Byte 6

Byte 6 will build on the concepts used in the earlier Bytes. You should start by setting up a new google application, called [yourname]-byte6.appspot.com.

Setting things up on the Twitter side

We will be using the search API from twitter, which is a RESTful API. We are going to use twitter's application-only authentication, which is much simpler than application + user authentication and gives us access to search the tweet stream. To gain access to the Twitter API, you first need an authorization (a special password just for this application) from twitter (you will need a twitter account to obtain this). There are several types of authorization supported by Twitter. For our purposes, the dev.twitter.com option will work well.

Go ahead and create your application; call it [yourname]-byte6. When you create this application, you will be asked to provide a url for it. You can use 'http://[yourname]-byte6.appspot.com' (without the quotes, of course). For further details on how to set up these applications, see the instructions provided by Twitter (you can always go back and change these settings). When you are done creating the application, you should have a set of tokens called CONSUMER_KEY and CONSUMER_SECRET. Be sure to save both tokens.

Importing all of the necessary libraries

There are a number of python libraries for twitter. However, the learning goals for this project include being able to access any API, not just Twitter's, so we will be using lower level libraries and doing the authentication ourselves. To move forward, we will need to use a number of libraries from several different locations: Google-installed libraries can be found at webapp2_extras (documentation for webapp2_extras) or at google.appengine.api (documentation for google.appengine). Additional google-supported apis are at https://developers.google.com/api-client-library/python/apis/. Additionally, we sometimes have to "manually" install libraries, including some provided by google, similarly to how we installed feedparser in an earlier project. You should familiarize yourself with these libraries, as you may find yourself needing them at times over the course of the semester. For twitter access, we will be using:

  • httplib2 from the Google APIs client library. You can get this from google, but be aware that the installation instructions are a bit confusing: you need to scroll down to the section titled "App Engine" and then follow the instructions to download the latest version of google-api-python-client-gae-N.M.zip and install it in your [yourname]-byte6 directory (I unzipped the file and then copied the subdirectory for httplib2 into that directory). This is similar to the way we installed feedparser in byte 1.
  • json and jinja2 from webapp2_extras
  • base64 (a standard library for encoding base 64 required by the oauth protocol)
  • urllib (a standard library for working with urls)

When you are done setting all of this up, the header of your main.py file should look something like this:

import base64
import webapp2
import logging
from webapp2_extras import jinja2
from webapp2_extras import json
import httplib2
import urllib

Debugging as you go

The work we are about to embark on can be somewhat complex, and involves multiple moving parts. It is helpful to know about several options for tracing what is happening:

  • When you are exploring solutions to a problem of this sort, we highly recommend calling logging.info(...) frequently to double check that things are proceeding as you expect (see the sketch after this list).
  • Don't forget to view the log itself, which will contain not only your printouts but also any exceptions and other errors that arise.
  • You can try out the get and post requests you make by directing them to RequestBin rather than twitter. Once you run the 'GET' or 'POST', refresh the page that RequestBin gives you, and you can see exactly what you sent to twitter. Compare this to the tutorial to see if you have everything right.
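
Here is a quick illustration of the kind of logging we mean; resp and content stand in for the values returned by http.request() later in this tutorial, and the messages themselves are just examples:

# log intermediate values so they show up in the App Engine log viewer
logging.info("twitter response status: %s" % resp.status)
logging.info("twitter raw content: %s" % content)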

Implementing the application only OAuth protocol

For our implementation, we will follow the documentation on twitter for application only authentication, which very clearly lays out the process for requesting authorization. To make things simple, we will store the key information we need to access twitter in four global variables. Note that this is not a very secure way to do things; having a "secrets" file that you load them from would be an improvement, for example (a sketch of that idea follows the code below).

# this is the URL we will use to get our authorization token (stage one of the authorization)
TWITTER_REQUEST_TOKEN_URL = 'https://api.twitter.com/oauth2/token'
# this is the search URL we will use at the end
TWITTER_STREAM_API_PATH   = 'https://api.twitter.com/1.1/search/tweets.json'
# this is the set of keys we need to get our authorization token
TWITTER_CONSUMER_KEY      = 'HxmQOAxoU9fnmrLOx7rbw'
TWITTER_CONSUMER_SECRET   = '9awHN0NEsFuVcwZHaLqKIjPi9r6FIN2D52rcgw8'
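
As an aside, here is a minimal sketch of the "secrets" file idea mentioned above, assuming a file named secrets.json deployed alongside main.py (the file name and key names are illustrative; keep such a file out of public version control):

# secrets.json would contain something like:
#   {"consumer_key": "...", "consumer_secret": "..."}
secrets = json.decode(open('secrets.json').read())
TWITTER_CONSUMER_KEY    = secrets['consumer_key']
TWITTER_CONSUMER_SECRET = secrets['consumer_secret']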

In addition, we need to create an http client which is also going to be global so that we can send 'POST' and 'GET' requests:

# this is how we will send requests to twitter to get our authorization token
http = httplib2.Http()

The next step is to define one function for each step of the oauth process. Note that right now we are doing no error checking, but you could easily check for errors by looking at the "response" portion of each http request. The first function will be called:

def oauth_step1_get_authorization_token(self):

It is a straightforward setup of the headers described in the twitter tutorial for application only authorization, including specific text for 'Content-Type', 'Accept-encoding', and 'Authorization'. The most complex part of this, shown below, is setting up the authorization correctly:

credentials = "{0}:{1}".format(TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET)
credentials = base64.b64encode(credentials)
headers['Authorization'] = "Basic {0}".format(credentials)

Once all the headers are set up, we simply call

resp, content = http.request(token_request_url, 
                             'POST', headers=headers, body=content)

content will be text in serialized json format. We need to decode it and retrieve the access token. You may want to read up on how python supports json for storing and accessing content once it is decoded.

tokens = json.decode(content)
return tokens['access_token']

Note that the access token (tokens['access_token']) in this type of oauth does not change. You could store it in a file and retrieve it, and unless you specifically ask twitter to revoke the token, it will keep working.
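
Putting those pieces together, the whole of step one might look something like the sketch below. It is assembled from the snippets above and twitter's application-only auth documentation (the 'Content-Type', 'Accept-encoding', and body values come from that documentation); error checking is still omitted:

def oauth_step1_get_authorization_token(self):
    # headers described in twitter's application-only auth tutorial
    headers = {}
    headers['Content-Type'] = 'application/x-www-form-urlencoded;charset=UTF-8'
    headers['Accept-encoding'] = 'gzip'
    # the consumer key and secret, joined and base64 encoded
    credentials = "{0}:{1}".format(TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET)
    credentials = base64.b64encode(credentials)
    headers['Authorization'] = "Basic {0}".format(credentials)
    # the body of the POST request, as specified by twitter
    content = 'grant_type=client_credentials'
    # ask twitter for our bearer token
    resp, content = http.request(TWITTER_REQUEST_TOKEN_URL,
                                 'POST', headers=headers, body=content)
    # decode the json response and pull out the token
    tokens = json.decode(content)
    return tokens['access_token']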

The second function will be called 'oauth_step2_make_api_request'. It takes as input an authorization token and a set of search terms. Once again we need to set headers following the twitter tutorial for application only authorization, including 'Accept-encoding' and the 'Authorization' header (this time using the token):

headers['Authorization'] = "Bearer {0}".format(token)

However, we also need to construct a search URL using parameters supported by twitter. For example, we can specify that we only want 20 results using count=20. This also requires encoding the terms in a form that will work in a URL, taking care of funny characters like spaces, using urllib.quote():

terms = urllib.quote(terms)
url = "{0}?q={1}&result_type=recent&count=20".format(
                                           TWITTER_STREAM_API_PATH, terms)

Finally, we run the search (note that this is a 'GET' request to the search url we just built, not another 'POST' to the token url) and decode the results:

resp, content = http.request(url, 'GET', headers=headers)
tokens = json.decode(content)
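
Again putting the pieces together, step two might look something like this sketch (as before, error checking is omitted):

def oauth_step2_make_api_request(self, token, terms):
    # set up the headers, this time using the bearer token from step one
    headers = {}
    headers['Accept-encoding'] = 'gzip'
    headers['Authorization'] = "Bearer {0}".format(token)
    # encode the search terms so they are safe to include in a url
    terms = urllib.quote(terms)
    url = "{0}?q={1}&result_type=recent&count=20".format(
                                        TWITTER_STREAM_API_PATH, terms)
    # run the search (a GET request) and decode the json results
    resp, content = http.request(url, 'GET', headers=headers)
    return json.decode(content)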

Understanding and displaying tweets

Now that we have the json-decoded list of tweets retrieved by our search, it's time to display them. tokens contains two things: a list of tweets (stored as 'statuses') and some metadata ('search_metadata'). First, however, we need to understand how json stores tweets. Tweets include a surprisingly complex set of information, made harder to understand by the presence of deprecated (old, unsupported) metadata. Luckily, twitter provides comprehensive documentation. For our application, we will want something like the 'text' attribute and the 'user'->'name' attribute.

However, before we proceed, it's important to understand that the results make use of unicode, a standard for representing an international character set. This can be very confusing, and unfortunately python 2.7 does not use unicode by default in its strings, so we need to deal with it explicitly (python provides a unicode tutorial, which is a good place to start). Jinja, on the other hand, does expect unicode, so we will need to do some conversions along the way.

# We need a place to store the tweets so we can 
# pass them to jinja
tweets = [] 
# Next, we need to iterate through the tokens in 
# the status portion of our results:
for tweet in tokens['statuses']:
    # next we extract the text for the tweet and unicode it
    text = unicode(tweet['text'])
    # and unicode the user's name
    name = unicode(tweet['user']['name'])
    # finally we add the tweet to our list
    tweets.append([text, name])
# and add the result to our context
context = {"search": terms, "tweets":tweets}
# so that we can render the tweets
self.render_response('index.html', context)

Cleaning the data

Because we are dealing with textual data here, cleaning and examining the data is not easily handled using statistical techniques and graphs. Instead we need to start by looking at the content of the text, the characters on the screen and so on. Below are some results for a search for "xmas". There are two problems illustrated below (see if you can figure out what might be problematic before you read further).

The first tweet shown below is not in English. This may not be a problem for everyone, but for me, twitter results I cannot read are not all that useful. The second one has squares in it (just after "...christmas."). Those are probably characters that are not being displayed correctly. Let's work through how to address each of these.

Which Language Do I Want?

Identifying the language of a text could be done in several ways. However, in the case of twitter, we are lucky that every tweet comes with a 'lang' indicator (as described in twitter's tweet documentation). We simply need to call

lang = tweet['lang']

and then check

if lang == 'en':

to decide whether to display a tweet. After adding this into the post function in 'main.py' and double checking that it works correctly, foreign language tweets disappeared from my results.
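
For concreteness, here is a sketch of how the check might slot into the display loop from earlier:

for tweet in tokens['statuses']:
    # skip any tweet that twitter has not tagged as English
    if tweet['lang'] != 'en':
        continue
    text = unicode(tweet['text'])
    name = unicode(tweet['user']['name'])
    tweets.append([text, name])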

Unicode wasn't enough

To identify the source of the problem with the "boxes" mixed into the text, I turned to the log. I searched for "...christmas" in the log and found the original text:

u'...christmas. \U0001f385\U0001f384 #christmas #xmas #family #instagood #like @ Home http://t.co/WPu6Sn12t5'

I did some detective work and double checked the first code at http://unicodelookup.com (I had to convert it to hexadecimal and look up 0x0001f385). Sure enough, it is undefined. The next question is: what does it encode, if not unicode? Amazingly, I found the answer further down in my log file:

It looks like an emoticon encoding. Further research suggested that the problem was not my code but rather the font set associated with the browser I was using. Sure enough: boxes in chrome, nothing in firefox, and icons in safari (and when I searched for the same tweet in Twitter's own interface I saw the same results).

Counting Tweets

For this assignment, we are not yet going to start dealing with language directly. Instead we would like to explore the numerical side of data analysis. Let's start by printing the total number of tweets, and then organizing them by day, week, and month.

First we need to get the date of each tweet, which can be found in tweet['created_at']. Time is provided in UTC format, as in Wed Aug 27 13:08:45 +0000 2008. Parsing the date and converting it into local time required a fair amount of googling on my part, because python's standard date and time libraries do not handle timezones very intuitively. In the end, after reading extensively, and thanks to a stackoverflow comment pointing at work done by Jeff Miller in blackbirdpy, I learned that the best tool for parsing these dates is in the email.utils library. I tell this story by way of noting that we all have to search for solutions sometimes! Here are the imports:

# libraries needed for parsing and using dates
from datetime import datetime, timedelta
import email.utils

And here is the helper function that I wrote based on blackbirdpy (insert this into your MainHandler class):

# copied and modified from blackbirdpy (https://github.com/jmillerinc/blackbirdpy)
def timestamp_string_to_datetime(self, text):
    """Convert a string timestamp of the form 'Wed Jun 09 18:31:55 +0000 2010'
    into a Python datetime object."""
    tm_array = email.utils.parsedate_tz(text)
    # build a UTC datetime, correcting for the timezone offset in the string
    tweet_created_datetime = datetime(*tm_array[:6]) - timedelta(seconds=tm_array[-1])
    # shift from UTC into this machine's local time
    tweet_local_datetime = tweet_created_datetime + (datetime.now() - datetime.utcnow())
    return tweet_local_datetime

We can now extract the date of the tweet, in local time, by calling:

# the date of the tweet
date = self.timestamp_string_to_datetime(tweet["created_at"])

We may want to handle retweets differently from original tweets. We can find them by checking for tweet['retweeted_status'], which will exist only if the tweet is a retweet:

not_original = 'retweeted_status' in tweet
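
With the date helper and the retweet check in hand, counting is mostly bookkeeping. Here is a minimal sketch that tallies the total number of original tweets and the number per day (the variable names are illustrative, and it assumes the code runs inside a MainHandler method where tokens is available):

from collections import defaultdict

total = 0
per_day = defaultdict(int)
for tweet in tokens['statuses']:
    # skip retweets so we only count original tweets
    if 'retweeted_status' in tweet:
        continue
    date = self.timestamp_string_to_datetime(tweet['created_at'])
    # key each tweet by its calendar date
    per_day[date.date()] += 1
    total += 1
logging.info("total original tweets: %d" % total)
for day, count in sorted(per_day.items()):
    logging.info("%s: %d tweets" % (day, count))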

Thoughts about ways to take this further (depending on your interests and skillset):

A working application based on this tutorial will be sufficient to get full credit for this byte. An example can be found at http://jmankoff-byte2.appspot.com/. You can get extra credit if you impress the grader. Below are some ideas:

  • Right now, we don't cache the authorization token. This causes occasional errors, because twitter limits how often we can retrieve the authorization, and it is silly anyway since the token doesn't change. We could store it in a file, in which case we only have to request it once, or store it as a session variable (a simple per-instance variant is sketched after this list).
  • The type of authorization we implemented is limited to only part of Twitter's API. To access per-user data, you need to do per-user authorization and then modify your authorization implementation to follow Twitter's guidelines for obtaining access tokens on behalf of a user. Note that for testing on your local machine, you'll want to create a second twitter application (call it [yourname]-byte6-dev), which would have to redirect to that machine instead of the appspot application.
  • No error checking: the authentication procedure can go wrong for many different reasons. It would be a good idea to check the responses coming back from twitter for standard errors before parsing the content, and to handle those errors appropriately.
  • We have created a basic text display for the tweets. How about replacing this with correct tweet parsing to show the images as images and link all the hash tags to new searches for that hash tag?
  • Try some simple language analysis of the tweets, for example using:
    • http://www.nltk.org/
    • Google's sentiment predictor at https://developers.google.com/prediction/docs/gallery#hosted_model
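
For the first idea, here is a minimal sketch of one simple variant: caching the token in a module-level variable, so each running server instance only requests it once (a file or a session variable, as suggested above, would persist it more durably; the function name here is illustrative):

# cache the bearer token so we only ask twitter for it once per instance
ACCESS_TOKEN = None

def get_cached_authorization_token(self):
    global ACCESS_TOKEN
    if ACCESS_TOKEN is None:
        ACCESS_TOKEN = self.oauth_step1_get_authorization_token()
    return ACCESS_TOKEN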

Some questions you should be able to answer

  • What are some other ways you might have been able to address the language problem (if Twitter had not provided a solution)?
  • What are some ways to eliminate spam tweets that do not require natural language processing?