Three Little Birds: Twitter Data Visualization


Project Description

The Three Little Birds project is an analysis of over 50 GB of Twitter data taken from the Stanford twitter7 dataset. The dataset contains 475 million tweets from 17 million users, collected between June and December of 2009. As a group, we were to decide how to interpret this data, visualize it, and incorporate a dynamic Twitter feed of current tweets into the project. Using the data and our visualization, we were to look for patterns in the data.

We were interested in answering questions about the use of words over time. We wanted to know whether word usage varies over a span of months, over the course of a day, and over the course of a single run of the application. With our application, a user can search the static dataset for interesting words and look for patterns over June-Dec 2009, over the course of a day, and while the application is open.


Obtain and run the application

Standalone applications:  To run, double click on the .exe or .app file. 
Source code and application for Mac, Windows and Linux
    Here
Tweet Stream Library needed to run the program from source (place inside libraries directory in sketchbook)
    Here


Data

The data from the Stanford set was in the format:

T 2009-12-01 00:00:31
U http://twitter.com/emilyrecord
W I wonder how often you think of me, if ever. Today is not a happy day.

We decided that, in order to condense the data to a smaller size, each tweet should be a single line and should not include the username. This gave us an easy way to access each part of the tweet we needed: the date, the time, and the message itself. When deciding how to visualize the data, we concluded that the best way would be to break the tweets up into key words. That way we could analyze how certain words are tweeted more often than others and how they relate to certain events.
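The condensing step can be sketched as below. This is only an illustration, not the code we used: the class name `TweetFlattener` and the tab-separated output layout are assumptions, since the exact single-line format is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collapses T/U/W records into one line per tweet.
class TweetFlattener {
    // Keeps the timestamp (T) and message (W) and drops the user (U) line.
    // The "timestamp<TAB>message" layout is an assumption for illustration.
    static List<String> flatten(List<String> lines) {
        List<String> out = new ArrayList<>();
        String timestamp = null;
        for (String line : lines) {
            if (line.startsWith("T ")) {
                timestamp = line.substring(2).trim();
            } else if (line.startsWith("W ") && timestamp != null) {
                out.add(timestamp + "\t" + line.substring(2).trim());
                timestamp = null; // wait for the next T line
            }
        }
        return out;
    }
}
```

Because the date and time sit at a fixed position at the start of each output line, later steps can pick out the day and hour without re-reading three lines per tweet.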

Initially, the text file for June was stripped down to only the tweet text, with no date or time. Then, using the 'tr' command in bash, the tweets were broken down to a single word per line. This output was sorted, piped to uniq -c, and sorted again. After eliminating common words such as "the", "and", and "a", and keeping only the top 1500 words, we had a good idea of which words were worth comparing. Adding some popular hashtags, news events, and celebrity names gave us a list of 1,750 words that the user of our application can compare and analyze.
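For readers who prefer code to a shell pipeline, the same counting idea can be sketched in Java. This is a rough equivalent rather than what we ran: the class name `WordFrequency`, the three-word stop list, and the lower-casing of words are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of the "split, count, sort" step.
class WordFrequency {
    // Assumed stop list; the real list of excluded words was longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "and", "a"));

    // Returns the `limit` most frequent words, most frequent first.
    // Ties are broken alphabetically so the result is deterministic.
    static List<Map.Entry<String, Integer>> topWords(List<String> tweets, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tweet : tweets) {
            // Splitting on whitespace keeps emoticons such as ":)" intact.
            for (String word : tweet.toLowerCase().split("\\s+")) {
                if (word.isEmpty() || STOP_WORDS.contains(word)) continue;
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
            .limit(limit)
            .collect(Collectors.toList());
    }
}
```

Calling `WordFrequency.topWords(tweets, 1500)` would correspond to keeping the top 1500 words after the stop words are removed.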

We also had a program running on a computer for the week of 11/11/2010 - 11/18/2010 that was recording a stream of tweets from Los Angeles, Chicago, and New York.

Once all of the files we were collecting were in the same format as the modified Stanford data, we wrote a Java program that loaded all of the keywords we were looking for into a HashMap and created a 2D array for each word to keep track of the number of times it was tweeted in each hour of each day. After doing the same for the current stream of data we were collecting, we ended up with text files that held the counts per day and averages per hour for each of the 1,750 words we were keeping track of.
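A minimal sketch of that data structure follows. It is not our original program: the class and method names are hypothetical, it uses the `java.time` API for brevity, and it assumes keywords are stored in lower case and that words are separated by whitespace.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one [day][hour] grid of counts per tracked keyword.
class HourlyDailyCounter {
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    private final LocalDate start;
    private final int days;
    private final Map<String, int[][]> counts = new HashMap<>();

    HourlyDailyCounter(Collection<String> keywords, LocalDate start, int days) {
        this.start = start;
        this.days = days;
        for (String k : keywords) counts.put(k, new int[days][24]);
    }

    // Adds one tweet; timestamps outside the tracked date range are ignored.
    void add(String timestamp, String message) {
        LocalDateTime t = LocalDateTime.parse(timestamp, FMT);
        int day = (int) ChronoUnit.DAYS.between(start, t.toLocalDate());
        if (day < 0 || day >= days) return;
        for (String word : message.toLowerCase().split("\\s+")) {
            int[][] grid = counts.get(word);
            if (grid != null) grid[day][t.getHour()]++;
        }
    }

    // Total count of a word on one day (sum over the 24 hours).
    int countForDay(String word, int day) {
        int[][] grid = counts.get(word);
        if (grid == null) return 0;
        int sum = 0;
        for (int c : grid[day]) sum += c;
        return sum;
    }

    // Average count of a word in a given hour, taken over all tracked days.
    double averageForHour(String word, int hour) {
        int[][] grid = counts.get(word);
        if (grid == null) return 0.0;
        long sum = 0;
        for (int d = 0; d < days; d++) sum += grid[d][hour];
        return (double) sum / days;
    }
}
```

The two query methods correspond to the two kinds of output files described above: counts per day and averages per hour.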

The Processing program contains a HashTable that associates each word key with an object holding all of the data for that word, including its usage over time in the different data sets. As a tweet comes into the program, the HashTable is checked to see whether each of its words is included in the word set. If it is, a counter is incremented in the appropriate ArrayList position, with one position for each second. In this manner, as the program runs, it efficiently keeps track of the number of times a word is tweeted in a given second since the application was 'turned on'. If a word is selected, this data is displayed in a dynamic line plot.
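The per-second bookkeeping described above might look roughly like the following. The names `WordStats` and `StreamCounter` are hypothetical, and the sketch uses a `HashMap` and whitespace splitting as simplifying assumptions; it is not the application's actual code.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-word record: one counter per second since start-up.
class WordStats {
    final List<Integer> perSecond = new ArrayList<>();

    void increment(int second) {
        // Pad with zeros for seconds in which the word was not seen.
        while (perSecond.size() <= second) perSecond.add(0);
        perSecond.set(second, perSecond.get(second) + 1);
    }

    int countAt(int second) {
        return second < perSecond.size() ? perSecond.get(second) : 0;
    }
}

// Hypothetical stream handler: looks up each incoming word in the table.
class StreamCounter {
    final Map<String, WordStats> words = new HashMap<>();
    private final long startMillis;

    StreamCounter(Collection<String> tracked, long startMillis) {
        this.startMillis = startMillis;
        for (String w : tracked) words.put(w, new WordStats());
    }

    // Called for every incoming tweet with the time it arrived.
    void onTweet(String text, long nowMillis) {
        if (nowMillis < startMillis) return;
        int second = (int) ((nowMillis - startMillis) / 1000);
        for (String token : text.toLowerCase().split("\\s+")) {
            WordStats stats = words.get(token);
            if (stats != null) stats.increment(second); // untracked words are skipped
        }
    }
}
```

Each lookup is a constant-time hash probe, which is why the cost per tweet stays small even with 1,750 tracked words.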

Components

Our program includes the following components:

Search boxes for word1 and word2. As you type, the list of possible input words is displayed. When you press 'Enter', that word is selected, and all plots and data sets are updated to present the data for the selected word.


Next to the search boxes are bar charts showing the average hourly usage of a word in tweets from the Stanford data set and from the recent data set from LA, NY and Chicago. The final column in the bar chart represents the streaming data set: it shows the predicted hourly usage of the word, based on its current per-second rate of usage.


Next to the search boxes is a line plot showing streaming data as it comes in, with each position representing one second. It will reach the end of the plot after 10 minutes, and will then shift one frame each second. In this manner, the application can run indefinitely, with a constant display of word usage over time.


Next to the dynamic line plot is a dynamic word cloud. This cloud shows the 15 most popular words from our word set tweeted in the incoming stream. The cloud updates every 50 tweets.


If the stream does not appear to be updating, the user may select 'reset stream' to direct the program to reconnect. See the notes about issues with the Twitter stream below.

All of this sits on the top half of the application.
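The scrolling line plot and the predicted hourly column both rest on a window of recent per-second counts. A minimal sketch is shown below; the class name `SlidingPlot` is hypothetical, and projecting the hourly figure from the mean per-second rate over the visible window is an assumption, since the exact rate calculation is not spelled out here.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a 10-minute window of per-second counts.
class SlidingPlot {
    static final int WINDOW_SECONDS = 600; // 10 minutes, one position per second
    private final Deque<Integer> window = new ArrayDeque<>();
    private long total = 0;

    // Called once per second with the count for the second that just ended.
    void pushSecond(int count) {
        window.addLast(count);
        total += count;
        // Once the plot is full, drop the oldest second so the plot shifts by one.
        if (window.size() > WINDOW_SECONDS) total -= window.removeFirst();
    }

    int size() {
        return window.size();
    }

    // Assumed projection: mean per-second rate in the window times 3600.
    double projectedHourly() {
        return window.isEmpty() ? 0.0 : total * 3600.0 / window.size();
    }
}
```

Because the window never grows past 600 entries, memory use stays constant no matter how long the application runs.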

Beneath these plots, we display a plot of word1 vs word2 from June 7th, 2009 until Dec 31, 2009. Specific information about this plot may be obtained by hovering over data points. In addition, the slider bar beneath the plot may be dragged to select a region of the plot. When this is done, the bar chart and word cloud data update to show information for the selected time period. It is also possible to view these words as a percentage of all words from our popular word set in a day, or relative to the average number of times the word is used in a day. These changes may be made by selecting the "View by Percentage", "View Rel. To Avg" or "View as Counts" button to the right.

Beneath the plot of word usage over time, you can view the 20 most popular words over the 2009 data set. If the slider is moved, the word cloud is updated along with the graphs.


Next to the static word cloud there are 3 plots: one showing the average number of tweets throughout the day, and the other 2 showing each selected word's average usage throughout the day. You can select these plots by clicking the buttons above the plot view.
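The three view buttons for the word1 vs word2 plot amount to three ways of scaling the same daily counts. The sketch below illustrates the idea; the class name `ViewModes` is hypothetical, and treating "relative to average" as the ratio of each day's count to the word's mean daily count is an assumption.

```java
// Hypothetical sketch of the "View by Percentage" and "View Rel. To Avg" scalings.
class ViewModes {
    // Each day's count as a percentage of all tracked-word occurrences that day.
    static double[] asPercentage(int[] wordCounts, int[] totalTrackedCounts) {
        double[] out = new double[wordCounts.length];
        for (int d = 0; d < wordCounts.length; d++) {
            out[d] = totalTrackedCounts[d] == 0
                ? 0.0
                : 100.0 * wordCounts[d] / totalTrackedCounts[d];
        }
        return out;
    }

    // Each day's count divided by the word's mean daily count (assumed definition).
    static double[] relativeToAverage(int[] wordCounts) {
        double[] out = new double[wordCounts.length];
        long sum = 0;
        for (int c : wordCounts) sum += c;
        if (sum == 0) return out; // word never used: all zeros
        double mean = (double) sum / wordCounts.length;
        for (int d = 0; d < wordCounts.length; d++) out[d] = wordCounts[d] / mean;
        return out;
    }
}
```

"View as Counts" simply plots the raw array. The two scaled views make it easier to compare a common word with a rare one on the same axes.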


Data Analysis

Using our application, it is possible to compare two words from our list of 1,750 and to analyze and predict trends. Here are just a few examples:

This search compares Facebook to MySpace. As the graphs show, many more people tweet about Facebook than about MySpace. This information is helpful for gauging the relative popularity of products, and can be used by marketers and businesses to determine whether their product is being talked about.


A good example of how people use Twitter to convey emotions is comparing emoticons. In this example we are comparing :) with :(
This shows that more people use the smiley face. The current streaming data set makes it clear that this is true in the present as well as in the Stanford 2009 data set.


This comparison shows how people tend to use the word "Happy" along with the smiley face. In one of the large spikes you can see that "Happy" rose drastically because of the Thanksgiving holiday.


This example shows how the deaths of Farrah Fawcett and Michael Jackson, who both died on June 25th, 2009, affected people on Twitter. Tweets about Michael Jackson rose significantly and continued to spike, slowly tapering off by the end of the year. Farrah Fawcett, however, was tweeted about around the time of her death but quickly dropped out of the Twitter feeds.


This example compares the popularity of the iPhone and Android. There was a slight increase in iPhone tweets around its release in June, followed by a much larger rise around August 1. I wasn't able to find anything explaining the increase in August, but perhaps the phones were sold out in many places and by August more people were able to buy them. We can also see that although Android was tweeted about less, its usage gradually grew, and by the end of the year it was tweeted about as much as the iPhone.


Moving the slider reveals interesting patterns in popular word usage. For example, on the 4th of July, the word cloud shows 'free', 'sing', 'Palin', and 'followers'. 'Free' and 'sing' are reflective of the holiday. Sarah Palin delivered a 4th of July message in 2009, and perhaps this is why her name comes up among popular words on that day.



On Christmas Day, 2009, some of the popular words were family, happy, hope, love, good, and day.


Shown in the word cloud below are some of the popular words used on November 23rd, 2009, a year before our project was due.


Below are screenshots of some words through the day.









Coding and Issues

Libraries used:
1. ControlP5
2. TweetStream

Websites used:
http://colorbrewer2.org/
http://www.colourlovers.com/

The code for the word cloud is adapted from the Processing code 'Tag Cloud' by Wray Bowling.

Issues with Twitter Stream:
We worked extensively to fix the problem of maintaining a connection to the Twitter stream by modifying the Twitter stream library. These modifications altered the way in which the stream attempts to reconnect in the event of a disconnect. Further, we added a 'reset stream' button to re-establish a connection to the stream. However, there seem to be mechanisms beyond our control affecting the connection to the stream. Perhaps there are limits on the number of connections a user can establish to the stream. Alternatively, a spotty internet connection might make it difficult to establish a connection.

Of greater concern is the fact that if a stream is not established at the beginning, it rarely gets established without restarting the application. We have tried to troubleshoot this problem, and hope it does not affect the user's experience too greatly.