With the rise of social media in the 21st century, radical opinions have been popularized and can easily influence public opinion. Global issues can snowball into huge controversies on the internet, obscuring the true levels of support for or disagreement with certain topics. These opinions are heavily valued by large corporations looking to push a product, or by presidential candidates running for election. Previously, gathering this kind of summarized information would have required considerable resources and expert personnel. Using sentiment analysis, however, our project turns the readily available data on Twitter into an easy visualization of the extent of support in each country around the globe. Our project maps this data on a choropleth map to create an informative, yet beautiful graphic.
Choropleth map of the world's sentiment on weather:
The data we extracted was purely from Twitter. We had the option to either extract a set from past tweets, or to livestream the data into our files. We chose the latter.
In order to extract tweets, we needed access to Twitter's API. Individual keys and access tokens were generated through our account in Twitter Apps. With these access tokens and a Python module named Tweepy, we were able to stream data from Twitter through a class we created, called myStreamListener:
Live Stream of tweets:
The parts of the Twitter data we used were the Twitter user's location as specified in their profile and the text in their live tweets. To perform sentiment analysis, we needed the latter portion of the data: the text in the tweets themselves.
But first, what is sentiment analysis? In natural language processing (NLP), sentiment analysis aims to discern the attitude of the speaker or writer of a piece of text toward a topic discussed within that text. In our case, sentiment analysis is applied to tweets that mention keywords like "trump" or "feminism": we attempt to infer whether these tweets express positive or negative feelings about Donald Trump or feminism, and how strong those feelings are.
Before performing the sentiment analysis, we need to clean up the tweets to discard extraneous content like embedded urls, hashtags, media (images & videos), user mentions, and symbols. By checking whether each of these appeared in an individual tweet's tweet.entities, we recorded the index range of each appearance, then sliced the tweet to eliminate the unnecessary items.
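A minimal, pure-Python sketch of this cleaning step (the clean_tweet helper and the sample entities dict are our own illustration, mirroring the index ranges Twitter reports in tweet.entities):

```python
def clean_tweet(text, entities):
    """Strip urls, hashtags, media, user mentions and symbols from a
    tweet, using the index ranges reported in tweet.entities."""
    spans = []
    for kind in ("urls", "hashtags", "media", "user_mentions", "symbols"):
        for item in entities.get(kind, []):
            spans.append(tuple(item["indices"]))
    # Cut right-to-left so earlier indices stay valid after each slice.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + text[end:]
    return " ".join(text.split())  # collapse leftover whitespace


entities = {
    "hashtags": [{"text": "weather", "indices": [10, 18]}],
    "urls": [{"indices": [19, 44]}],
}
print(clean_tweet("Loving it #weather https://t.co/abcdefghijkl", entities))
# -> Loving it
```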
Now that we have clean tweets consisting entirely of relevant text, we can convert them into blob objects and use TextBlob to conduct our sentiment analysis. TextBlob is a Python library made specifically for processing textual data, and each blob object exposes a sentiment property that outputs two values: a polarity value and a subjectivity value. The polarity value can be anywhere between -1.0 and 1.0 inclusive and tells us how negative or positive the attitude of the text is. The subjectivity rating shows how subjective or objective a statement is, but is not relevant to our work here.
As we collect tweets from countries around the world, we can calculate both the sum of total polarity values from users we found in those countries and the average polarity value of the users from those countries.
To us, this was surprisingly the most difficult part of the project.
Using Tweepy, we were able to list the attributes of a single tweet, including the user's geo_enabled flag, which signifies whether they have a set location. However, even if this flag was True, a user could override a GPS location with their own tag, which often consisted of patriotic phrases and emojis that contributed to uneven formatting. This created the task of filtering these out and finding the countries the Twitter users are actually from based on their location descriptions.
First, we decided that even though the majority of the tweets would come from the United States, we did not want to limit ourselves to one country. We used GeoText to extract the countries from the location descriptions.
As we collected live tweets, we added each country we found at least one tweet from into a dictionary, with the country's name as the key and, as the value, the number of tweets from that country along with the sum of the polarities of those tweets. After we had collected a sufficiently large number of tweets, we averaged the polarities for each country and wrote these values alongside the country names into a csv file.
Once we had a csv file with the names and average polarities of the countries our live tweets came from, we were able to create a choropleth map, which uses varying colors and shades for each country to represent that country's average polarity.
We were able to do this using Folium, which allowed us to visualize our data on an interactive Leaflet map after first manipulating our data using Python.
Choropleth map of the world's sentiment on "Happy":