[Bluemix-Spark-Python] Sentiment Analysis of Twitter Hashtags

Since this tutorial was published, we’ve made some strides in notebook technology. To save yourself some work and learn more, try an updated version of my Real-time Sentiment Analysis of Twitter Hashtags tutorial.

How’s your relationship with your customers? You can track how consumers feel about you, your products, or your company based on their tweets. Gauge positive or negative emotions measured across multiple tone dimensions, like anger, cheerfulness, openness, and more. To get real-time sentiment analysis, set up Spark Streaming with Twitter and Watson on Bluemix and use its Notebook to analyze public opinion.

This tutorial covers how to build this app from the source code, configure it for deployment on Bluemix, and analyze the data to produce compelling, insight-revealing visualizations.

How it works

This sample app uses Spark Streaming to create a feed that captures live tweets from Twitter. You can optionally filter the tweets that contain the hashtag(s) of your choice. The tweet data is enriched in real time with various sentiment scores provided by the Watson Tone Analyzer service (available on Bluemix). This service provides insight into sentiment, or how the author feels. We then use Spark SQL to load the data into a DataFrame for further analysis. You can also save the data into a Cloudant database or a parquet file and use it later to track how you’re trending over longer periods.
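Conceptually, the enrichment step merges each tweet’s fields with the tone scores returned by the service. Here is a minimal Python sketch of that idea (illustrative only; the actual app does this in Scala inside the Spark stream, and score_tones is a hypothetical stand-in for the Watson Tone Analyzer call):

```python
def score_tones(text):
    # Placeholder: a real implementation would POST `text` to the
    # Tone Analyzer REST API and parse the JSON scores it returns.
    return {"Cheerfulness": 0.0, "Anger": 100.0 if "angry" in text else 0.0}

def enrich(tweet):
    # Combine the raw tweet fields with the sentiment scores, producing
    # the flat records that are later loaded into a Spark SQL DataFrame.
    enriched = dict(tweet)
    enriched.update(score_tones(tweet["text"]))
    return enriched

print(enrich({"author": "someone", "text": "so angry right now"}))
```

The real app performs this merge on every micro-batch of the stream, so each row in the resulting DataFrame carries both the tweet metadata and its tone columns.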

The following diagram shows the basic architecture of this app:

Architecture of Twitter sentiment analysis solution

Before you begin

If you haven’t already, read and follow the Start Developing with Spark and Notebooks tutorial, which shows how to build a custom library for Apache® Spark™ and deploy it to a Jupyter Notebook in IBM Analytics for Apache Spark on Bluemix.

Build the application

  1. Clone the source code on your local machine: git clone https://github.com/ibm-cds-labs/spark.samples.git
  2. Go to the sub-directory that contains the code for this application: cd streaming-twitter
  3. Compile and assemble the jar by running the following command: sbt assembly. This creates an uber jar (a jar that contains the code and all of its dependencies).

    Note: Soon, I’ll update this tutorial with a link to information on how to assemble uber jars.

  4. Post the jar at a publicly available URL by doing one of the following:
    • Upload the jar to a GitHub repository. (Note its download URL. You’ll use it in a few minutes.)
    • Or use our sample jar, which is pre-built and posted on GitHub.

Connect to Twitter

Create a new app on your Twitter account and configure the OAuth credentials.

  1. Go to https://apps.twitter.com/. Sign in and click the Create New App button.
  2. Complete the required fields:
    • Name and Description can be anything you want.
    • Website. Enter any valid URL.
  3. Below the developer agreement, turn on the Yes, I agree check box and click Create your Twitter application.
  4. Click the Keys and Access Tokens tab.
  5. Scroll to the bottom of the page and click the Create My Access Tokens button.
  6. Copy the Consumer Key, Consumer Secret, Access Token, and Access Token Secret. You will need them later in this tutorial.

Working with your Twitter API keys.

Initiate and run services on Bluemix

  1. On Bluemix, initiate the IBM Analytics for Apache Spark service.
    1. In the top menu, click Catalog.
    2. Under Data and Analytics, find Apache Spark.
      Finding Spark in the Bluemix services catalog
    3. Click to open it, and click Create.
    4. Click Open.
    5. Click the Object Storage tab.
    6. Click the Add Object Storage button and click Create.
  2. Initiate the Watson Tone Analyzer service too. To do so:
    1. In Bluemix, go to the top menu and click Catalog.
    2. Scroll down to the bottom of the page and click the Bluemix Labs Catalog link.
    3. Select the Tone Analyzer service and click Create.
  3. On the left side of the screen, click Service Credentials.
  4. Copy the information (you’ll need it later when running the app in a Notebook):
    "credentials": {
  5. On the upper left of the screen click Back to Dashboard.
  6. Create a new Scala notebook.
    1. In Bluemix, open your Apache Spark service.
    2. If prompted, open an existing instance or create a new one.
      Creating a new Spark instance
    3. Click New Notebook.
    4. Enter a Name, and under Language select Scala.
    5. Click Create Notebook.
  7. In the first cell, enter the following to install the application jar created in the previous section:
    %AddJar https://github.com/ibm-cds-labs/spark.samples/raw/master/dist/streaming-twitter-assembly-1.2.jar -f
  8. In the next cell, configure the credential parameters needed to connect to Twitter and the Watson Tone Analyzer service. Enter the following, inserting your own credentials in place of the XXXX values (the setConfig property names shown here follow the sample application’s configuration API):
    val demo = com.ibm.cds.spark.samples.StreamingTwitter //Shorter handle
    //Twitter OAuth params from the section above
    demo.setConfig("twitter4j.oauth.consumerKey","XXXX")
    demo.setConfig("twitter4j.oauth.consumerSecret","XXXX")
    demo.setConfig("twitter4j.oauth.accessToken","XXXX")
    demo.setConfig("twitter4j.oauth.accessTokenSecret","XXXX")
    //Tone Analyzer service credentials copied from the section above
    demo.setConfig("watson.tone.username","XXXX")
    demo.setConfig("watson.tone.password","XXXX")
    demo.setConfig("watson.tone.url","XXXX")
  9. Run the following input code to start streaming from Twitter and collect the data.


    import org.apache.spark.streaming._
        demo.startTwitterStreaming(sc, Seconds(30))


    Twitter stream started
    Tweets are collected real-time and analyzed
    To stop the streaming and start interacting with the data use: StreamingTwitter.stopTwitterStreaming
    Stopping Twitter stream. Please wait this may take a while
    Twitter stream stopped
    You can now create a sqlContext and DataFrame with 447 Tweets created.
    Sample usage:
                val (sqlContext, df) = com.ibm.cds.spark.samples.StreamingTwitter.createTwitterDataFrames(sc)
                sqlContext.sql("select author, text from tweets").show

    By default, the stream runs until you call the stopTwitterStreaming API. You can pass an optional duration parameter to stop the stream automatically; in the input code above, we run the stream for 30 seconds: demo.startTwitterStreaming(sc, Seconds(30))

  10. Wait for 30 seconds or more for the stream to run. Once the stream stops, run the following input code to create a DataFrame using Spark SQL and start querying the data:


    val (sqlContext, df) = demo.createTwitterDataFrames(sc)


    A new table named tweets with 1247 records has been correctly created and can be accessed through the SQLContext variable
    Here's the schema for tweets
    |-- author: string (nullable = true)
    |-- date: string (nullable = true)
    |-- lang: string (nullable = true)
    |-- text: string (nullable = true)
    |-- lat: integer (nullable = true)
    |-- long: integer (nullable = true)
    |-- Cheerfulness: double (nullable = true)
    |-- Negative: double (nullable = true)
    |-- Anger: double (nullable = true)
    |-- Analytical: double (nullable = true)
    |-- Confident: double (nullable = true)
    |-- Tentative: double (nullable = true)
    |-- Openness: double (nullable = true)
    |-- Agreeableness: double (nullable = true)
    |-- Conscientiousness: double (nullable = true)
  11. Run this input to display a sample of the data.


    val fullSet = sqlContext.sql("select * from tweets limit 100000") //Select all columns
    fullSet.show


    author            date                 lang text                 lat long Cheerfulness Negative Anger Analytical Confident Tentative Openness          Agreeableness     Conscientiousness
    Lizzy Johnson     Sun Sep 27 20:18:... en   @CaylorxShacks an... 0.0 0.0  0.0          100.0    100.0 0.0        0.0       0.0       0.0               1.0               0.0
    26631stwc         Sun Sep 27 20:18:... en   Get Weather Updat... 0.0 0.0  0.0          0.0      0.0   0.0        0.0       0.0       74.0              45.0              96.0
    Ayndrei?          Sun Sep 27 20:18:... en   RT @drycilagan: H... 0.0 0.0  0.0          0.0      0.0   0.0        0.0       0.0       97.0              0.0               68.0
    C.                Sun Sep 27 20:18:... en   RT @denisSDJEM: #... 0.0 0.0  0.0          0.0      0.0   0.0        0.0       100.0     84.0              86.0              6.0
    Jason Brinker     Sun Sep 27 20:18:... en   RT @FirstBaptistJ... 0.0 0.0  0.0          0.0      0.0   0.0        0.0       100.0     36.0              1.0               26.0
    Binary Trader Pro Sun Sep 27 20:18:... en   RT http://t.co/XX... 0.0 0.0  0.0          0.0      0.0   0.0        0.0       0.0       97.0              0.0               68.0
    Scully            Sun Sep 27 20:18:... en   @michaeldweiss @T... 0.0 0.0  0.0          100.0    0.0   0.0        0.0       0.0       11.0              0.0               0.0
    Nomi Garcia       Sun Sep 27 20:18:... en   I'm earning #mPOI... 0.0 0.0  0.0          0.0      0.0   0.0        0.0       0.0       15.0              55.01             68.0
    Luke Robinson     Sun Sep 27 20:18:... en   RT @mzenitz: Amar... 0.0 0.0  100.0        0.0      0.0   0.0        0.0       0.0       30.0              72.0              100.0
    Info Jatim        Sun Sep 27 20:18:... en   Sebar Ratusan Sti... 0.0 0.0  0.0          0.0      0.0   0.0        0.0       0.0       97.0              0.0               68.0
    Chris Niosi       Sun Sep 27 20:18:... en   Beginning to see ... 0.0 0.0  0.0          0.0      0.0   100.0      0.0       0.0       48.0              5.0               12.0
    CR7               Sun Sep 27 20:18:... en   I question my soc... 0.0 0.0  0.0          0.0      0.0   100.0      0.0       0.0       0.0               97.0              9.0
    Steve Eggleston   Sun Sep 27 20:18:... en   @moxiemom I did. ... 0.0 0.0  97.0         0.0      0.0   93.0       100.0     0.0       0.0               98.0              2.0
    Robert            Sun Sep 27 20:18:... en   http://t.co/q9QvM... 0.0 0.0  0.0          0.0      0.0   0.0        0.0       0.0       97.0              0.0               68.0
    savannah neubert  Sun Sep 27 20:18:... en   @Teedubbz12 gnarl... 0.0 0.0  0.0          0.0      0.0   0.0        0.0       0.0       97.0              0.0               68.0
    Aixa J. Rudolph   Sun Sep 27 20:18:... en   I liked a @YouTub... 0.0 0.0  100.0        0.0      0.0   0.0        0.0       0.0       0.0               99.0              100.0
    Jay Pellecchia    Sun Sep 27 20:18:... en   @deborah_lary @GA... 0.0 0.0  0.0          0.0      0.0   0.0        0.0       0.0       86.0              1.0               68.0
    Pond              Sun Sep 27 20:18:... en   RT @CraziestEyes ... 0.0 0.0  100.0        100.0    100.0 0.0        0.0       92.0      0.0               98.0              0.0
    Georgiee          Sun Sep 27 20:18:... en   Simba is the mvp ... 0.0 0.0  0.0          0.0      0.0   0.0        0.0       0.0       57                1.0               100.0
    dolores luchavez  Sun Sep 27 20:18:... en   Love love love  #... 0.0 0.0  100.0        0.0      0.0   0.0        0.0       0.0       0.0               100.0             68.0
  12. Run the following code to save the dataset into a parquet file on Object Storage. (You’ll use this file in the next section when you analyze the data with an IPython Notebook.) With the DataFrame created above, the save looks like this (saveAsParquetFile is the Spark 1.x DataFrame API; newer Spark versions use df.write.parquet):

    df.saveAsParquetFile("swift://notebooks.spark/tweetsFull.parquet")

    Disregard any warning messages related to SLF4J.

    This command saves the file to your object store on Bluemix. To view stored files, go to the upper right of your screen and click Data Source.

  13. Run this other query example, which filters the data to show only tweets that have an Anger rating greater than 70%.


    val angerSet = sqlContext.sql("select author, text, Anger from tweets where Anger > 70")
    angerSet.show


    author             text                 Anger
    Lizzy Johnson      @CaylorxShacks an... 100.0
    Pond               RT @CraziestEyes ... 100.0
    Mychal Elliott     RT @TheTweetOfGod... 100.0
    Kat Willey         Girls gotta look ... 100.0
    Big Chan Trill OG? Bros don't have r... 100.0
    FIFA15 Messi Trick @M4DE_DARKf Luis ... 100.0
    Courtney Perkins   Why does Lauren t... 100.0
    Miami Celebs       We are great writ... 100.0
    InfoblazeCentral   @BillClinton blam... 100.0
    Dave               Argument from ign... 100.0
    Mispooky           @loghainroger_ SH... 100.0
    Nourelmalah        RT @stillawinner_... 100.0
    yella              RT @PoloMylogo_: ... 100.0
    cheyenne           RT @LRHASBOYFRIEN... 100.0
    anette             dont you hate it ... 100.0
    Savage Emily       RT @esterluvzpll:... 100.0
    Nicole Williams    The REAL men and ... 100.0
    Lexi Babyy         RT @Lowkey: peopl... 100.0
    la malinche        @luvhairyguys1 sc... 100.0
    ? MIGO ?           RT @Lowkey: peopl... 100.0

See more: View a copy of this Scala Notebook on GitHub.

Analyze the data using an IPython Notebook

In the previous section, using a Scala Notebook, you learned how to run the Twitter Stream to acquire data and enrich it with sentiment scores from Watson Tone Analyzer. You also ran a command to persist the data in a parquet file on the Object Storage bound to this Spark instance. Now, we’ll reload this data in an IPython Notebook for further analysis and visualization.

  1. From the Notebook main page, create a new Python Notebook.
    From within your Scala notebook, go to the upper left of the screen and click the back button to return to your My Notebooks page.

    Click New Notebook. Enter a Name, and under Language select Python. Then click Create Notebook.

  2. In a cell, run the following code to load the data and create a DataFrame with the entire dataset:
    # Import SQLContext and data types
    from pyspark.sql import SQLContext
    from pyspark.sql.types import *
    # sc is an existing SparkContext.
    sqlContext = SQLContext(sc)
    parquetFile = sqlContext.read.parquet("swift://notebooks.spark/tweetsFull.parquet")
    print parquetFile
    # Register the DataFrame as a temp table so it can be queried with SQL
    parquetFile.registerTempTable("tweets")
    tweets = sqlContext.sql("SELECT * FROM tweets")
    print tweets.count()
  3. Now start analyzing this data. First, enter the following to compute, for each sentiment, the number of tweets whose score is greater than 60%.
    #create an array that will hold the count for each sentiment
    sentimentDistribution=[0] * 9
    #For each sentiment, run a sql query that counts the number of tweets for which the sentiment score is greater than 60%
    #Store the data in the array
    for i, sentiment in enumerate(tweets.columns[-9:]):
        sentimentDistribution[i]=sqlContext.sql("SELECT count(*) as sentCount FROM tweets where " + sentiment + " > 60").collect()[0].sentCount
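    The query loop above boils down to a threshold count per sentiment column. A plain-Python sketch of the same logic (the sample rows below are made up for illustration):

```python
# Count, per sentiment column, how many records score above a threshold --
# the same aggregation the SQL query performs over the tweets table.
rows = [
    {"Anger": 100.0, "Cheerfulness": 0.0},
    {"Anger": 0.0,   "Cheerfulness": 97.0},
    {"Anger": 80.0,  "Cheerfulness": 61.0},
]

def distribution(rows, columns, threshold=60):
    return [sum(1 for r in rows if r[c] > threshold) for c in columns]

print(distribution(rows, ["Anger", "Cheerfulness"]))  # [2, 2]
```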
  4. With the data stored in sentimentDistribution array, run the following code that plots the data as a bar chart.
    %matplotlib inline
    import matplotlib
    import numpy as np
    import matplotlib.pyplot as plt
    ind = np.arange(9)  # x locations for the 9 sentiment columns
    width = 0.35
    bar = plt.bar(ind, sentimentDistribution, width, color='g', label = "distributions")
    params = plt.gcf()
    plSize = params.get_size_inches()
    params.set_size_inches( (plSize[0]*2.5, plSize[1]*2) )
    plt.ylabel('Tweet count')
    plt.title('Distribution of tweets by sentiments > 60%')
    plt.xticks(ind+width, tweets.columns[-9:])


    Distribution of tweets by sentiment score greater than 60%

  5. Enter the following to compute the top 10 hashtags contained in the tweets. This code uses RDD transformations (flatMap, filter, etc.) to massage the data that will be used by the visualization code. Read more about the RDD APIs.
    from operator import add
    import re
    tagsRDD = tweets.flatMap( lambda t: re.split("\s+", t.text))\
        .filter( lambda word: word.startswith("#") )\
        .map( lambda word : (word, 1) )\
        .reduceByKey(add, 10).map(lambda (a,b): (b,a)).sortByKey(False).map(lambda (a,b):(b,a))
    top10tags = tagsRDD.take(10)
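    To see what the RDD pipeline computes, here is a plain-Python equivalent of the same split/filter/count/sort chain (the sample texts are made up for illustration):

```python
# Split each text into words (flatMap), keep hashtags (filter),
# count occurrences (map + reduceByKey), and sort by count descending.
import re
from collections import Counter

texts = ["Love this #spark demo", "#spark + #twitter = fun", "no tags here"]

words = [w for t in texts for w in re.split(r"\s+", t)]   # flatMap
tags = [w for w in words if w.startswith("#")]            # filter
top = Counter(tags).most_common(10)                       # count + sort

print(top)  # [('#spark', 2), ('#twitter', 1)]
```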
  6. Enter this visualization code to plot the data as a pie chart:
    %matplotlib inline
    import matplotlib
    import matplotlib.pyplot as plt
    params = plt.gcf()
    plSize = params.get_size_inches()
    params.set_size_inches( (plSize[0]*2, plSize[1]*2) )
    labels = [i[0] for i in top10tags]
    sizes = [int(i[1]) for i in top10tags]
    colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral', "beige", "paleturquoise", "pink", "lightyellow", "coral"]
    plt.pie(sizes, labels=labels, colors=colors,autopct='%1.1f%%', shadow=True, startangle=90)


    top 10 hashtags pie chart

  7. Now build a more complex report that decomposes the top 5 hashtags by sentiment score. Run the following code, which computes the mean of all the sentiment scores per hashtag and visualizes them in a multi-series bar chart.
    cols = tweets.columns[-9:]
    def expand( t ):
        ret = []
        for s in [i[0] for i in top10tags]:
            if ( s in t.text ):
                for tone in cols:
                    ret += [s + u"-" + unicode(tone) + ":" + unicode(getattr(t, tone))]
        return ret
    def makeList(l):
        return l if isinstance(l, list) else [l]
    #Create RDD from tweets dataframe
    tagsRDD = tweets.map(lambda t: t )
    #Filter to only keep the entries that are in top10tags
    tagsRDD = tagsRDD.filter( lambda t: any(s in t.text for s in [i[0] for i in top10tags] ) )
    #Create a flatMap using the expand function defined above, this will be used to collect all the scores
    #for a particular tag with the following format: Tag-Tone-ToneScore
    tagsRDD = tagsRDD.flatMap( expand )
    #Create a map indexed by Tag-Tone keys
    tagsRDD = tagsRDD.map( lambda fullTag : (fullTag.split(":")[0], float( fullTag.split(":")[1]) ))
    #Call combineByKey to format the data as follow
    #Value=(count, sum_of_all_score_for_this_tone)
    tagsRDD = tagsRDD.combineByKey((lambda x: (x,1)),
                      (lambda x, y: (x[0] + y, x[1] + 1)),
                      (lambda x, y: (x[0] + y[0], x[1] + y[1])))
    #ReIndex the map to have the key be the Tag and value be (Tone, Average_score) tuple
    #Value=(Tone, average_score)
    tagsRDD = tagsRDD.map(lambda (key, ab): (key.split("-")[0], (key.split("-")[1], round(ab[0]/ab[1], 2))))
    #Reduce the map on the Tag key, value becomes a list of (Tone,average_score) tuples
    tagsRDD = tagsRDD.reduceByKey( lambda x, y : makeList(x) + makeList(y) )
    #Sort the (Tone,average_score) tuples alphabetically by Tone
    tagsRDD = tagsRDD.mapValues( lambda x : sorted(x) )
    #Format the data as expected by the plotting code in the next cell.
    #map the Values to a tuple as follow: ([list of tone], [list of average score])
    #e.g. #someTag:([u'Agreeableness', u'Analytical', u'Anger', u'Cheerfulness', u'Confident', u'Conscientiousness', u'Negative', u'Openness', u'Tentative'], [1.0, 0.0, 0.0, 1.0, 0.0, 0.48, 0.0, 0.02, 0.0])
    tagsRDD = tagsRDD.mapValues( lambda x : ([elt[0] for elt in x],[elt[1] for elt in x])  )
    #Use custom sort function to sort the entries by order of appearance in top10tags
    def customCompare( key ):
        for (k,v) in top10tags:
            if k == key:
                return v
        return 0
    tagsRDD = tagsRDD.sortByKey(ascending=False, numPartitions=None, keyfunc = customCompare)
    #Take the mean tone scores for the top 10 tags
    top10tagsMeanScores = tagsRDD.take(10)
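    The combineByKey step at the heart of this pipeline is just a running (sum, count) fold per Tag-Tone key, from which the mean is derived. A plain-Python sketch of that aggregation (the sample pairs are made up for illustration):

```python
# Fold (tag-tone, score) pairs into (sum, count) per key, then emit the
# mean -- the same shape as createCombiner/mergeValue in combineByKey.
pairs = [("#spark-Anger", 100.0), ("#spark-Anger", 0.0), ("#spark-Openness", 50.0)]

acc = {}
for key, score in pairs:
    total, count = acc.get(key, (0.0, 0))
    acc[key] = (total + score, count + 1)

means = {key: round(total / count, 2) for key, (total, count) in acc.items()}
print(means["#spark-Anger"])  # 50.0
```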
  8. Input the following visualization code to plot the data in a multi-series bar chart. It also provides a custom legend to present the data more clearly:
    %matplotlib inline
    import matplotlib
    import numpy as np
    import matplotlib.pyplot as plt
    params = plt.gcf()
    plSize = params.get_size_inches()
    params.set_size_inches( (plSize[0]*3, plSize[1]*2) )
    top5tagsMeanScores = top10tagsMeanScores[:5]
    width = 0
    idx = 0
    (a,b) = top5tagsMeanScores[0]
    labels = b[0]  # the list of tone names, identical for every entry
    ind = np.arange(len(labels))  # x locations for the 9 tones
    colors = ["beige", "paleturquoise", "pink", "lightyellow", "coral", "lightgreen", "gainsboro", "aquamarine","c"]
    for key, value in top5tagsMeanScores:
        plt.bar(ind + width, value[1], 0.15, color=colors[idx], label=key)
        width += 0.15
        idx += 1
    plt.xticks(ind+0.3, labels)
    plt.ylabel('AVERAGE SCORE')
    plt.title('Breakdown of top hashtags by sentiment tones')
    plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc='center',ncol=5, mode="expand", borderaxespad=0.)


    breakdown by tone scores

See more: View a copy of this IPython Notebook on GitHub


In this tutorial, you learned how to:

  • build a complex Apache® Spark™ solution that integrates multiple services from Bluemix
  • load data into Spark SQL DataFrames and query it using SQL
  • run complex analytics using RDD transformations and actions
  • create compelling visualizations using the powerful matplotlib Python package provided in the IPython Notebook

This tutorial shows the power and potential of the Apache® Spark™ engine and programming model. Hopefully these examples have inspired you to run your own analytics and reports with these fast and flexible tools. If you’re interested in real-time analysis, read my follow-up on adding IBM Message Hub (Apache Kafka) into the mix.