Basic Sentiment Analysis in R
Breen's Approach
One option to perform sentiment analysis in R is to follow what I call Breen's approach,
named after Jeffrey Breen's illuminating slides on Twitter sentiment analysis with R:
http://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/
The general idea is to calculate a sentiment score for each tweet so we can know
how positive or negative the posted message is.
There are different ways to calculate such scores,
and you can even create your own formula.
We'll use a very simple yet useful approach to define our score formula:
Score = Number of positive words - Number of negative words
If Score > 0, this means that the sentence has an overall 'positive opinion'
If Score < 0, this means that the sentence has an overall 'negative opinion'
If Score = 0, then the sentence is considered to be a 'neutral opinion'
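To make the formula concrete, here is a minimal sketch in base R that scores a few sentences against two tiny word lists. The lists and the helper name `toy_score` are made up for illustration; they are not part of any real lexicon.

```r
# Hypothetical toy word lists, for illustration only
toy_pos <- c("good", "great", "love")
toy_neg <- c("bad", "awful", "hate")

# Score = number of positive words - number of negative words
toy_score <- function(sentence, pos.words, neg.words) {
  # lower-case the text, then split it on runs of whitespace
  words <- unlist(strsplit(tolower(sentence), "\\s+"))
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

toy_score("I love this great wine", toy_pos, toy_neg)       # 2: positive
toy_score("awful soda and bad service", toy_pos, toy_neg)   # -2: negative
toy_score("good coffee but bad service", toy_pos, toy_neg)  # 0: neutral
```

Note that a score of 0 can mean either "no opinion words at all" or "as many positive as negative words"; the formula does not tell these cases apart.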
In order to count the number of positive and negative words, we need a very important ingredient:
an opinion lexicon in English, which fortunately is provided by Hu and Liu and can be accessed from: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews."
Proceedings of the ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle,
Washington, USA.
Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing
and Comparing Opinions on the Web." Proceedings of the 14th
International World Wide Web conference (WWW-2005), May 10-14, 2005, Chiba, Japan.
You can download the text files containing the positive and negative words:
positive_words.txt
negative_words.txt
Another important ingredient, shared by Jeff Breen, is a very handy function to calculate sentiment scores: check Breen's github repo on sentiment analysis for more details.
Example: Mood and Drinking
Let me show you a simple example of some of the things we can do with sentiment analysis.
Research Question: What's the mood associated with tweets containing some kind of drink? More specifically, what's the mood associated with drinks such as wine, beer, coffee and soda?
Step 1: Load necessary packages
library(twitteR)   # searchTwitter()
library(plyr)      # laply(), ddply()
library(stringr)   # str_split()
library(ggplot2)   # plots in steps 6 and 7
Step 2: Define function score.sentiment
# function score.sentiment
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
   # Parameters
   # sentences: vector of text to score
   # pos.words: vector of words of positive sentiment
   # neg.words: vector of words of negative sentiment
   # .progress: passed to laply() to control the progress bar
   # create simple array of scores with laply
   scores = laply(sentences,
   function(sentence, pos.words, neg.words)
   {
      # remove punctuation
      sentence = gsub("[[:punct:]]", "", sentence)
      # remove control characters
      sentence = gsub("[[:cntrl:]]", "", sentence)
      # remove digits
      sentence = gsub('\\d+', '', sentence)
      # define error handling function when trying tolower
      tryTolower = function(x)
      {
         # create missing value
         y = NA
         # tryCatch error
         try_error = tryCatch(tolower(x), error=function(e) e)
         # if not an error
         if (!inherits(try_error, "error"))
            y = tolower(x)
         # result
         return(y)
      }
      # use tryTolower with sapply
      sentence = sapply(sentence, tryTolower)
      # split sentence into words with str_split (stringr package)
      word.list = str_split(sentence, "\\s+")
      words = unlist(word.list)
      # compare words to the dictionaries of positive & negative terms
      pos.matches = match(words, pos.words)
      neg.matches = match(words, neg.words)
      # match() returns the position of the matched term or NA
      # we just want a TRUE/FALSE
      pos.matches = !is.na(pos.matches)
      neg.matches = !is.na(neg.matches)
      # final score
      score = sum(pos.matches) - sum(neg.matches)
      return(score)
   }, pos.words, neg.words, .progress=.progress )
   # data frame with scores for each sentence
   scores.df = data.frame(text=sentences, score=scores)
   return(scores.df)
}
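To see what the cleaning steps inside score.sentiment do to a single tweet, here is a small base R sketch of the same sequence (punctuation, control characters, digits, lower-casing, splitting). The helper name `clean_words` and the sample tweet are hypothetical, and the sketch uses base R's strsplit() in place of str_split() so it runs without extra packages.

```r
# Hypothetical helper mirroring the cleaning steps of score.sentiment
clean_words <- function(sentence) {
  sentence <- gsub("[[:punct:]]", "", sentence)  # drop punctuation
  sentence <- gsub("[[:cntrl:]]", "", sentence)  # drop control characters
  sentence <- gsub("\\d+", "", sentence)         # drop digits
  sentence <- tolower(sentence)
  # split on runs of whitespace
  unlist(strsplit(sentence, "\\s+"))
}

clean_words("Great beer!!! 100% awesome :)")
# "great" "beer" "awesome"
```

One consequence worth keeping in mind: because punctuation is removed before matching, a word like "don't" becomes "dont", so it can only match a lexicon entry spelled the same way.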
Step 3: We need to import the files containing the positive and negative words
# import positive and negative words
pos = readLines("positive_words.txt")
neg = readLines("negative_words.txt")
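If your copies of the word lists still contain header comments or blank lines (I'm assuming here that comment lines start with `;`, which is how I remember the original lexicon files), those lines would end up in `pos` and `neg` as if they were words. A minimal sketch of a reader that drops them follows; `read_lexicon` is a hypothetical helper, and the demo uses a throwaway file so it runs without the real lexicon.

```r
# Hypothetical helper: read a word list, dropping comments and blank lines
read_lexicon <- function(path) {
  lines <- trimws(readLines(path, warn = FALSE))
  # keep non-empty lines that do not start with ';' (assumed comment marker)
  lines[nzchar(lines) & !startsWith(lines, ";")]
}

# Demo on a temporary file standing in for a real word list
tmp <- tempfile(fileext = ".txt")
writeLines(c("; header comment", "", "good", "great "), tmp)
read_lexicon(tmp)
# "good" "great"
```

With the real files you would call it as `pos = read_lexicon("positive_words.txt")`, and likewise for the negative words.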
Step 4: Let's harvest tweets talking about wine, beer, coffee, and soda
# tweets with drinks
wine_tweets = searchTwitter("wine", n=500, lang="en")
beer_tweets = searchTwitter("beer", n=500, lang="en")
cofe_tweets = searchTwitter("coffee", n=500, lang="en")
soda_tweets = searchTwitter("soda", n=500, lang="en")
# get text
wine_txt = sapply(wine_tweets, function(x) x$getText())
beer_txt = sapply(beer_tweets, function(x) x$getText())
cofe_txt = sapply(cofe_tweets, function(x) x$getText())
soda_txt = sapply(soda_tweets, function(x) x$getText())
# how many tweets of each drink
nd = c(length(wine_txt), length(beer_txt), length(cofe_txt), length(soda_txt))
# join texts
drinks = c(wine_txt, beer_txt, cofe_txt, soda_txt)
Step 5: Apply score.sentiment and calculate more results
# apply function score.sentiment
scores = score.sentiment(drinks, pos, neg, .progress='text')
# add variables to data frame
scores$drink = factor(rep(c("wine", "beer", "coffee", "soda"), nd))
scores$very.pos = as.numeric(scores$score >= 2)
scores$very.neg = as.numeric(scores$score <= -2)
# how many very positives and very negatives
numpos = sum(scores$very.pos)
numneg = sum(scores$very.neg)
# global score
global_score = round( 100 * numpos / (numpos + numneg) )
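The bookkeeping in this step is easier to follow on a tiny example. The sketch below uses made-up scores for seven imaginary tweets (four about wine, three about soda); all the `toy_` names and values are hypothetical.

```r
# Hypothetical scores for seven tweets: first 4 about wine, last 3 about soda
toy_scores <- data.frame(score = c(3, 0, 2, -1, -2, -3, 4))
toy_nd <- c(4, 3)  # number of tweets per drink, in the order they were joined

# rep() repeats each label as many times as there are tweets for that drink
toy_scores$drink <- factor(rep(c("wine", "soda"), toy_nd))

# flag strongly opinionated tweets
toy_scores$very.pos <- as.numeric(toy_scores$score >= 2)
toy_scores$very.neg <- as.numeric(toy_scores$score <= -2)

toy_numpos <- sum(toy_scores$very.pos)  # 3 tweets: scores 3, 2 and 4
toy_numneg <- sum(toy_scores$very.neg)  # 2 tweets: scores -2 and -3

# share of very positive tweets among the strongly opinionated ones
round(100 * toy_numpos / (toy_numpos + toy_numneg))  # 60
```

So the global score is a percentage between 0 and 100. If no tweet is very positive or very negative, the denominator is zero and the result is NaN.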
Step 6: Get a boxplot
# colors
cols = c("#7CAE00", "#00BFC4", "#F8766D", "#C77CFF")
names(cols) = c("beer", "coffee", "soda", "wine")
# boxplot
ggplot(scores, aes(x=drink, y=score, group=drink)) +
   geom_boxplot(aes(fill=drink)) +
   scale_fill_manual(values=cols) +
   geom_jitter(colour="gray40",
      position=position_jitter(width=0.2), alpha=0.3) +
   # labs() replaces opts(), which was removed from ggplot2
   labs(title = "Boxplot - Drink's Sentiment Scores")
Step 7: Make some barplots
As you can tell, wine gets the highest sentiment score, while soda gets the lowest.
# barplot of average score
meanscore = tapply(scores$score, scores$drink, mean)
df = data.frame(drink=names(meanscore), meanscore=meanscore)
df$drinks <- reorder(df$drink, df$meanscore)
ggplot(df, aes(y=meanscore)) +
   # stat="identity" plots the values in meanscore instead of counting rows
   geom_bar(data=df, aes(x=drinks, fill=drinks), stat="identity") +
   scale_fill_manual(values=cols[order(df$meanscore)]) +
   labs(title = "Average Sentiment Score") +
   theme(legend.position = "none")
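The tapply() and reorder() calls do the real work here: the first averages the scores per drink, the second sorts the factor levels by that average so the bars appear from lowest to highest. A small sketch with made-up numbers (all `toy_` names and values are hypothetical):

```r
# Hypothetical scores, two tweets per drink
toy_tweets <- data.frame(
  drink = c("wine", "wine", "beer", "beer", "soda", "soda"),
  score = c(2, 1, 1, 0, -1, 0)
)

# average score per drink: beer = 0.5, soda = -0.5, wine = 1.5
toy_mean <- tapply(toy_tweets$score, toy_tweets$drink, mean)

toy_df <- data.frame(drink = names(toy_mean), meanscore = as.numeric(toy_mean))

# reorder the factor levels by increasing mean score
toy_df$drinks <- reorder(toy_df$drink, toy_df$meanscore)
levels(toy_df$drinks)
# "soda" "beer" "wine"
```

Without reorder(), the bars would be drawn in alphabetical order of the drink names.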
If we examine the very positive scores, we'll see that wine receives the highest values
# barplot of average very positive
drink_pos = ddply(scores, .(drink), summarise, mean_pos=mean(very.pos))
drink_pos$drinks <- reorder(drink_pos$drink, drink_pos$mean_pos)
ggplot(drink_pos, aes(y=mean_pos)) +
   # stat="identity" plots the values in mean_pos instead of counting rows
   geom_bar(data=drink_pos, aes(x=drinks, fill=drinks), stat="identity") +
   scale_fill_manual(values=cols[order(drink_pos$mean_pos)]) +
   labs(title = "Average Very Positive Sentiment Score") +
   theme(legend.position = "none")
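Since very.pos is a 0/1 indicator, its mean per drink is simply the proportion of that drink's tweets with a score of 2 or more. Here is a sketch of the same summary using base R's aggregate() instead of ddply(), on made-up data (the `toy_` names and values are hypothetical):

```r
# Hypothetical scores: four wine tweets and two soda tweets
toy_flags <- data.frame(
  drink = c("wine", "wine", "wine", "wine", "soda", "soda"),
  score = c(3, 2, 2, 1, 2, -1)
)
toy_flags$very.pos <- as.numeric(toy_flags$score >= 2)

# mean of a 0/1 indicator = proportion of very positive tweets per drink
toy_pos <- aggregate(very.pos ~ drink, data = toy_flags, FUN = mean)
toy_pos
# wine: 3 of 4 tweets are very positive (0.75); soda: 1 of 2 (0.5)
```

The same reading applies to the very negative summary: mean(very.neg) is the share of tweets with a score of -2 or less.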
Conversely, if we check the very negative scores, soda is the one that has the worst score
# barplot of average very negative
drink_neg = ddply(scores, .(drink), summarise, mean_neg=mean(very.neg))
drink_neg$drinks <- reorder(drink_neg$drink, drink_neg$mean_neg)
ggplot(drink_neg, aes(y=mean_neg)) +
   # stat="identity" plots the values in mean_neg instead of counting rows
   geom_bar(data=drink_neg, aes(x=drinks, fill=drinks), stat="identity") +
   scale_fill_manual(values=cols[order(drink_neg$mean_neg)]) +
   labs(title = "Average Very Negative Sentiment Score") +
   theme(legend.position = "none")
© Gaston Sanchez - 2012