Frequency Analysis
What are the frequencies in the data?
Any decent data analysis project requires spending a fair amount of time getting to know the data: calculating summary statistics, checking distributions, and performing exploratory analysis.
Analyzing Twitter data is no exception, and we need to tackle questions such as:
What is the average number of words per tweet?
What is the average word-length?
What is the number of hashtags per tweet?
What is the lexical diversity of tweets?
What are the most frequent words / terms?
One of the simplest techniques you can apply to answer these questions is basic frequency analysis.
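Before jumping into the tweets, the core idea of frequency analysis can be sketched on a toy vector of words (the words below are made up for illustration):

```r
# toy vector of words (hypothetical, for illustration only)
toy_words = c("ice", "cream", "ice", "cold", "ice", "cream")
# count how many times each word occurs
word_freqs = table(toy_words)
# sort in decreasing order: "ice" (3), "cream" (2), "cold" (1)
sort(word_freqs, decreasing = TRUE)
```

The same `table()` + `sort()` pattern is exactly what we'll apply to the full list of words extracted from the tweets.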
Example: Ice, Ice... "icecream"
We'll use a toy example, searching for tweets about "icecream".
To keep things simple, we'll do the frequency analysis on the extracted tweets without any data cleaning.
Step 1: Load the required packages
# load packages
library(XML)
library(tm)
library(ggplot2)
Step 2: Let's get some tweets containing the term "icecream"
# define twitter search url (following the atom standard)
twitter_url = "http://search.twitter.com/search.atom?"
# vector to store results
results = character(0)
# paginate 20 times to harvest tweets
for (page in 1:20)
{
# create twitter search query to be parsed
# tweets in english containing 'icecream'
twitter_search = paste(twitter_url, "q=icecream",
"&rpp=100&lang=en&page=", page, sep="")
# let's parse with xmlParseDoc
tmp = xmlParseDoc(twitter_search, asText=FALSE)
# extract titles
results = c(results, xpathSApply(tmp, "//s:entry/s:title", xmlValue,
namespaces=c('s'='http://www.w3.org/2005/Atom')))
}
# how many tweets
length(results)
Step 3.1: how many characters per tweet?
# characters per tweet
chars_per_tweet = sapply(results, nchar)
summary(chars_per_tweet)
Step 3.2: how many words per tweet?
# split words
words_list = strsplit(results, " ")
# words per tweet
words_per_tweet = sapply(words_list, length)
# barplot
barplot(table(words_per_tweet), border=NA,
main="Distribution of words per tweet", cex.main=1)
# length of words per tweet
wsize_per_tweet = sapply(words_list, function(x) mean(nchar(x)))
# barplot
barplot(table(round(wsize_per_tweet)), border=NA,
xlab = "word length in number of characters",
main="Distribution of words length per tweet", cex.main=1)
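Note that splitting on a single space keeps punctuation glued to the words ("icecream!" is counted as a different word than "icecream"). Since we are skipping data cleaning here this is fine, but a slightly more careful tokenization could be sketched like this (the helper name is made up):

```r
# splitting on runs of non-word characters instead of a single space
# strips punctuation, so "icecream!" and "icecream" collapse together
clean_split = function(txt) strsplit(tolower(txt), "[^a-z0-9#@']+")
# "Ice," "ice..." and "ICECREAM!" all reduce to bare lowercase words
clean_split("Ice, ice... ICECREAM!")
```

Keeping `#` and `@` inside the character class preserves hashtags and mentions as single tokens.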
Step 3.3: how many unique words per tweet?
# how many unique words per tweet
uniq_words_per_tweet = sapply(words_list, function(x) length(unique(x)))
# barplot
barplot(table(uniq_words_per_tweet), border=NA,
main="Distribution of unique words per tweet", cex.main=1)
Step 3.4: how many hashtags per tweet?
# how many hashtags per tweet
hash_per_tweet = sapply(words_list, function(x) length(grep("#", x)))
table(hash_per_tweet)
prop.table(table(hash_per_tweet))
Step 3.5: how many @mentions per tweet?
# how many @mentions per tweet
ats_per_tweet = sapply(words_list, function(x) length(grep("@", x)))
table(ats_per_tweet)
prop.table(table(ats_per_tweet))
Step 3.6: how many http links per tweet?
# how many http links per tweet
links_per_tweet = sapply(words_list, function(x) length(grep("http", x)))
table(links_per_tweet)
prop.table(table(links_per_tweet))
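A small caveat on the three counts above: `grep("#", x)` matches the pattern anywhere in a word, so something like "so#good" is counted as a hashtag. A stricter sketch (the helper name is hypothetical) anchors the match to the start of the word:

```r
# stricter counting: only words that start with '#' count as hashtags
strict_hash_count = function(x) length(grep("^#", x))
# "so#good" is no longer counted; only "#icecream" matches
strict_hash_count(c("#icecream", "yum", "so#good"))
```

The same anchored-pattern idea (`"^@"`, `"^http"`) applies to mentions and links.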
Step 4: let's put all the calculated quantities in a data frame and make some plots
# data frame
icedf = data.frame(
chars=chars_per_tweet,
words = words_per_tweet,
lengths = wsize_per_tweet,
uniqs = uniq_words_per_tweet,
hashs = hash_per_tweet,
ats = ats_per_tweet,
links = links_per_tweet
)
The more words in a tweet, the more characters in the tweet
# words -vs- chars
ggplot(icedf, aes(x=words, y=chars)) +
geom_point(colour="gray20", alpha=0.2) +
stat_smooth(method="lm") +
labs(x="number of words per tweet", y="number of characters per tweet") +
ggtitle("Tweets about 'icecream' \nNumber of words -vs- Number of characters") +
theme(plot.title = element_text(size=12))
The more words in a tweet, the shorter the words
# words -vs- word length
ggplot(icedf, aes(x=words, y=lengths)) +
geom_point(colour="gray20", alpha=0.2) +
stat_smooth(method="lm") +
labs(x="number of words per tweet", y="size of words per tweet") +
ggtitle("Tweets about 'icecream' \nNumber of words -vs- Length of words") +
theme(plot.title = element_text(size=12))
Step 5: Lexical diversity: number of unique tokens / number of total tokens
# unique words in total
uniq_words = unique(unlist(words_list))
# lexical diversity
length(uniq_words) / length(unlist(words_list))
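To see what this ratio measures, here is the same calculation on a tiny made-up token list:

```r
# toy token list: 5 tokens in total, 3 of them unique
toy_tokens = c("ice", "cream", "ice", "cream", "yum")
# lexical diversity = unique tokens / total tokens = 3 / 5 = 0.6
length(unique(toy_tokens)) / length(toy_tokens)
```
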
The lexical diversity reflects the range of vocabulary used in the tweets: a ratio closer to 1 indicates a more varied vocabulary, while a ratio closer to 0 indicates heavy repetition of the same words.
Step 6: what are the most frequent words?
# most frequent words
mfw = sort(table(unlist(words_list)), decreasing=TRUE)
# top-20 most frequent
top20 = head(mfw, 20)
# barplot
barplot(top20, border=NA, las=2, main="Top 20 most frequent terms", cex.main=1)
© Gaston Sanchez