Frequency Analysis
What are the frequencies in the data?
Any decent data analysis project requires spending a fair amount of time getting to know the data: calculating summary statistics, checking distributions, and performing exploratory analysis.
Analyzing Twitter data is no exception, and we need to tackle questions such as:
What is the average number of words per tweet?
What is the average word-length?
What is the number of hashtags per tweet?
What is the lexical diversity of tweets?
What are the most frequent words / terms?
One of the simplest techniques you can apply to answer these questions is basic frequency analysis.
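Before jumping into the tweets, the core idea of frequency analysis can be sketched on a toy vector of words (the words below are made up for illustration):

```r
# toy vector of words (hypothetical, for illustration only)
toy_words = c("ice", "cream", "ice", "cold", "ice", "cream")
# count how many times each word occurs
word_freqs = table(toy_words)
# sort in decreasing order: "ice" (3), "cream" (2), "cold" (1)
sort(word_freqs, decreasing = TRUE)
```

The same `table()` + `sort()` pattern is exactly what we'll apply to the full list of words extracted from the tweets.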
Example: Ice, Ice... "icecream"
We'll use a toy example, searching for tweets about "icecream".
To keep things simple, we'll do the frequency analysis on the extracted tweets without any data cleaning.
Step 1: Load the required packages
# load packages
library(XML)
library(tm)
library(ggplot2)
Step 2: Let's get some tweets containing the term "icecream"
# define twitter search url (following the atom standard)
twitter_url = "http://search.twitter.com/search.atom?"
# vector to store results
results = character(0)
# paginate 20 times to harvest tweets
for (page in 1:20)
{
# create twitter search query to be parsed
# tweets in english containing 'icecream'
twitter_search = paste(twitter_url, "q=icecream",
"&rpp=100&lang=en&page=", page, sep="")
# let's parse with xmlParseDoc
tmp = xmlParseDoc(twitter_search, asText=FALSE)
# extract titles
results = c(results, xpathSApply(tmp, "//s:entry/s:title", xmlValue,
namespaces=c('s'='http://www.w3.org/2005/Atom')))
}
# how many tweets
length(results)
Step 3.1: how many characters per tweet?
# characters per tweet
chars_per_tweet = sapply(results, nchar)
summary(chars_per_tweet)
Step 3.2: how many words per tweet?
# split words
words_list = strsplit(results, " ")
# words per tweet
words_per_tweet = sapply(words_list, length)
# barplot
barplot(table(words_per_tweet), border=NA,
main="Distribution of words per tweet", cex.main=1)
# length of words per tweet
wsize_per_tweet = sapply(words_list, function(x) mean(nchar(x)))
# barplot
barplot(table(round(wsize_per_tweet)), border=NA,
xlab = "word length in number of characters",
main="Distribution of words length per tweet", cex.main=1)
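Note that splitting on a single space keeps punctuation glued to the words ("icecream!" is counted as a different word than "icecream"). Since we are skipping data cleaning here this is fine, but a slightly more careful tokenization could be sketched like this (the helper name is made up):

```r
# splitting on runs of non-word characters instead of a single space
# strips punctuation, so "icecream!" and "icecream" collapse together
clean_split = function(txt) strsplit(tolower(txt), "[^a-z0-9#@']+")
# "Ice," "ice..." and "ICECREAM!" all reduce to bare lowercase words
clean_split("Ice, ice... ICECREAM!")
```

Keeping `#` and `@` inside the character class preserves hashtags and mentions as single tokens.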
Step 3.3: how many unique words per tweet?
# how many unique words per tweet
uniq_words_per_tweet = sapply(words_list, function(x) length(unique(x)))
# barplot
barplot(table(uniq_words_per_tweet), border=NA,
main="Distribution of unique words per tweet", cex.main=1)
Step 3.4: how many hashtags per tweet?
# how many hashtags per tweet
hash_per_tweet = sapply(words_list, function(x) length(grep("#", x)))
table(hash_per_tweet)
prop.table(table(hash_per_tweet))
Step 3.5: how many @mentions per tweet?
# how many @mentions per tweet
ats_per_tweet = sapply(words_list, function(x) length(grep("@", x)))
table(ats_per_tweet)
prop.table(table(ats_per_tweet))
Step 3.6: how many http links per tweet?
# how many http links per tweet
links_per_tweet = sapply(words_list, function(x) length(grep("http", x)))
table(links_per_tweet)
prop.table(table(links_per_tweet))
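A small caveat on the three counts above: `grep("#", x)` matches the pattern anywhere in a word, so something like "so#good" is counted as a hashtag. A stricter sketch (the helper name is hypothetical) anchors the match to the start of the word:

```r
# stricter counting: only words that start with '#' count as hashtags
strict_hash_count = function(x) length(grep("^#", x))
# "so#good" is no longer counted; only "#icecream" matches
strict_hash_count(c("#icecream", "yum", "so#good"))
```

The same anchored-pattern idea (`"^@"`, `"^http"`) applies to mentions and links.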
Step 4: let's put all the calculated quantities in a data frame and make some plots
# data frame
icedf = data.frame(
chars=chars_per_tweet,
words = words_per_tweet,
lengths = wsize_per_tweet,
uniqs = uniq_words_per_tweet,
hashs = hash_per_tweet,
ats = ats_per_tweet,
links = links_per_tweet
)
The more words in a tweet, the more characters in the tweet
# words -vs- chars
ggplot(icedf, aes(x=words, y=chars)) +
geom_point(colour="gray20", alpha=0.2) +
stat_smooth(method="lm") +
labs(x="number of words per tweet", y="number of characters per tweet") +
ggtitle("Tweets about 'icecream' \nNumber of words -vs- Number of characters") +
theme(plot.title = element_text(size=12))
The more words in a tweet, the shorter the words
# words -vs- word length
ggplot(icedf, aes(x=words, y=lengths)) +
geom_point(colour="gray20", alpha=0.2) +
stat_smooth(method="lm") +
labs(x="number of words per tweet", y="size of words per tweet") +
ggtitle("Tweets about 'icecream' \nNumber of words -vs- Length of words") +
theme(plot.title = element_text(size=12))
Step 5: Lexical diversity: number of unique tokens / number of total tokens
# unique words in total
uniq_words = unique(unlist(words_list))
# lexical diversity
length(uniq_words) / length(unlist(words_list))
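To see what this ratio measures, here is the same calculation on a tiny made-up token list:

```r
# toy token list: 5 tokens in total, 3 of them unique
toy_tokens = c("ice", "cream", "ice", "cream", "yum")
# lexical diversity = unique tokens / total tokens = 3 / 5 = 0.6
length(unique(toy_tokens)) / length(toy_tokens)
```
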
The lexical diversity reflects the range of vocabulary used in the tweets: a ratio closer to 1 indicates a more varied vocabulary, while a ratio closer to 0 indicates heavy repetition of the same words.
Step 6: what are the most frequent words?
# most frequent words
mfw = sort(table(unlist(words_list)), decreasing=TRUE)
# top-20 most frequent
top20 = head(mfw, 20)
# barplot
barplot(top20, border=NA, las=2, main="Top 20 most frequent terms", cex.main=1)
© Gaston Sanchez