Tweets Containing #hashtags
Checking the hashtags
Tweets containing hashtags are often considered more valuable than those without, because the author has deliberately gone to the trouble of embedding aggregatable information in them. Applying the concepts of frequency analysis, we could compute things like the average number of hashtags per tweet or the average length of a hashtag. But since I love to visualize data, nothing is more tempting than producing some wordclouds. Let's see a simple example.
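As a quick aside, the summary statistics mentioned above can be sketched in a few lines of base R. This is only an illustration: the three tweets are made up, the `toy_` and `avg_` names are hypothetical, and `regmatches()` with `gregexpr()` stands in for `str_extract_all()` so the snippet runs without extra packages.

```r
# toy tweets (made-up data, for illustration only)
toy_tweets = c(
  "Clean air matters #EarthDay #climate",
  "New study on sleep #health",
  "No tags in this one"
)

# extract the hashtags of each tweet (one list element per tweet)
toy_tags = regmatches(toy_tweets, gregexpr("#\\w+", toy_tweets))

# average number of hashtags per tweet
avg_tags_per_tweet = mean(lengths(toy_tags))

# average hashtag length in characters (the leading "#" is counted)
avg_tag_length = mean(nchar(unlist(toy_tags)))
```

Tweets without hashtags contribute a zero to the first average, which is usually what you want when describing an account's overall tagging habits.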
What do @user1, @user2, and @user3 hashtag about?
Let's say we want to know more about the tweets of three given users.
For example, consider the Twitter accounts of three US federal government agencies:
@EPAgov (Environmental Protection Agency)
@NIHforHealth (National Institutes of Health)
@CDCgov (Centers for Disease Control and Prevention)
Step 1: Load the required packages
# load packages
library(twitteR)    # userTimeline() and twListToDF()
library(stringr)    # str_extract_all()
library(wordcloud)  # wordcloud()
Step 2: Collect tweets from each @user
# harvest tweets from each user
epa_tweets = userTimeline("EPAgov", n=500)
nih_tweets = userTimeline("NIHforHealth", n=500)
cdc_tweets = userTimeline("CDCgov", n=500)
# dump tweets information into data frames
epa_df = twListToDF(epa_tweets)
nih_df = twListToDF(nih_tweets)
cdc_df = twListToDF(cdc_tweets)
Step 3: Let's see what hashtags they use
# get the hashtags
epa_hashtags = str_extract_all(epa_df$text, "#\\w+")
nih_hashtags = str_extract_all(nih_df$text, "#\\w+")
cdc_hashtags = str_extract_all(cdc_df$text, "#\\w+")
# put tags in vector
epa_hashtags = unlist(epa_hashtags)
nih_hashtags = unlist(nih_hashtags)
cdc_hashtags = unlist(cdc_hashtags)
# calculate hashtag frequencies
epa_tags_freq = table(epa_hashtags)
nih_tags_freq = table(nih_hashtags)
cdc_tags_freq = table(cdc_hashtags)
# put all tags in a single vector
all_tags = c(epa_tags_freq, nih_tags_freq, cdc_tags_freq)
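Two details of this step are easy to miss. First, `table()` treats `#Health` and `#health` as different tags, whereas Twitter generally matches hashtags regardless of case. Second, the frequency table is ordered by tag name, not by count. A minimal sketch on made-up hashtags (the `toy_` names are hypothetical and not part of the analysis above) shows one way to handle both:

```r
# made-up hashtags, as they might come out of unlist(str_extract_all(...))
toy_hashtags = c("#flu", "#health", "#flu", "#Health", "#vaccines")

# case-sensitive counts: "#health" and "#Health" are separate entries
toy_freq = table(toy_hashtags)

# case-insensitive counts: normalize to lower case before tabulating
toy_freq_lower = table(tolower(toy_hashtags))

# most frequent tags first
toy_top = sort(toy_freq_lower, decreasing = TRUE)
head(toy_top, 2)
```

Whether to merge case variants is a judgment call: merging gives cleaner counts, while keeping them preserves the spelling each account actually used.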
Step 4: Let's plot wordclouds for each user
# EPA hashtags wordcloud
wordcloud(names(epa_tags_freq), epa_tags_freq, random.order=FALSE,
  colors="#1B9E77")
title("\n\nHashtags in tweets from @EPAgov",
  cex.main=1.5, col.main="gray50")
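# NIH hashtags wordcloud
# (adding 7 lifts every count above wordcloud()'s default min.freq of 3,
# so no tag is dropped; it also flattens the relative size differences)
wordcloud(names(nih_tags_freq), nih_tags_freq + 7, random.order=FALSE,
  colors="#7570B3")
title("\nHashtags in tweets from @NIHforHealth",
  cex.main=1.5, col.main="gray50")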
# CDC hashtags wordcloud
wordcloud(names(cdc_tags_freq), cdc_tags_freq, random.order=FALSE,
  colors="#D95F02")
title("\n\nHashtags in tweets from @CDCgov",
  cex.main=1.5, col.main="gray50")
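The three calls above differ only in data, color, and title, so they could be folded into one helper. The sketch below is an assumption about how that might look, not part of the original code: `plot_tag_cloud()` and its arguments are hypothetical names, `min.freq = 1` is passed explicitly because `wordcloud()` by default omits words that occur fewer than 3 times, and `set.seed()` is added so the random layout is repeatable.

```r
# hypothetical helper: draw one hashtag wordcloud with a title
plot_tag_cloud = function(tags_freq, user, color, seed = 123) {
  # word placement is random, so fix the seed for a repeatable layout
  set.seed(seed)
  wordcloud::wordcloud(names(tags_freq), as.numeric(tags_freq),
    min.freq = 1,  # keep tags that appear only once or twice
    random.order = FALSE, colors = color)
  title(paste0("\n\nHashtags in tweets from @", user),
    cex.main = 1.5, col.main = "gray50")
  invisible(tags_freq)
}

# usage, assuming the frequency tables from Step 3 exist:
# plot_tag_cloud(epa_tags_freq, "EPAgov", "#1B9E77")
# plot_tag_cloud(nih_tags_freq, "NIHforHealth", "#7570B3")
# plot_tag_cloud(cdc_tags_freq, "CDCgov", "#D95F02")
```

With an explicit `min.freq`, all three clouds apply the same cutoff, which makes them easier to compare.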
Step 5: Now let's plot one single wordcloud
# vector of colors
cols = c(
  rep("#1B9E77", length(epa_tags_freq)),
  rep("#7570B3", length(nih_tags_freq)),
  rep("#D95F02", length(cdc_tags_freq))
)
# wordcloud
wordcloud(names(all_tags), all_tags, random.order=FALSE, min.freq=1,
  colors=cols, ordered.colors=TRUE)
mtext(c("@EPAgov", "@NIHforHealth", "@CDCgov"), side=3,
  line=2, at=c(0.25, 0.5, 0.75), col=c("#1B9E77", "#7570B3", "#D95F02"),
  family="serif", font=2, cex=1.5)
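The combined cloud relies on `cols` lining up element by element with `all_tags`, which is what `ordered.colors=TRUE` expects. A small check on made-up frequency tables (the `toy_` objects are hypothetical) illustrates why the alignment holds. It also shows that a hashtag used by two accounts appears twice in the combined vector, once per account and in each account's color:

```r
# made-up frequency tables for two accounts
toy_epa = table(c("#air", "#air", "#water"))
toy_cdc = table(c("#flu", "#air"))

# c() on tables returns a named vector, one entry per tag per account
toy_all = c(toy_epa, toy_cdc)

# one color per entry, in the same order as toy_all
toy_cols = c(
  rep("#1B9E77", length(toy_epa)),
  rep("#D95F02", length(toy_cdc))
)

# the two vectors must have the same length for ordered colors to work
stopifnot(length(toy_cols) == length(toy_all))

# tags shared between accounts show up more than once
toy_shared = names(toy_all)[duplicated(names(toy_all))]
```

If you would rather show each shared hashtag only once, the counts would need to be summed per tag first, at the cost of losing the per-account coloring.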
© Gaston Sanchez