Processing Data (deprecated)
Processing collected data
Once we have the data (collected either with twitteR or with XML), the next steps usually involve some type of pre-processing, manipulation, cleaning, formatting, and filtering.
A) Processing tweet data via twitteR
After collecting the desired tweets, we need to extract their contents. The most straightforward way to extract the data from the collected tweets is with the function twListToDF, which puts everything into a data frame.
Let's collect some tweets containing the term "data mining"
# load the twitteR package
library(twitteR)
# collect tweets in english containing 'data mining'
tweets = searchTwitter("data mining", lang="en")
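By default, searchTwitter returns only a handful of tweets. If we want more, we can pass the n argument (the number actually returned still depends on what the search API provides); for example:
# collect up to 100 tweets in english containing 'data mining'
tweets = searchTwitter("data mining", n=100, lang="en")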
twListToDF: Dumping twitter data into a data frame
# convert tweets into a data frame
tweets_df = twListToDF(tweets)
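To get a quick feel for what twListToDF gives us, we can inspect the resulting data frame; the exact columns (e.g. text, screenName, created, id) may vary slightly depending on the version of twitteR.
# check the structure of the data frame
str(tweets_df)
# take a peek at the first few rows
head(tweets_df)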
If we don't want to use the twListToDF function, we can use the get methods
to extract the desired fields from a single element of the list of tweets.
getText: extract the text content of a single element
# extract the text content of the first tweet
tweets[[1]]$getText()
getScreenName: extract the user name of a single element
# extract the user name of the first tweet
tweets[[1]]$getScreenName()
getId: extract the tweet Id number
# extract the tweet Id of the first tweet
tweets[[1]]$getId()
getCreated: extract date and time of publication of a single element
# extract date and time of publication of the first tweet
tweets[[1]]$getCreated()
getStatusSource: extract source user agent of a single element
# extract source user agent
tweets[[1]]$getStatusSource()
Most of the time, we'll want to extract specific information from all the harvested tweets.
This can be done in several ways, although I prefer to use the sapply function.
# extract the text content of all the tweets
sapply(tweets, function(x) x$getText())
# extract the user names of all the tweets
sapply(tweets, function(x) x$getScreenName())
# extract the Id numbers of all the tweets
sapply(tweets, function(x) x$getId())
# extract the dates and times of publication of all the tweets
sapply(tweets, function(x) x$getCreated())
# extract the source user agents of all the tweets
sapply(tweets, function(x) x$getStatusSource())
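If we want to build our own data frame from these pieces instead of relying on twListToDF, a minimal sketch could look like the code below. The object names (txt, who, ids, when, my_tweets_df) are just illustrative; note also that the dates are kept as formatted text because sapply drops the date class when simplifying.
# extract the fields of interest
txt = sapply(tweets, function(x) x$getText())
who = sapply(tweets, function(x) x$getScreenName())
ids = sapply(tweets, function(x) x$getId())
# keep dates as formatted text (sapply would drop the POSIXct class)
when = sapply(tweets, function(x) format(x$getCreated()))
# combine everything into a data frame
my_tweets_df = data.frame(text=txt, screen_name=who, id=ids, created=when,
  stringsAsFactors=FALSE)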
B) Processing tweet data via XML
When Twitter data has been parsed with the XML package, extracting the information is a little trickier than with the functions of the twitteR package. I'm going to show you how to process the collected data, but I'm not going to discuss all the details behind the functions in XML. For more information, please refer to the following slides:
Extracting data from XML by Duncan Temple Lang
Course on XML Foundations by Ray Larson
When you parse Twitter results with XML, the structure of a tweet entry looks like the following code. I know it may seem scary for users inexperienced with XML, but don't worry.
<entry>
<id>tag:search.twitter.com,2005:204706537694953474</id>
<published>2012-05-21T22:53:36Z</published>
<link type="text/html" href="http://twitter.com/iamdanw_links/statuses/20470" rel="alternate"/>
<title>Photo: DATAMINING (via Business Insider) http://t.co/dKwnbnLD</title>
<content type="html">Photo: <em>DATAMINING</em>
(via A Tour Of Cenovus’ Energy’s In-Situ Christina Lake Facility - Business Insider)
<a href="http://t.co/dKwnbnLD">http://t.co/dKwnbnLD</a></content>
<updated>2012-05-21T22:53:36Z</updated>
<link type="image/png" href="http://a0.twimg.com/sticky/profile_4_normal.png" rel="image"/>
<twitter:geo/>
<twitter:metadata>
<twitter:result_type>recent</twitter:result_type>
</twitter:metadata>
<twitter:source><a href="http://www.tumblr.com/" rel="nofollow">Tumblr</a></twitter:source>
<twitter:lang>en</twitter:lang>
<author>
<name>iamdanw_links (Robot Dan)</name>
<uri>http://twitter.com/iamdanw_links</uri>
</author>
</entry>
We can distinguish the main XML tags:
<id>: tweet id
<published>: date and time of publication
<link>: links (to the tweet and to the user's profile image)
<title>: the text content of the tweet
<content>: the tweet content in HTML format
<twitter:lang>: language of the tweet
<name>: user name
Let's see an example with tweets containing the word "datamining"
First we collect some tweets
# load XML
library(XML)
# define base twitter search url (following the atom standard)
twitter_url = "http://search.twitter.com/search.atom?"
# create twitter search query to be parsed
twitter_search = paste(twitter_url, "q=datamining", sep="")
# let's parse with xmlParseDoc
tweets = xmlParseDoc(twitter_search, asText=FALSE)
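Before extracting anything, it can be useful to take a quick look at the parsed document, for instance checking the root node and how many children it has; a small sketch:
# get the root node of the parsed document
doc_root = xmlRoot(tweets)
# name of the root node (it should be 'feed' for an atom document)
xmlName(doc_root)
# number of child nodes hanging from the root
xmlSize(doc_root)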
In order to extract the information contained in those tags, we can use the xpathSApply function.
If you check the help documentation associated with this function, you'll see that one of its arguments is path. This argument takes the XPath expression to be evaluated.
For instance, the titles can be extracted with the following command:
# extracting titles
titles = xpathSApply(tweets, "//s:entry/s:title", xmlValue,
  namespaces = c('s'='http://www.w3.org/2005/Atom'))
In this case, the path is the character string "//s:entry/s:title". With this expression we are telling R to get the XML value contained in the tag <title> ... </title>, which in turn is contained within the tags <entry> ... </entry>. The namespaces argument maps the prefix s to the Atom Syndication Format namespace, an XML vocabulary used for describing web feeds.
The dates and times of publication can be extracted with
# extracting dates and times of publication
pub_dates = xpathSApply(tweets, "//s:entry/s:published", xmlValue,
  namespaces = c('s'='http://www.w3.org/2005/Atom'))
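Since xmlValue returns the publication dates as character strings (e.g. "2012-05-21T22:53:36Z"), we may want to convert them into proper date-time objects. A minimal sketch of that conversion:
# convert the ISO 8601 strings into POSIXct objects (UTC)
pub_dates = as.POSIXct(pub_dates, format="%Y-%m-%dT%H:%M:%SZ", tz="UTC")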
And the user names can be extracted like this
# extracting user names
authors = xpathSApply(tweets, "//s:entry/s:author/s:name", xmlValue,
  namespaces = c('s'='http://www.w3.org/2005/Atom'))
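Finally, just like with twListToDF, we can gather the extracted fields into a single data frame for further processing; this is a sketch using the objects created above (the name tweets_xml_df is just illustrative):
# combine titles, publication dates, and user names into a data frame
tweets_xml_df = data.frame(title=titles, published=pub_dates, author=authors,
  stringsAsFactors=FALSE)
# take a peek at the first rows
head(tweets_xml_df)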
© Gaston Sanchez - 2012