Processing Data (deprecated)
Processing collected data
Once we have the data (collected either with twitteR or with XML), the next steps usually involve some type of pre-processing, manipulation, cleaning, formatting, and filtering.
A) Processing tweet data via twitteR
After collecting the desired tweets, we need to extract their contents. The most straightforward way to extract the data from the collected tweets is with the function twListToDF, which puts everything into a data frame.
Let's collect some tweets containing the term "data mining"
# load the twitteR package
library(twitteR)
# collect tweets in english containing 'data mining'
tweets = searchTwitter("data mining", lang="en")
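By default, searchTwitter returns only a handful of tweets. If we want more, we can pass the n argument (the number actually returned still depends on what the search API provides); for example:
# collect up to 100 tweets in english containing 'data mining'
tweets = searchTwitter("data mining", n=100, lang="en")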
twListToDF: Dumping twitter data into a data frame
# convert tweets into a data frame
tweets_df = twListToDF(tweets)
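To get a quick feel for what twListToDF gives us, we can inspect the resulting data frame; the exact columns (e.g. text, screenName, created, id) may vary slightly depending on the version of twitteR.
# check the structure of the data frame
str(tweets_df)
# take a peek at the first few rows
head(tweets_df)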
If we don't want to use the twListToDF function, we can use the get methods
to extract the desired fields from a single element of the list of tweets.
getText: extract the text content of a single element
# extract the text content of the first tweet
tweets[[1]]$getText()
getScreenName: extract the user name of a single element
# extract the user name of the first tweet
tweets[[1]]$getScreenName()
getId: extract the tweet Id number
# extract the tweet Id of the first tweet
tweets[[1]]$getId()
getCreated: extract date and time of publication of a single element
# extract date and time of publication of the first tweet
tweets[[1]]$getCreated()
getStatusSource: extract source user agent of a single element
# extract source user agent
tweets[[1]]$getStatusSource()
Most of the time, we'll want to extract specific information from all the harvested tweets.
This can be done in several ways, although I prefer to use the sapply function.
# extract the text content of all the tweets
sapply(tweets, function(x) x$getText())
# extract the user names of all the tweets
sapply(tweets, function(x) x$getScreenName())
# extract the Id numbers of all the tweets
sapply(tweets, function(x) x$getId())
# extract the dates and times of publication of all the tweets
sapply(tweets, function(x) x$getCreated())
# extract the source user agents of all the tweets
sapply(tweets, function(x) x$getStatusSource())
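If we want to build our own data frame from these pieces instead of relying on twListToDF, a minimal sketch could look like the code below. The object names (txt, who, ids, when, my_tweets_df) are just illustrative; note also that the dates are kept as formatted text because sapply drops the date class when simplifying.
# extract the fields of interest
txt = sapply(tweets, function(x) x$getText())
who = sapply(tweets, function(x) x$getScreenName())
ids = sapply(tweets, function(x) x$getId())
# keep dates as formatted text (sapply would drop the POSIXct class)
when = sapply(tweets, function(x) format(x$getCreated()))
# combine everything into a data frame
my_tweets_df = data.frame(text=txt, screen_name=who, id=ids, created=when,
  stringsAsFactors=FALSE)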
B) Processing tweet data via XML
When Twitter data has been parsed with the XML package, extracting the information is a little trickier than with the functions of the twitteR package. I'm going to show you how to process the collected data, but I'm not going to discuss all the details behind the functions in XML. For more information, please refer to the following slides:
Extracting data from XML by Duncan Temple Lang
Course on XML Foundations by Ray Larson
When you parse Twitter results with XML, the structure of a tweet entry looks like the following code. I know it may seem scary for users inexperienced with XML, but don't worry.
<entry>
<id>tag:search.twitter.com,2005:204706537694953474</id>
<published>2012-05-21T22:53:36Z</published>
<link type="text/html" href="http://twitter.com/iamdanw_links/statuses/20470" rel="alternate"/>
<title>Photo: DATAMINING (via Business Insider) http://t.co/dKwnbnLD</title>
<content type="html">Photo: <em>DATAMINING</em>
(via A Tour Of Cenovus’ Energy’s In-Situ Christina Lake Facility - Business Insider)
<a href="http://t.co/dKwnbnLD">http://t.co/dKwnbnLD</a></content>
<updated>2012-05-21T22:53:36Z</updated>
<link type="image/png" href="http://a0.twimg.com/sticky/profile_4_normal.png" rel="image"/>
<twitter:geo/>
<twitter:metadata>
<twitter:result_type>recent</twitter:result_type>
</twitter:metadata>
<twitter:source><a href="http://www.tumblr.com/" rel="nofollow">Tumblr</a></twitter:source>
<twitter:lang>en</twitter:lang>
<author>
<name>iamdanw_links (Robot Dan)</name>
<uri>http://twitter.com/iamdanw_links</uri>
</author>
</entry>
We can distinguish the main XML tags:
<id>: tweet id
<published>: date and time of publication
<link>: links (to the tweet and to the user's profile image)
<title>: the text content of the tweet
<content>: the tweet content in HTML format
<twitter:lang>: language of the tweet
<name>: user name
Let's see an example with tweets containing the word "datamining"
First we collect some tweets
# load XML
library(XML)
# define base twitter search url (following the atom standard)
twitter_url = "http://search.twitter.com/search.atom?"
# create twitter search query to be parsed
twitter_search = paste(twitter_url, "q=datamining", sep="")
# let's parse with xmlParseDoc
tweets = xmlParseDoc(twitter_search, asText=FALSE)
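Before extracting anything, it can be useful to take a quick look at the parsed document, for instance checking the root node and how many children it has; a small sketch:
# get the root node of the parsed document
doc_root = xmlRoot(tweets)
# name of the root node (it should be 'feed' for an atom document)
xmlName(doc_root)
# number of child nodes hanging from the root
xmlSize(doc_root)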
In order to extract the information contained in those tags, we can use the xpathSApply function.
If you check the help documentation associated with this function, you'll see that one of its arguments is path. This argument takes the XPath expression to be evaluated.
For instance, the titles can be extracted with the following command:
# extracting titles
titles = xpathSApply(tweets, "//s:entry/s:title", xmlValue,
  namespaces = c('s'='http://www.w3.org/2005/Atom'))
In this case, the path is the character string "//s:entry/s:title". With this expression we are telling R to get the XML value contained in the tag <title> ... </title>, which in turn is contained within the tags <entry> ... </entry>. The namespaces argument maps the prefix s to the Atom Syndication Format namespace, an XML vocabulary used for describing web feeds.
The dates and times of publication can be extracted with
# extracting dates and times of publication
pub_dates = xpathSApply(tweets, "//s:entry/s:published", xmlValue,
  namespaces = c('s'='http://www.w3.org/2005/Atom'))
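Since xmlValue returns the publication dates as character strings (e.g. "2012-05-21T22:53:36Z"), we may want to convert them into proper date-time objects. A minimal sketch of that conversion:
# convert the ISO 8601 strings into POSIXct objects (UTC)
pub_dates = as.POSIXct(pub_dates, format="%Y-%m-%dT%H:%M:%SZ", tz="UTC")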
And the user names can be extracted like this
# extracting user names
authors = xpathSApply(tweets, "//s:entry/s:author/s:name", xmlValue,
  namespaces = c('s'='http://www.w3.org/2005/Atom'))
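Finally, just like with twListToDF, we can gather the extracted fields into a single data frame for further processing; this is a sketch using the objects created above (the name tweets_xml_df is just illustrative):
# combine titles, publication dates, and user names into a data frame
tweets_xml_df = data.frame(title=titles, published=pub_dates, author=authors,
  stringsAsFactors=FALSE)
# take a peek at the first rows
head(tweets_xml_df)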
© Gaston Sanchez - 2012