3 text mining

Text Mining with tm

R package tm (by Ingo Feinerer)

The main package for performing text mining tasks in R is tm.

Do yourself a favor and check its documentation and vignettes:

Introduction to the tm Package: Text Mining in R

Extensions: How to Handle Custom File Formats

Lexical Corpus

The main structure for managing documents in tm is the so-called Corpus, which represents a collection of text documents. If your textual data is in a character vector, which it usually will be when extracting information from Twitter, you can create a corpus like this:

mycorpus = Corpus(VectorSource(object))
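To make the call above concrete, here is a minimal sketch that builds a corpus from a small character vector and inspects it. The vector name tweets and its contents are made up for illustration; they stand in for whatever text you have extracted.

```r
library(tm)

# Hypothetical sample data standing in for text pulled from Twitter
tweets <- c("Text mining with R is fun!",
            "The tm package   handles corpora.",
            "There are 2 kinds of matrices to build later.")

# VectorSource treats each element of the vector as one document
mycorpus <- Corpus(VectorSource(tweets))

# Number of documents in the corpus
n_docs <- length(mycorpus)
print(n_docs)

# Text of the first document
first_doc <- as.character(mycorpus[[1]])
print(first_doc)
```

Each element of the vector becomes one document, so the corpus should have as many documents as the vector has elements.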

Transformations

Once we have a corpus, we typically want to modify the documents in it, for example by stemming, removing stopwords, and so on. In tm these tasks are performed with so-called transformations, which are applied via the tm_map function. Some common ones:

stripWhitespace: eliminate extra white-spaces

mycorpus1 = tm_map(mycorpus, stripWhitespace)

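tolower: convert text to lower case (tolower is a base R function rather than a tm transformation, so recent versions of tm expect it to be wrapped in content_transformer)

mycorpus2 = tm_map(mycorpus, content_transformer(tolower))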

removeWords: remove words like stopwords

mycorpus3 = tm_map(mycorpus, removeWords, stopwords("english"))

removePunctuation: remove punctuation symbols

mycorpus4 = tm_map(mycorpus, removePunctuation)

removeNumbers: remove numbers

mycorpus5 = tm_map(mycorpus, removeNumbers)

Apply various transformations at the same time

tm_map(x,
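One way to combine several transformations is to chain tm_map calls, each one working on the result of the previous one. The sketch below is an assumption about how this can be done, not necessarily the call the author had in mind. The sample sentence is made up, and VCorpus is used instead of Corpus as an illustrative choice so that every document is stored as a full text document object.

```r
library(tm)

# Hypothetical one-document corpus for illustration
docs <- c("The  QUICK brown fox, aged 3, jumps over the lazy dog!")
corpus <- VCorpus(VectorSource(docs))

# Apply the transformations one after another
clean <- tm_map(corpus, content_transformer(tolower))     # lower case
clean <- tm_map(clean, removePunctuation)                 # drop punctuation
clean <- tm_map(clean, removeNumbers)                     # drop digits
clean <- tm_map(clean, removeWords, stopwords("english")) # drop stopwords
clean <- tm_map(clean, stripWhitespace)                   # collapse spaces

# Text of the cleaned document
out <- as.character(clean[[1]])
print(out)
```

The order matters: lower-casing comes before removeWords because the English stopword list is in lower case, and stripWhitespace comes last to collapse the gaps left by the removed words.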

Term-Document Matrices

A common approach in text mining is to create a term-document matrix from a corpus, using one of these functions:

TermDocumentMatrix: creates a matrix with terms as rows and documents as columns

DocumentTermMatrix: creates a matrix with documents as rows and terms as columns

Either of these matrices is the meat and potatoes of most text analyses in R, because it is the input to which we apply classification, cluster analysis, association analysis, and so on.
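As a minimal sketch of the two functions, the example below builds both matrices from a tiny made-up corpus and looks at the term counts. The documents and variable names are illustrative assumptions. VCorpus is used here, but a corpus created with Corpus should work the same way.

```r
library(tm)

# Hypothetical three-document corpus for illustration
docs <- c("text mining tools",
          "mining text data",
          "simple text analysis")
corpus <- VCorpus(VectorSource(docs))

tdm <- TermDocumentMatrix(corpus)   # terms as rows, documents as columns
dtm <- DocumentTermMatrix(corpus)   # documents as rows, terms as columns

# The two matrices are transposes of each other
print(dim(tdm))
print(dim(dtm))

# Convert to an ordinary matrix to look at the counts for one term
m <- as.matrix(dtm)
text_counts <- m[, "text"]
print(text_counts)

# Terms that occur at least twice across the corpus
frequent <- findFreqTerms(dtm, lowfreq = 2)
print(frequent)
```

The matrices are stored in a sparse format, so as.matrix is best kept for small corpora like this one. For larger collections, helpers such as findFreqTerms work on the sparse object directly.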
