Text Mining with tm
R package tm (by Ingo Feinerer)
The main package for performing text mining tasks in R is tm.
Do yourself a favor and check its documentation and vignettes:
Text Mining Infrastructure in R
Introduction to the tm Package: Text Mining in R
Extensions: How to Handle Custom File Formats
Lexical Corpus
The main structure for managing documents in tm is the so-called Corpus, which represents a collection of text documents. If your textual data is in a vector object, as it usually will be when extracting information from Twitter, you can create a corpus like this:
mycorpus = Corpus(VectorSource(object))
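As a minimal sketch of the step above, the following builds a corpus from a small character vector and inspects it. The vector `tweets` and its contents are made up for illustration; only `Corpus`, `VectorSource`, and `content` come from tm.

```r
library(tm)

# Hypothetical sample data: in practice this would be the text
# extracted from Twitter or another source
tweets <- c("Text mining in R is fun!",
            "The tm package   handles corpora.",
            "R has 2 main matrix types.")

# Each element of the vector becomes one document in the corpus
mycorpus <- Corpus(VectorSource(tweets))

# Number of documents in the corpus
n_docs <- length(mycorpus)
print(n_docs)

# Text of the first document
first_doc <- content(mycorpus[[1]])
print(first_doc)
```

Each document can be accessed with double brackets, much like an element of a list.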
Transformations
Once we have a corpus, we typically want to modify the documents in it by doing some stemming, stopword removal, etc. These tasks can be performed in tm with the so-called transformations, applied via the tm_map function:
stripWhitespace: eliminate extra whitespace
mycorpus1 = tm_map(mycorpus, stripWhitespace)
tolower: convert text to lower case
mycorpus2 = tm_map(mycorpus, content_transformer(tolower))
removeWords: remove words like stopwords
mycorpus3 = tm_map(mycorpus, removeWords, stopwords("english"))
removePunctuation: remove punctuation symbols
mycorpus4 = tm_map(mycorpus, removePunctuation)
removeNumbers: remove numbers
mycorpus5 = tm_map(mycorpus, removeNumbers)
Apply various transformations at the same time
tm_map(x,
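One way to apply several transformations is to chain tm_map calls, each working on the result of the previous one. The sketch below assumes a made-up sample sentence and wraps the base R function `tolower` in `content_transformer`, which recent tm versions expect for functions that are not tm transformations. The order of the steps is an illustrative choice, not necessarily the one the original example intended.

```r
library(tm)

# Hypothetical sample text with mixed case, punctuation, a number,
# stopwords, and extra spaces
docs <- c("  Text Mining in R,   with 2 packages!  ")
mycorpus <- Corpus(VectorSource(docs))

# Apply the transformations one after another
clean <- tm_map(mycorpus, content_transformer(tolower))   # lower case
clean <- tm_map(clean, removePunctuation)                 # drop punctuation
clean <- tm_map(clean, removeNumbers)                     # drop digits
clean <- tm_map(clean, removeWords, stopwords("english")) # drop stopwords
clean <- tm_map(clean, stripWhitespace)                   # collapse runs of spaces

# Text of the cleaned document
result <- content(clean[[1]])
print(result)
```

Converting to lower case before removing stopwords matters here, because the stopword list returned by `stopwords("english")` is in lower case.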
Term-Document Matrices
A common approach in text mining is to create a term-document matrix from a corpus with the use of the functions:
TermDocumentMatrix: creates a matrix with terms as rows and documents as columns
DocumentTermMatrix: creates a matrix with documents as rows and terms as columns
These two types of matrices are the meat and potatoes of most text analysis in R, because they are the input to which we apply classification, cluster analysis, association analysis, and so on.
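To make the two layouts concrete, here is a minimal sketch that builds both matrices from a tiny corpus. The three sample documents are made up for illustration. Note that with the default settings tm drops terms shorter than three characters, so the one-letter term "r" does not appear in the matrices.

```r
library(tm)

# Hypothetical sample documents
docs <- c("text mining with r",
          "mining text data",
          "r for data analysis")
mycorpus <- Corpus(VectorSource(docs))

# Terms as rows, documents as columns
tdm <- TermDocumentMatrix(mycorpus)

# Documents as rows, terms as columns
dtm <- DocumentTermMatrix(mycorpus)

# The two matrices have transposed dimensions
print(dim(tdm))
print(dim(dtm))

# Convert to an ordinary matrix to look at the raw counts
m <- as.matrix(tdm)
print(m)

# Total number of occurrences of each term across all documents
term_totals <- rowSums(m)
print(sort(term_totals, decreasing = TRUE))
```

Both objects are stored as sparse matrices, so converting with `as.matrix` is only advisable for small corpora like this one.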
© Gaston Sanchez