
3 text corpus


How to create a text corpus
Now it's time to do some text analysis with the help of the R package tm. We already used the function stopwords from this package when we did the pre-processing and cleaning of the titles, but now we are going to use other functions.
Let's assume that the clean titles are in the object (i.e. vector) titles.clean. If you haven't cleaned your titles, please check the preprocessing phase.

Step 1
The first step is to create what is called a corpus with the titles. To do so we'll use the function Corpus. In addition, since our titles are in the vector titles.clean, we also need to use the function VectorSource to indicate that our data is in vector format.
# load the tm package and create the corpus
library(tm)
title.corpus = Corpus(VectorSource(titles.clean))
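If you want to see what Corpus does before running it on your own titles, here is a minimal sketch on a toy vector. The names titles.toy and toy.corpus, and the three titles themselves, are made up for illustration; they are not part of the original data.

```r
library(tm)

# hypothetical stand-in for titles.clean: three short, already cleaned titles
titles.toy = c("text mining with data",
               "mining social media data",
               "introduction text analysis")

# each element of the vector becomes one document in the corpus
toy.corpus = Corpus(VectorSource(titles.toy))

# number of documents in the corpus (one per title)
length(toy.corpus)

# text of the first document
as.character(toy.corpus[[1]])
```

The number of documents should match the length of the input vector, which is a quick sanity check worth doing on title.corpus as well.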

Step 2
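Once we have created the corpus, the next step is to create a term-document matrix with the function TermDocumentMatrix. In this matrix each row is a term and each column is a document (here, a title), and each cell counts how many times the term occurs in the document.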
# create a term document matrix
tdm = TermDocumentMatrix(title.corpus)
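To get a feeling for the shape of a term-document matrix, the sketch below builds one from a small made-up corpus and looks at its terms and dimensions. The objects titles.toy, toy.corpus and toy.tdm are hypothetical names used only for this example.

```r
library(tm)

# hypothetical toy titles, assumed to be already cleaned
titles.toy = c("text mining with data",
               "mining social media data",
               "introduction text analysis")
toy.corpus = Corpus(VectorSource(titles.toy))

# rows are terms, columns are documents
toy.tdm = TermDocumentMatrix(toy.corpus)

# the terms found in the corpus
Terms(toy.tdm)

# number of terms and number of documents
nTerms(toy.tdm)
nDocs(toy.tdm)

# print a summary and a sample of the counts
inspect(toy.tdm)
```

The same calls (Terms, nTerms, nDocs, inspect) can be applied to tdm to check that the matrix looks reasonable before moving on.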

Step 3
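The third step is to remove sparse terms from the term-document matrix, that is, terms that occur in very few documents. The second argument of removeSparseTerms is the maximum allowed sparsity: a term is removed when the proportion of documents in which it occurs 0 times is at least that value. With 0.995, only terms that appear in more than 0.5% of the titles are kept.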
# remove sparse terms
tdm = removeSparseTerms(tdm, 0.995)
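The effect of the sparsity threshold is easier to see on a tiny example. The sketch below uses made-up titles and a threshold of 0.5 (chosen only because the toy corpus has three documents; it is not a recommendation for real data), so that only terms occurring in at least two of the three titles survive.

```r
library(tm)

# hypothetical toy titles, assumed to be already cleaned
titles.toy = c("text mining with data",
               "mining social media data",
               "introduction text analysis")
toy.tdm = TermDocumentMatrix(Corpus(VectorSource(titles.toy)))

# number of terms before removing sparse terms
nTerms(toy.tdm)

# keep only terms that occur in more than half of the documents
toy.small = removeSparseTerms(toy.tdm, 0.5)

# terms that survive: those present in at least two of the three titles
Terms(toy.small)
```

Comparing nTerms before and after on your own tdm is a simple way to judge whether 0.995 is too strict or too lenient for your titles.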

Step 4
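Although tdm is a matrix in the mathematical sense (terms by documents), it is not an R object of class matrix, because tm stores it in a sparse format. The final step is to convert tdm into an ordinary matrix. Keep in mind that the resulting dense matrix can take a lot of memory when there are many terms and titles.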
# define tdm as matrix
m = as.matrix(tdm)
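Once the counts are in an ordinary matrix, the usual matrix functions apply. As one possible illustration (not necessarily what the later steps of this tutorial do), the sketch below converts a toy term-document matrix and sums each row to get the total frequency of every term. The names toy.tdm, toy.m and toy.freq are hypothetical.

```r
library(tm)

# hypothetical toy titles, assumed to be already cleaned
titles.toy = c("text mining with data",
               "mining social media data",
               "introduction text analysis")
toy.tdm = TermDocumentMatrix(Corpus(VectorSource(titles.toy)))

# convert the sparse object into an ordinary R matrix
toy.m = as.matrix(toy.tdm)

# rows are terms, columns are documents
dim(toy.m)

# total frequency of each term across all titles, most frequent first
toy.freq = sort(rowSums(toy.m), decreasing = TRUE)
toy.freq
```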

