This is lesson 3 of 3 in the educational series on Web Scraping and Text Analysis in Bilingual Social Media. This lesson is intended to teach you some basic RStudio concepts, uses, and commands so you can prepare the data for analysis and perform text analysis of a corpus. We will go over basic syntax and command structure by reviewing commands for text pre-processing, text analysis, and plotting graphs.
Audience: Learners
Use case: Tutorial (Learning-oriented)
A carefully constructed example that takes the user by the hand through a series of steps to learn how a process works. Tutorials often use "toy" (or at least carefully constrained) examples that give reliable, accurate, and repeatable results every time.
(https://constellate.org/docs/documentation-categories)
Difficulty: Beginner
Beginner assumes users are new to Facepager and RStudio. The user is helped step by step with explanatory text and examples. If you do not know how or where to begin web scraping, and you have no experience cleaning data or coding for text analysis, this course is for you. You will find step-by-step instructions and the simple code you need to run a text analysis based on word frequencies.
Completion time: 90 minutes
Knowledge Required:
Generalities of the RStudio interface
Be able to open and quit a session, set a working directory, and create, open, save and close a file in RStudio.
Knowledge Recommended:
By reviewing a list of packages and what they do, you can get a better idea of what can be done in RStudio.
* Packages in RStudio
https://support.rstudio.com/hc/en-us/articles/201057987-Quick-list-of-useful-R-packages
Other books to read and practice:
Jockers, Matthew. Text Analysis with R for Students of Literature. Springer, 2014.
Arnold, Taylor, and Lauren Tilton. Humanities Data in R: Exploring Networks, Geospatial Data, Images, and Text. Springer, 2015.
Learning Objectives: After this lesson, learners will be able to:
1. Describe some concepts of R syntax.
2. Read basic lines of R code.
3. Create objects, apply functions and arguments for cleaning, pre-processing, analysis and plotting of the extracted text from Facebook.
4. Be familiar with the kind of information they will have to look for in future text analysis projects.
5. Export graphs as images or as .pdf files.
Introduction
Text analysis is the process of deriving meaning, facts, relationships, and assertions from text and other written communications, such as the ones we find in social media (posts, comments, etc.) or in product reviews and user feedback. Text analysis consists of a set of techniques that help us understand and analyse large amounts of unstructured data. These techniques support word frequency counts, categorization, clustering, pattern recognition, and visualization. Text analysis is used in marketing and advertising, but also in academic research from sociological, psychological, linguistic, rhetorical, and many other approaches. However, text analysis should not be learned only by specialists but by anybody, since it is a new way of reading: not only books in a corpus, but also the reality that surrounds us in its many forms of communication.
Learning text analysis allows us to approach corpora of texts that we could not read naturally. With the help of technology, we can gain a greater knowledge and a broader perspective of the content of a text, or, as Matthew Jockers says in his book Text Analysis with R for Students of Literature (2014), many times it is just a matter of testing and verifying some hypotheses we have about a corpus. In my experience, more than verifying, it has been discovering many other significant aspects of the narrative that are not seen with the naked eye.
RStudio is a very helpful environment that allows us to perform text analysis from pre-processing through the actual analysis to the plotting of graphs. There are also packages that can help us begin with the web scraping process to extract data from websites. RStudio is used by people from very diverse backgrounds, because R is designed for statistical analysis. In this lesson, our focus is on learning basic word frequency analysis of text extracted from an association whose purpose is to help and build community among migrants returned to Mexico. So, the center of this lesson is a set of strategies we can improvise to analyse text that is bilingual and bicultural. For this reason, this lesson is divided into the following sections:
1) Introduction, where we are going to see the basic structure of a command.
2) Pre-processing. In this section we will learn some commands to prepare the text for analysis.
3) Analysis of the corpus, where we will make some decisions, do more cleaning, and build a table of word frequencies.
4) Frequency Graphs. In this part of the lesson, we will learn some commands to plot two bar charts.
5) Correlation between words. Finally, we will learn a command to see the correlation between the most frequent words and the rest of the corpus.
That said, in this lesson we will not go over learning to code in general, but we will learn the basic R command structure and some commands used for pre-processing, analysis, and plotting graphs, so that we can read what we are doing in every step of the code.
R and RStudio for performing text analysis
Installation instructions for R
Installation instructions for RStudio
You will need some of the files from the "tapiwebscraping" folder.
Data Description:
This lesson uses the .txt file and the .r file we created at the end of lesson 2. Also, we will take a look at some of the files in that folder and create a new .r file to perform the text analysis and to plot a graph with a cleaned .txt file.
Download Required Data
You have created a folder named "tapiwebscraping" on your desktop. Download the files below in case you need to update your folder.
1) Introduction
library(tm)            #text mining: corpus creation, cleaning and term-document matrices
library(NLP)           #natural language processing infrastructure used by tm
library(SnowballC)     #stemming algorithms
library(wordcloud)     #word cloud plots
library(RColorBrewer)  #color palettes for the plots
library(ggplot2)       #bar charts
library(dplyr)         #data manipulation: pipes (%>%), mutate and select
library(readr)         #reading text files (read_lines)
library(cluster)       #clustering methods
#In the console you will see the loading messages of the packages we installed in lesson 2.
#Syntax
#In R, the most basic command follows this order:
#Object <- Function("Argument 1", "Argument 2", "Argument 3")
#The <- symbol is the assignment operator (the "assigner"): it stores the result of the function in the object.
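#As a minimal sketch of this structure (the object name "saludo" and the text inside paste are made up just for illustration):
saludo <- paste("Hola", "hello", sep = " ")   #Object <- Function("Argument 1", "Argument 2", argument 3)
saludo                                        #typing the object name and running the line prints: [1] "Hola hello"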
#Now we are going to create an object. For this lesson we are going to use the "odaclean.txt" file, which contains the text we have been working with since lesson 1. Then, we will use the read_lines function from the "readr" library.
oda_raw <- read_lines("odaclean.txt")
# Now, let's type the name of the object and run it.
oda_raw
#In the console you will see the complete text because this .txt file contains only 103 posts.
#The oda_raw object is a character "chr" type with 103 elements.
#The "string" function str() in R Language is used for compactly displaying the internal structure of the object.
str(oda_raw)
#In the console you will see:
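#For reference, this is what str() reports for a small made-up character vector of two elements:
str(c("hola", "hello"))   #prints: chr [1:2] "hola" "hello"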
#Creating paragraphs
#Since posts in social media (and particularly in our text) are short, we are going to group them into paragraphs. We will create a vector called diez ("ten") containing ten repetitions (rep) of each number from 1 to the number of lines in the document divided by 10, that is, ceiling(length(oda_raw)/10).
#By doing this, we will have a vector that assigns the posts to groups of 10 lines each, up to the end of the 103 posts.
#diez <- rep(1:ceiling(length(oda_raw)/10), each = 10)
#The rep call produces a few more group labels than there are posts (because 103 is not a multiple of 10), so we trim the vector to the exact number of lines in oda_raw.
#diez <- diez[1:length(oda_raw)]
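#As a toy illustration of how rep() and the trimming work (the numbers are made up and much smaller than our corpus):
rep(1:3, each = 2)        #returns 1 1 2 2 3 3
rep(1:3, each = 2)[1:5]   #trimmed to five elements: 1 1 2 2 3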
#Now, we will create a new object (oda_text) and assign these groups of 10 posts to it. We will have one column with the lines of text and another one with the number that identifies each group.
#We will also make it a data.frame, so that the columns get names that will be helpful in the following steps.
#We use aggregate to concatenate the lines (posts) of each group (FUN = paste, with collapse = " ") so that the white space between words is preserved.
#After aggregating, we keep only the column with the paragraphs, one paragraph per group.
#We will also transform oda_text into a matrix, since this will help us in the next steps.
oda_text <-
  cbind(
    rep(1:ceiling(length(oda_raw)/10), each = 10) %>%
      .[1:length(oda_raw)],
    oda_raw
  ) %>%
  data.frame %>%
  aggregate(
    oda_raw ~ V1,
    data = .,
    FUN = paste,
    collapse = " ") %>%
  select(oda_raw) %>%
  as.matrix
dim(oda_text)
#In the console you will see the dimensions of oda_text,
#meaning that we have 11 groups (paragraphs) in a single column.
#Type:
oda_text
#In the console you will see the divided paragraphs.
2) Pre-processing
#Now, we are going to continue with the pre-processing by removing special characters using a regular expression (regex). Regex will help us identify special characters such as line breaks and tabs. We will use the regex [[:cntrl:]].
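#Before applying it to our text, here is a quick toy illustration with a made-up string ("\t" is a tab and "\n" a line break, both matched by [[:cntrl:]]):
ejemplo <- "Hola\tmundo\nhello"
gsub("[[:cntrl:]]", " ", ejemplo)   #returns "Hola mundo hello"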
oda_text <- gsub("[[:cntrl:]]", " ", oda_text)
oda_text
#We need to get rid of some of the words such as: "http", "https", "bitly"... are there any more words to remove?
oda_text <- removeWords(oda_text, words = c("http", "https", "bitly"))
#Next, let's transform all the text to lower case letters.
oda_text <- tolower(oda_text)
oda_text
#Let's use removeWords with stopwords("spanish") to remove the words of this language that carry very little useful information. Examples of these words are prepositions and fillers. Then, we will do the same for the English stop words. Remember that these posts are written in both languages.
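#If you are curious about what these built-in stop word lists contain, you can peek at their first entries before removing them:
head(stopwords("spanish"))
head(stopwords("english"))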
oda_text <- removeWords(oda_text, words = stopwords("spanish"))
oda_text <- removeWords(oda_text, words = stopwords("english"))
oda_text
#Then, we will remove punctuation.
oda_text <- removePunctuation(oda_text)
oda_text
#In this case, we will remove numbers, since we are not interested in dates or any other numerical information.
oda_text <- removeNumbers(oda_text)
oda_text
#Next, we will remove the extra white spaces between words. Some of them were produced after the changes we made.
oda_text <- stripWhitespace(oda_text)
oda_text
#In the console you will see a cleaner text:
3) Analysis of the Corpus
#Now that the text is prepared for the analysis, we will create a corpus made up of all the paragraphs. This corpus will be assigned to a new object called oda_corpus. We will use the VectorSource and VCorpus functions for this purpose.
oda_corpus <- VCorpus(VectorSource(oda_text))
oda_corpus
#In the console you will see:
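#If you want to look inside one of the documents (paragraphs) of the corpus, tm's inspect function shows its content (here the first one):
inspect(oda_corpus[[1]])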
#WordCloud
#Next, we will map our corpus to plain text documents. We will use the tm_map and PlainTextDocument functions for this purpose, and create a new object (oda_ptd) to store the result.
oda_ptd <- tm_map(oda_corpus, PlainTextDocument)
#Now, we will be able to create a word cloud using the library of this same name. This word cloud will have the most frequent words of the corpus.
wordcloud(oda_ptd, max.words = 80, random.order = F, colors = brewer.pal(name = "Dark2", n = 8))
#In the plot window you will see:
# Do you see any more words that need to be removed?
# If your answer is yes, we can proceed to a second round for cleaning using the removeWords function.
# This function requires the vector of characters that we used earlier (oda_text).
oda_text <- removeWords(oda_text, words = c("zwgmj"))
#Once we have cleaned oda_text for the second time, we have to use this new version of oda_text to generate a new corpus and map it again.
oda_corpus <- oda_text %>% VectorSource() %>% VCorpus()
oda_ptd <- oda_corpus %>% tm_map(PlainTextDocument)
#Next, we will create a new word cloud, that may be different.
wordcloud(oda_ptd, max.words = 30, random.order = F, colors=brewer.pal(name = "Dark2", n = 8))
#In the plot window you will see:
#Term Document Matrix
#Now, we will represent the corpus as a matrix, so that in future steps we can identify correlations between words.
#The Term Document Matrix lists all occurrences of the words in the corpus, document by document. In this matrix, the terms (words) appear in rows and the documents in columns. Each cell holds the number of times a word occurs in a particular document: if the word does not appear in the document the entry is "0", if it appears once the entry is "1", if it appears twice the entry is "2", and so on.
# We will use the function TermDocumentMatrix in the corpus and assign the result to a new object called: oda_tdm.
oda_tdm <- TermDocumentMatrix(oda_corpus)
oda_tdm
#In the console you will see:
#According to the resulting information we have 1408 terms in the 11 documents (paragraphs). This means that there are 1408 unique words in the corpus.
#Non-/sparse entries: 2517/12971
#This means that 2517 cells of the matrix contain a count greater than zero (the word appears in that document), while the remaining 12971 cells are zeros (the word does not appear in that document). Since 1408 terms x 11 documents = 15488 cells in total, 12971/15488 is about 84%, so 84% of the matrix entries are zeros.
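#If you want to peek at a small corner of the matrix, inspect can display a slice of it (the row and column ranges below are arbitrary):
inspect(oda_tdm[1:5, 1:5])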
#Word frequency
#Now, it is time to quantify the frequency of the words. We need to transform the oda_tdm object into a matrix object. This object will have the unique words as rows and the documents as columns.
oda_mat <- as.matrix(oda_tdm)
dim(oda_mat)
#In the console you will see:
#Now, we will get the sums of the rows (rowSums), ordered from highest to lowest with sort(decreasing = TRUE), to learn the frequency of each word. After that, we will transform the results into a new object of class data.frame, so that we have a table with two columns: word (palabra) and frequency (frec).
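#As a toy illustration of rowSums with a small made-up matrix:
rowSums(matrix(1:4, nrow = 2))   #returns 4 6, the sum of each of the two rows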
oda_mat <- oda_mat %>% rowSums() %>% sort(decreasing = TRUE)
oda_mat <- data.frame(palabra = names(oda_mat), frec = oda_mat)
#With the matrix object we can also create a word cloud.
wordcloud(
words = oda_mat$palabra,
freq = oda_mat$frec,
max.words = 70,
random.order = F,
colors=brewer.pal(name = "Dark2", n = 8))
#In the plot window you will see:
#Next, let's type:
oda_mat[1:20, ]
#In the console you will see the table of frequencies:
4) Frequency Graphs
We will create a bar chart with the frequency counts of the words. For that purpose, we will use ggplot2. ggplot2 has its own functions that we will not cover here, but it is important to mention that we are using oda_mat for this purpose.
oda_mat[1:20, ] %>%
ggplot(aes(palabra, frec)) +
geom_bar(stat = "identity", color = "black", fill = "#87CEFA") +
geom_text(aes(hjust = 1.3, label = frec)) +
coord_flip() +
labs(title = "Twenty most frequent words", x = "Words", y = "Number of frequencies")
#In the plot window you will see:
#Now, instead of using the number of frequencies, we are going to use percentage of use. For this purpose we will use the dplyr library.
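#A minimal sketch of what mutate does, using made-up numbers:
data.frame(frec = c(3, 1)) %>%
  mutate(perc = (frec / sum(frec)) * 100)   #adds a perc column with the values 75 and 25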
oda_mat %>%
mutate(perc = (frec/sum(frec))*100) %>%
.[1:20, ] %>%
ggplot(aes(palabra, perc)) +
geom_bar(stat = "identity", color = "black", fill = "#F5B041") +
geom_text(aes(hjust = 1.3, label = round(perc, 2))) +
coord_flip() +
labs(title = "Ten most frequent words", x = "Words", y = "Percentage")
#In the plot window you will see:
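#To export the chart you just plotted (see learning objective 5), ggplot2's ggsave saves the most recent ggplot chart to disk. The file names and sizes below are only examples; you can also use the Export button in the RStudio Plots pane, which also works for the word clouds.
ggsave("frequent_words.png", width = 8, height = 6)
ggsave("frequent_words.pdf", width = 8, height = 6)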
5) Correlation between words
We are going to use a vector, so we will be able to introduce more than one word in the command.
The function we will use is findAssocs. First we will use only one word: in this case, let's find the correlation of "join"
with other words from the text.
The corlimit argument sets the lowest correlation to report. A correlation near zero means that the words do not co-occur; on the contrary, a value close to 1 means that the terms appear together consistently.
Let's use .7 and see what happens:
findAssocs(oda_tdm, terms = c("join"), corlimit = .7)
#In the console you will see:
#As you can see, there are 9 words that highly correlate with "join". You may change the corlimit to see other correlations. You can also look for associations of the other most frequent words, use a different corlimit, and see what happens. Choose the corlimit according to your own research interests.
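#As mentioned above, findAssocs also accepts a vector with several terms at once. The terms below are only placeholders; replace them with frequent words from your own frequency table:
findAssocs(oda_tdm, terms = c("join", "community"), corlimit = .5)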
Congratulations!!! You have completed a basic text analysis and you have created word clouds and graphics to plot the data from extracted text. YAY! You are ready to work by yourself and extract data of your interest, clean it and analyze it. Best of luck!
In case you have not created a new folder on your desktop to save the files from Exercises 1 and 2, please do so.
Open in RStudio the .txt file you saved from Exercise 2. Don't forget to set the working directory to the folder where this .txt file is located.
Pre-processing
Following the commands we saw in Lesson 3, complete the pre-processing of the text. Pay special attention to the resulting text and decide if you need to remove other words that are not in the Stop words lists.
Text Analysis
Ask for the 10 most frequent words
Graph
Make a word cloud and a bar chart of the most frequent words.
Questions:
What words did you remove and why?
What was your resulting list of most frequent words?
In what part of the code did you change the number of most frequent words to get the list?
What words did you remove and why?
R = "facebook", "http", "https", "story_fbid", "story.php".
After seeing the table of the 10 most frequent words, we also need to get rid of some words that, as we learned, are part of hashtags, so we drop one of the words of each hashtag. For example, we can remove these: "new", "concha", "born", "tambien", "mcomstoryphpstoryfbidid".
What was your resulting list of most frequent words?
R = It still needs to be cleaned. Perhaps splitting the long token "mcomstoryphpstory..." and removing each section would help. Also, it would be better to work with a list of 6 or 7 most frequent words, since the corpus is very small.
palabra frec
comienzos comienzos 42
community community 41
israel israel 41
mexico mexico 40
ready ready 40
suenos suenos 40
sabias sabias 17
ano ano 12
mcomstoryphpstoryfbidid mcomstoryphpstoryfbidid 5
bien bien 2
In what part of the code did you change the number of most frequent words?
R = exe_mat[1:10, ]
Here you can review the code with some changes that make it a little more accurate. The corpus is very small and the organization makes heavy use of hashtags; extracting more posts might be more effective.
exe_raw <- read_lines("exerciseclean.txt")
exe_raw
str(exe_raw)
diez <- rep(1:ceiling(length(exe_raw)/10), each = 10)
diez <- diez[1:length(exe_raw)]
exe_text <-
  cbind(
    rep(1:ceiling(length(exe_raw)/10), each = 10) %>%
      .[1:length(exe_raw)],
    exe_raw
  ) %>%
  data.frame %>%
  aggregate(
    exe_raw ~ V1,
    data = .,
    FUN = paste,
    collapse = " ") %>%
  select(exe_raw) %>%
  as.matrix
dim(exe_text)
exe_text
exe_text <- gsub("[[:cntrl:]]", " ", exe_text)
exe_text
exe_text <- tolower(exe_text)
exe_text
exe_text <- removeWords(exe_text, words = stopwords("spanish"))
exe_text <- removeWords(exe_text, words = stopwords("english"))
exe_text
exe_text <- removeWords(exe_text, words = c("facebook", "http", "https"))
exe_text
exe_text <- removeWords(exe_text, words = c("story_fbid", "story.php"))
exe_text
#After seeing the table of the 10 most frequent words, we need to get rid of the following words, since we learned that
#they come from hashtags; we keep the other word of each hashtag.
exe_text <- removeWords(exe_text, words = c("new", "concha", "born", "tambien", "mcomstoryphpstoryfbidid"))
exe_text
exe_text <- removePunctuation(exe_text)
exe_text
exe_text <- removeNumbers(exe_text)
exe_text
exe_text <- stripWhitespace(exe_text)
exe_text
exe_corpus <- VCorpus(VectorSource(exe_text))
exe_corpus
exe_ptd <- tm_map(exe_corpus, PlainTextDocument)
wordcloud(exe_ptd, max.words = 80, random.order = F, colors = brewer.pal(name = "Dark2", n = 8))
#Do you still need to remove more words?
#exe_text <- removeWords(exe_text, words = c("facebook", "http", "https"))
exe_corpus <- exe_text %>% VectorSource() %>% VCorpus()
exe_ptd <- exe_corpus %>% tm_map(PlainTextDocument)
wordcloud(exe_ptd, max.words = 80, random.order = F, colors=brewer.pal(name = "Dark2", n = 8))
exe_tdm <- TermDocumentMatrix(exe_corpus)
exe_tdm
#Frequency of words
exe_mat <- as.matrix(exe_tdm)
dim(exe_mat)
exe_mat <- exe_mat %>% rowSums() %>% sort(decreasing = TRUE)
exe_mat <- data.frame(palabra = names(exe_mat), frec = exe_mat)
wordcloud(
words = exe_mat$palabra,
freq = exe_mat$frec,
max.words = 70,
random.order = F,
colors=brewer.pal(name = "Dark2", n = 8))
#Table of Frequencies
exe_mat[1:10, ]
#There were many occurrences of "new" and "comienzos", and we know they come from a hashtag.
#Maybe we can remove one of the words of each hashtag so we get a more accurate list.
#exe_text <- removeWords(exe_text, words = c("new", "concha", "born", "tambien", "mcomstoryphpstoryfbidid"))
#For some reason "mcomstoryphpstoryfbidid" could not be removed; maybe we can split that word in Notepad
#and run the code again.
#Graphics of Frequencies
#Bar charts Number of frequencies
exe_mat[1:10, ] %>%
ggplot(aes(palabra, frec)) +
geom_bar(stat = "identity", color = "black", fill = "#87CEFA") +
geom_text(aes(hjust = 1.3, label = frec)) +
coord_flip() +
labs(title = "Ten most frequent words", x = "Words", y = "Number of frequencies")
#Bar chart percentage of use
exe_mat %>%
mutate(perc = (frec/sum(frec))*100) %>%
.[1:10, ] %>%
ggplot(aes(palabra, perc)) +
geom_bar(stat = "identity", color = "black", fill = "#F5B041") +
geom_text(aes(hjust = 1.3, label = round(perc, 2))) +
coord_flip() +
labs(title = "Ten most frequent words", x = "Palabras", y = "Porcentaje de uso")