This is lesson 3 of 3 in the educational series on Web Scraping and Text Analysis in Bilingual Social Media. This lesson is intended to teach you some basic RStudio concepts, uses, and commands so you can prepare the data for analysis and perform text analysis of a corpus. We will go over basic syntax and command structure by reviewing commands for text pre-processing, text analysis, and plotting graphs.
Audience: Learners
Use case: Tutorial (Learning-oriented)
A carefully constructed example that takes the user by the hand through a series of steps to learn how a process works. Tutorials often use "toy" (or at least carefully constrained) examples that give reliable, accurate, and repeatable results every time.
(https://constellate.org/docs/documentation-categories)
Difficulty: Beginner
Beginner assumes users are new to Facepager and RStudio. The user is helped step by step with explanatory text and examples. If you do not know how or where to begin web scraping, and you have no experience cleaning data or coding for text analysis, this course is for you. You will find step-by-step instructions and the simple code you need to run a text analysis based on word frequencies.
Completion time: 90 minutes
Knowledge Required:
Generalities of the RStudio interface
Be able to open and quit a session, set a working directory, and create, open, save and close a file in RStudio.
Knowledge Recommended:
By reviewing a list of packages and what they do, you can get a better idea of what can be done in RStudio.
* Packages in RStudio
https://support.rstudio.com/hc/en-us/articles/201057987-Quick-list-of-useful-R-packages
Other books to read and practice:
Jockers, Matthew. Text Analysis with R for Students of Literature. Springer, 2014.
Arnold, Taylor, and Lauren Tilton. Humanities Data in R: Exploring Networks, Geospatial Data, Images, and Text. Springer, 2015.
Learning Objectives: After this lesson, learners will be able to:
1. Describe some concepts of R syntax.
2. Read basic lines of R code.
3. Create objects, apply functions and arguments for cleaning, pre-processing, analysis and plotting of the extracted text from Facebook.
4. Be familiar with the kind of information they will have to look for in future text analysis projects.
5. Export graphs as images or as .pdf files.
Introduction
Text analysis is the process of deriving meaning, facts, relationships, and assertions from text and other written communications, such as the ones we find in social media (posts, comments, etc.) or in product reviews and user feedback. Text analysis consists of a set of techniques that help us understand and analyse large amounts of unstructured data. These techniques support word frequency counts, categorization, clustering, pattern recognition, and visualization. Text analysis is used in marketing and advertising, but also in academic research from sociological, psychological, linguistic, rhetorical, and many other approaches. However, text analysis should not be learned only by specialists but by anybody, since it is a new way of reading: not only books in a corpus, but also the reality that surrounds us in its many forms of communication.
Learning text analysis allows us to approach corpora of texts that we could not read naturally. With the help of technology, we can gain a greater knowledge and a broader perspective of the content of a text, or, as Matthew Jockers says in his book Text Analysis with R for Students of Literature (2014), many times it is just a matter of testing and verifying some hypotheses we have about a corpus. In my experience, more than verifying, it has been discovering many other significant aspects of the narrative that are not seen with the naked eye.
RStudio is a very helpful environment that allows us to perform text analysis from pre-processing through the actual analysis to the plotting of graphs. There are also packages that can help us begin with the web scraping process to extract data from websites. RStudio is used by people from very diverse backgrounds, because R is designed for statistical analysis. In this lesson, our focus is on learning basic word frequency analysis of text extracted from an association whose purpose is to help and build community among migrants returned to Mexico. So, the center of this lesson is a set of strategies we can improvise to analyse text that is bilingual and bicultural. For this reason, this lesson is divided into the following sections:
1) Introduction, where we are going to see the basic structure of a command.
2) Pre-processing. In this section we will learn some commands to prepare the text for analysis.
3) Analysis of the corpus, where we will make some decisions, do more cleaning, and build a table of word frequencies.
4) Frequency Graphs. In this part of the lesson, we will learn some commands to plot two bar charts.
5) Correlation between words. Finally, we will learn a command to see the correlation between the most frequent words and the rest of the corpus.
That said, in this lesson we will not go over learning to code in general, but we will learn the basic R command structure and some commands used for pre-processing, analysis, and plotting graphs, so that we can read what we are doing in every step of the code.
R and RStudio for performing text analysis
Installation instructions for R
Installation instructions for RStudio
You will need some of the files from the "tapiwebscraping" folder.
Data Description:
This lesson uses the .txt file and the .r file we created at the end of lesson 2. Also, we will take a look at some of the files in that folder and create a new .r file to perform the text analysis and to plot a graph with a cleaned .txt file.
Download Required Data
You have created a folder named "tapiwebscraping" on your desktop. Download the files below in case you need to update your folder.
1) Introduction
library(tm)            #text mining: corpus creation, cleaning and term-document matrices
library(NLP)           #natural language processing infrastructure used by tm
library(SnowballC)     #stemming algorithms
library(wordcloud)     #word cloud plots
library(RColorBrewer)  #color palettes for the plots
library(ggplot2)       #bar charts
library(dplyr)         #data manipulation: pipes (%>%), mutate and select
library(readr)         #reading text files (read_lines)
library(cluster)       #clustering methods
#In the console you will see the loading messages of the packages we installed in lesson 2.
#Syntax
#In R, the most basic command follows this order:
#Object <- Function("Argument 1", "Argument 2", "Argument 3")
#The <- symbol is the assignment operator (the "assigner"): it stores the result of the function in the object.
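#As a minimal sketch of this structure (the object name "saludo" and the text inside paste are made up just for illustration):
saludo <- paste("Hola", "hello", sep = " ")   #Object <- Function("Argument 1", "Argument 2", argument 3)
saludo                                        #typing the object name and running the line prints: [1] "Hola hello"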
#Now we are going to create an object. For this lesson we are going to use the "odaclean.txt" file, which contains the text we have been working with since lesson 1. Then, we will use the read_lines function from the "readr" library.
oda_raw <- read_lines("odaclean.txt")
# Now, let's type the name of the object and run it.
oda_raw
#In the console you will see the complete text because this .txt file contains only 103 posts.
#The oda_raw object is a character "chr" type with 103 elements.
#The "string" function str() in R Language is used for compactly displaying the internal structure of the object.
str(oda_raw)
#In the console you will see:
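#For reference, this is what str() reports for a small made-up character vector of two elements:
str(c("hola", "hello"))   #prints: chr [1:2] "hola" "hello"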
#Creating paragraphs
#Since posts in social media (and particularly in our text) are short, we are going to group them into paragraphs. We will create a vector called diez ("ten") containing ten repetitions (rep) of each number from 1 to the number of lines in the document divided by 10, that is, ceiling(length(oda_raw)/10).
#By doing this, we will have a vector that assigns the posts to groups of 10 lines each, up to the end of the 103 posts.
#diez <- rep(1:ceiling(length(oda_raw)/10), each = 10)
#The rep call produces a few more group labels than there are posts (because 103 is not a multiple of 10), so we trim the vector to the exact number of lines in oda_raw.
#diez <- diez[1:length(oda_raw)]
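#As a toy illustration of how rep() and the trimming work (the numbers are made up and much smaller than our corpus):
rep(1:3, each = 2)        #returns 1 1 2 2 3 3
rep(1:3, each = 2)[1:5]   #trimmed to five elements: 1 1 2 2 3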
#Now, we will create a new object (oda_text) and assign these groups of 10 posts to it. We will have one column with the lines of text and another one with the number that identifies each group.
#We will also make it a data.frame, so that the columns get names that will be helpful in the following steps.
#We use aggregate to concatenate the lines (posts) of each group (FUN = paste, with collapse = " ") so that the white space between words is preserved.
#After aggregating, we keep only the column with the paragraphs, one paragraph per group.
#We will also transform oda_text into a matrix, since this will help us in the next steps.
oda_text <-
  cbind(
    rep(1:ceiling(length(oda_raw)/10), each = 10) %>%
      .[1:length(oda_raw)],
    oda_raw
  ) %>%
  data.frame %>%
  aggregate(
    oda_raw ~ V1,
    data = .,
    FUN = paste,
    collapse = " ") %>%
  select(oda_raw) %>%
  as.matrix
dim(oda_text)
#In the console you will see the dimensions of oda_text,
#meaning that we have 11 groups (paragraphs) in a single column.
#Type:
oda_text
#In the console you will see the divided paragraphs.
2) Pre-processing
#Now, we are going to continue with the pre-processing by removing special characters using a regular expression (regex). Regex will help us identify special characters such as line breaks and tabs. We will use the regex [[:cntrl:]].
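#Before applying it to our text, here is a quick toy illustration with a made-up string ("\t" is a tab and "\n" a line break, both matched by [[:cntrl:]]):
ejemplo <- "Hola\tmundo\nhello"
gsub("[[:cntrl:]]", " ", ejemplo)   #returns "Hola mundo hello"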
oda_text <- gsub("[[:cntrl:]]", " ", oda_text)
oda_text
#We need to get rid of some of the words such as: "http", "https", "bitly"... are there any more words to remove?
oda_text <- removeWords(oda_text, words = c("http", "https", "bitly"))
#Next, let's transform all the text to lower case letters.
oda_text <- tolower(oda_text)
oda_text
#Let's use removeWords with stopwords("spanish") to remove the words of this language that carry very little useful information. Examples of these words are prepositions and fillers. Then, we will do the same for the English stop words. Remember that these posts are written in both languages.
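#If you are curious about what these built-in stop word lists contain, you can peek at their first entries before removing them:
head(stopwords("spanish"))
head(stopwords("english"))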
oda_text <- removeWords(oda_text, words = stopwords("spanish"))
oda_text <- removeWords(oda_text, words = stopwords("english"))
oda_text
#Then, we will remove punctuation.
oda_text <- removePunctuation(oda_text)
oda_text
#In this case, we will remove numbers, since we are not interested in dates or any other numerical information.
oda_text <- removeNumbers(oda_text)
oda_text
#Next, we will remove the extra white spaces between words. Some of them were produced after the changes we made.
oda_text <- stripWhitespace(oda_text)
oda_text
#In the console you will see a cleaner text:
3) Analysis of the Corpus
#Now that the text is prepared for the analysis, we will create a corpus made up of all the paragraphs. This corpus will be assigned to a new object called oda_corpus. We will use the VectorSource and VCorpus functions for this purpose.
oda_corpus <- VCorpus(VectorSource(oda_text))
oda_corpus
#In the console you will see:
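#If you want to look inside one of the documents (paragraphs) of the corpus, tm's inspect function shows its content (here the first one):
inspect(oda_corpus[[1]])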
#WordCloud
#Next, we will map our corpus to plain text documents. We will use the tm_map and PlainTextDocument functions for this purpose, and create a new object (oda_ptd) to store the result.
oda_ptd <- tm_map(oda_corpus, PlainTextDocument)
#Now, we will be able to create a word cloud using the library of this same name. This word cloud will have the most frequent words of the corpus.
wordcloud(oda_ptd, max.words = 80, random.order = F, colors = brewer.pal(name = "Dark2", n = 8))
#In the plot window you will see:
# Do you see any more words that need to be removed?
# If your answer is yes, we can proceed to a second round for cleaning using the removeWords function.
# This function requires the vector of characters that we used earlier (oda_text).
oda_text <- removeWords(oda_text, words = c("zwgmj"))
#Once we have cleaned oda_text for the second time, we have to use this new version of oda_text to generate a new corpus and map it again.
oda_corpus <- oda_text %>% VectorSource() %>% VCorpus()
oda_ptd <- oda_corpus %>% tm_map(PlainTextDocument)
#Next, we will create a new word cloud, that may be different.
wordcloud(oda_ptd, max.words = 30, random.order = F, colors=brewer.pal(name = "Dark2", n = 8))
#In the plot window you will see:
#Term Document Matrix
#Now, we will represent the corpus as a matrix, so that in future steps we can identify correlations between words.
#The Term Document Matrix lists all occurrences of the words in the corpus, document by document. In this matrix, the terms (words) appear in rows and the documents in columns. Each cell holds the number of times a word occurs in a particular document: if the word does not appear in the document the entry is "0", if it appears once the entry is "1", if it appears twice the entry is "2", and so on.
# We will use the function TermDocumentMatrix in the corpus and assign the result to a new object called: oda_tdm.
oda_tdm <- TermDocumentMatrix(oda_corpus)
oda_tdm
#In the console you will see:
#According to the resulting information we have 1408 terms in the 11 documents (paragraphs). This means that there are 1408 unique words in the corpus.
#Non-/sparse entries: 2517/12971
#This means that 2517 cells of the matrix contain a count greater than zero (the word appears in that document), while the remaining 12971 cells are zeros (the word does not appear in that document). Since 1408 terms x 11 documents = 15488 cells in total, 12971/15488 is about 84%, so 84% of the matrix entries are zeros.
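#If you want to peek at a small corner of the matrix, inspect can display a slice of it (the row and column ranges below are arbitrary):
inspect(oda_tdm[1:5, 1:5])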
#Word frequency
#Now, it is time to quantify the frequency of the words. We need to transform the oda_tdm object into a matrix object. This object will have the unique words as rows and the documents as columns.
oda_mat <- as.matrix(oda_tdm)
dim(oda_mat)
#In the console you will see:
#Now, we will get the sums of the rows (rowSums), ordered from highest to lowest with sort(decreasing = TRUE), to learn the frequency of each word. After that, we will transform the results into a new object of class data.frame, so that we have a table with two columns: word (palabra) and frequency (frec).
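#As a toy illustration of rowSums with a small made-up matrix:
rowSums(matrix(1:4, nrow = 2))   #returns 4 6, the sum of each of the two rows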
oda_mat <- oda_mat %>% rowSums() %>% sort(decreasing = TRUE)
oda_mat <- data.frame(palabra = names(oda_mat), frec = oda_mat)
#With the matrix object we can also create a word cloud.
wordcloud(
words = oda_mat$palabra,
freq = oda_mat$frec,
max.words = 70,
random.order = F,
colors=brewer.pal(name = "Dark2", n = 8))
#In the plot window you will see:
#Next, let's type:
oda_mat[1:20, ]
#In the console you will see the table of frequencies:
4) Frequency Graphs
We will create a bar chart with the frequency counts of the words. For that purpose, we will use ggplot2. ggplot2 has its own functions that we will not cover here, but it is important to mention that we are using oda_mat for this purpose.
oda_mat[1:20, ] %>%
ggplot(aes(palabra, frec)) +
geom_bar(stat = "identity", color = "black", fill = "#87CEFA") +
geom_text(aes(hjust = 1.3, label = frec)) +
coord_flip() +
labs(title = "Twenty most frequent words", x = "Words", y = "Number of frequencies")
#In the plot window you will see:
#Now, instead of using the number of frequencies, we are going to use percentage of use. For this purpose we will use the dplyr library.
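#A minimal sketch of what mutate does, using made-up numbers:
data.frame(frec = c(3, 1)) %>%
  mutate(perc = (frec / sum(frec)) * 100)   #adds a perc column with the values 75 and 25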
oda_mat %>%
mutate(perc = (frec/sum(frec))*100) %>%
.[1:20, ] %>%
ggplot(aes(palabra, perc)) +
geom_bar(stat = "identity", color = "black", fill = "#F5B041") +
geom_text(aes(hjust = 1.3, label = round(perc, 2))) +
coord_flip() +
labs(title = "Ten most frequent words", x = "Words", y = "Percentage")
#In the plot window you will see:
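#To export the chart you just plotted (see learning objective 5), ggplot2's ggsave saves the most recent ggplot chart to disk. The file names and sizes below are only examples; you can also use the Export button in the RStudio Plots pane, which also works for the word clouds.
ggsave("frequent_words.png", width = 8, height = 6)
ggsave("frequent_words.pdf", width = 8, height = 6)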
5) Correlation between words
We are going to use a vector, so we will be able to introduce more than one word in the command.
The function we will use is findAssocs. First we will use only one word: in this case, let's find the correlation of "join"
with other words from the text.
The corlimit argument sets the lowest correlation to report. A correlation near zero means that the words do not co-occur; on the contrary, a value close to 1 means that the terms appear together consistently.
Let's use .7 and see what happens:
findAssocs(oda_tdm, terms = c("join"), corlimit = .7)
#In the console you will see:
#As you can see, there are 9 words that highly correlate with "join". You may change the corlimit to see other correlations. You can also look for associations of the other most frequent words, use a different corlimit, and see what happens. Choose the corlimit according to your own research interests.
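#As mentioned above, findAssocs also accepts a vector with several terms at once. The terms below are only placeholders; replace them with frequent words from your own frequency table:
findAssocs(oda_tdm, terms = c("join", "community"), corlimit = .5)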
Congratulations!!! You have completed a basic text analysis and you have created word clouds and graphics to plot the data from extracted text. YAY! You are ready to work by yourself and extract data of your interest, clean it and analyze it. Best of luck!
In case you have not created a new folder on your desktop to save the files from Exercises 1 and 2, please do so.
Open in RStudio the .txt file you saved from Exercise 2. Don't forget to set the working directory to the folder where this .txt file is located.
Pre-processing
Following the commands we saw in Lesson 3, complete the pre-processing of the text. Pay special attention to the resulting text and decide if you need to remove other words that are not in the Stop words lists.
Text Analysis
Ask for the 10 most frequent words
Graph
Make a word cloud and a bar chart of the most frequent words.
Questions:
What words did you remove and why?
What was your resulting list of most frequent words?
In what part of the code did you change the number of most frequent words to get the list?
What words did you remove and why?
R = "facebook", "http", "https", "story_fbid", "story.php".
After seeing the table of the 10 most frequent words, we also need to get rid of some words that, as we learned, are part of hashtags, so we drop one of the words of each hashtag. For example, we can remove these: "new", "concha", "born", "tambien", "mcomstoryphpstoryfbidid".
What was your resulting list of most frequent words?
R = It still needs to be cleaned. Perhaps splitting the long token "mcomstoryphpstory..." and removing each section would help. Also, it would be better to work with a list of 6 or 7 most frequent words, since the corpus is very small.
palabra frec
comienzos comienzos 42
community community 41
israel israel 41
mexico mexico 40
ready ready 40
suenos suenos 40
sabias sabias 17
ano ano 12
mcomstoryphpstoryfbidid mcomstoryphpstoryfbidid 5
bien bien 2
In what part of the code did you change the number of most frequent words?
R = exe_mat[1:10, ]
Here you can review the code with some changes that make it a little more accurate. The corpus is very small and the organization makes heavy use of hashtags; extracting more posts might be more effective.
exe_raw <- read_lines("exerciseclean.txt")
exe_raw
str(exe_raw)
diez <- rep(1:ceiling(length(exe_raw)/10), each = 10)
diez <- diez[1:length(exe_raw)]
exe_text <-
  cbind(
    rep(1:ceiling(length(exe_raw)/10), each = 10) %>%
      .[1:length(exe_raw)],
    exe_raw
  ) %>%
  data.frame %>%
  aggregate(
    exe_raw ~ V1,
    data = .,
    FUN = paste,
    collapse = " ") %>%
  select(exe_raw) %>%
  as.matrix
dim(exe_text)
exe_text
exe_text <- gsub("[[:cntrl:]]", " ", exe_text)
exe_text
exe_text <- tolower(exe_text)
exe_text
exe_text <- removeWords(exe_text, words = stopwords("spanish"))
exe_text <- removeWords(exe_text, words = stopwords("english"))
exe_text
exe_text <- removeWords(exe_text, words = c("facebook", "http", "https"))
exe_text
exe_text <- removeWords(exe_text, words = c("story_fbid", "story.php"))
exe_text
#After seeing the table of the 10 most frequent words, we need to get rid of the following words, since we learned that
#they come from hashtags; we keep the other word of each hashtag.
exe_text <- removeWords(exe_text, words = c("new", "concha", "born", "tambien", "mcomstoryphpstoryfbidid"))
exe_text
exe_text <- removePunctuation(exe_text)
exe_text
exe_text <- removeNumbers(exe_text)
exe_text
exe_text <- stripWhitespace(exe_text)
exe_text
exe_corpus <- VCorpus(VectorSource(exe_text))
exe_corpus
exe_ptd <- tm_map(exe_corpus, PlainTextDocument)
wordcloud(exe_ptd, max.words = 80, random.order = F, colors = brewer.pal(name = "Dark2", n = 8))
#Do you still need to remove more words?
#exe_text <- removeWords(exe_text, words = c("facebook", "http", "https"))
exe_corpus <- exe_text %>% VectorSource() %>% VCorpus()
exe_ptd <- exe_corpus %>% tm_map(PlainTextDocument)
wordcloud(exe_ptd, max.words = 80, random.order = F, colors=brewer.pal(name = "Dark2", n = 8))
exe_tdm <- TermDocumentMatrix(exe_corpus)
exe_tdm
#Frequency of words
exe_mat <- as.matrix(exe_tdm)
dim(exe_mat)
exe_mat <- exe_mat %>% rowSums() %>% sort(decreasing = TRUE)
exe_mat <- data.frame(palabra = names(exe_mat), frec = exe_mat)
wordcloud(
words = exe_mat$palabra,
freq = exe_mat$frec,
max.words = 70,
random.order = F,
colors=brewer.pal(name = "Dark2", n = 8))
#Table of Frequencies
exe_mat[1:10, ]
#There were many occurrences of "new" and "comienzos", and we know they come from a hashtag.
#Maybe we can remove one of the words of each hashtag so we get a more accurate list.
#exe_text <- removeWords(exe_text, words = c("new", "concha", "born", "tambien", "mcomstoryphpstoryfbidid"))
#For some reason "mcomstoryphpstoryfbidid" could not be removed; maybe we can split that word in Notepad
#and run the code again.
#Graphics of Frequencies
#Bar charts Number of frequencies
exe_mat[1:10, ] %>%
ggplot(aes(palabra, frec)) +
geom_bar(stat = "identity", color = "black", fill = "#87CEFA") +
geom_text(aes(hjust = 1.3, label = frec)) +
coord_flip() +
labs(title = "Ten most frequent words", x = "Words", y = "Number of frequencies")
#Bar chart percentage of use
exe_mat %>%
mutate(perc = (frec/sum(frec))*100) %>%
.[1:10, ] %>%
ggplot(aes(palabra, perc)) +
geom_bar(stat = "identity", color = "black", fill = "#F5B041") +
geom_text(aes(hjust = 1.3, label = round(perc, 2))) +
coord_flip() +
labs(title = "Ten most frequent words", x = "Palabras", y = "Porcentaje de uso")