Cleaning Data Using Word, Notepad and RStudio
This is lesson 2 of 3 in the educational series on Web Scraping and Text Analysis in Bilingual Social Media. This lesson teaches you the basic tools to clean data, such as Word and Notepad, and introduces some components of the RStudio IDE: the interface quadrants, how to open and quit a session, how to set a working directory, and how to create, open, save and close a file, so you are ready to clean, prepare and perform the text analysis. We will also reflect on and propose strategies for cleaning bilingual data.
Audience: Learners
Use case: Tutorial (Learning-oriented)
A carefully constructed example that takes the user by the hand through a series of steps to learn how a process works. Tutorials often use "toy" (or at least carefully constrained) examples that give reliable, accurate, and repeatable results every time.
(https://constellate.org/docs/documentation-categories)
Difficulty: Beginner
Beginner assumes users are new to Facepager and RStudio. The user is helped step by step with explanatory text and examples. If you do not know how or where to begin web scraping and you have no experience cleaning data or coding for text analysis, this is a course for you. You will find step-by-step instructions and the simple code you need to run a text analysis based on word frequencies.
Completion time: 90 minutes
Knowledge Required:
* Familiarity with a word processor (Word) and Notepad.
Knowledge Recommended:
Reflections on Text Analysis for Non-English Texts:
Dombrowski, Q. “Preparing Non-English Texts for Computational Analysis.” Modern Languages Open, 2020(1): 45, pp. 1–9.
https://www.modernlanguagesopen.org/articles/10.3828/mlo.v0i0.294/
R and RStudio generalities:
The R Project for Statistical Computing https://www.r-project.org/about.html
R Studio Website. About: https://www.rstudio.com/about/
Learning Objectives: After this lesson, learners will be able to:
1. Manipulate and clean a bilingual text using Word and Notepad.
2. Reflect and develop diverse strategies to pre-process texts in English, Spanish and Spanglish.
3. Describe the generalities of the RStudio interface.
4. Open and quit a session, set a working directory, and create, open, save and close a file in RStudio.
This lesson is very important because in any text analysis project you will spend most of your time cleaning and pre-processing your data, for one reason: so that the computer can understand and process it. If you don't clean the data, your analysis may not run, or it may run but give you invalid results. There is another important reason to do this work: cleaning your data gives you a deeper understanding of what the extracted data contains, and it will make you think of other research questions and lines of inquiry that you can follow later.
For cleaning and pre-processing data, you can take advantage of many resources. Some tools cost money and others are free; some are easy to learn, while others take time and effort. Some tools clean the data very fast, while others are more manual and can take a while.
In this lesson we will use Word and Notepad, tools that many people already know. They are easy to use, although the cleaning task may take a little longer. In return, working in Word and Notepad lets you get to know your data well and reflect on it.
For the pre-processing, we will use the RStudio IDE, a user-friendly interface with a script editor that helps us write commands easily. The version we will use is free, and it is used by many scholars for statistical analysis.
In this lesson we will go over the RStudio IDE and explain the most important and useful tools for a very basic analysis, so that everyone can complete the text analysis of a corpus of posts extracted from Facebook. In lesson 2 we won't complete the pre-processing, but we will become familiar with the interface and learn how to open and quit a session, how to set a working directory, and how to create, open, save and close a file. That will prepare you for lesson 3, where you will finish pre-processing the data, perform a word frequency analysis, plot a couple of graphs, and obtain the words most closely associated with the most frequent words of the corpus.
For this lesson you need to have your "tapiwebscraping" folder on your desktop. We will use the CSV file we got from lesson 1, and we will create other files there: one .doc file, one .txt file, and two .r files.
This lesson is divided into three parts: 1) Introduction to Pre-processing texts, which covers the most important concepts related to the task and some reflections on how analyzing data in other languages, particularly Spanish, differs. 2) Let's begin to clean the text, where we will go over some cleaning tasks in Word and Notepad: we will replace accent marks, typos and tildes and reflect on the particularities of bilingual data. 3) Getting to know RStudio, where we will go over the parts of the RStudio interface and some of its main functions.
That said, in this lesson we will not go over pre-processing in RStudio, but we will in lesson 3.
R and RStudio for performing text analysis
Installation instructions for R
Installation instructions for RStudio
You will need the CSV file (example.csv) we created in lesson 1.
Data Description:
This lesson uses the CSV table (.csv file) with the posts collected from the Otros Dreams Facebook page. We will also create other files to save in the "tapiwebscraping" folder: one .doc file, one .txt file, and two .r files. Finally, we will use a cleaned .txt file created from posts collected from the same association during a different time period.
Download Required Data
You have created a folder named "tapiwebscraping" on your desktop. Download the files below in case you need to update your folder.
First steps
In this lesson we are going to learn how to clean and pre-process text so that we can get the best results in the text analysis we will perform in Lesson 3. To begin, we will take the CSV file with the extracted information from Facebook that we got from Lesson 1. We will work with the message column, which contains the text of the posts.
Select the message column and click “Copy”.
Next, open a Word document and open the Paste options menu and click on Paste Special.
Then, it will open a window with some paste options, click on “Unformatted Unicode Text”.
The text from the Excel file will be pasted into Word.
Now, we have all the text from the posts of the Facebook page in a more readable form. However, you will get many pages to analyze. In this case there are more than 200 pages, but you may get thousands of pages depending on the Facebook page you are analyzing. For that reason, it is important to use other tools to help us “read” all this information. Word, Notepad, R and RStudio will help us in this task. These tools will make the text readable to the computer.
Introduction to Pre-processing Texts
The most important pre-processing tasks for text analysis are tokenization, sentence splitting in some cases, stop-word removal, stemming and lemmatization, among others. Tokenization consists of breaking the text down into words. Sentence splitting is used to define sentence boundaries. Stop words are words that do not carry important meaning, e.g., “the”, “is” and “and”. Stemming consists of cutting words down to their root form, which is not always a real word: for example, the word “change” is reduced to “chang”. Lemmatization is a similar process in which the word is reduced to its lemma, a root that has a meaning of its own: for example, “change” stays “change”.
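To make tokenization and stop-word removal more concrete, here is a minimal sketch in base R. The example sentence and the short stop-word list are made up for illustration; they are not the lists used in lesson 3. The pattern only keeps the letters a to z, so it assumes the text has already been cleaned of accent marks.

```r
# A made-up example sentence (hypothetical, not taken from the corpus)
post <- "The dreams of the people and the work is changing"

# Tokenization: lowercase the text and split it wherever there is
# something that is not a letter from a to z
tokens <- strsplit(tolower(post), "[^a-z]+")[[1]]
tokens <- tokens[tokens != ""]   # drop empty strings, if any

# Stop-word removal with a tiny illustrative list
stop_words <- c("the", "is", "and", "of")
content_words <- tokens[!tokens %in% stop_words]

print(tokens)
print(content_words)
```

Stemming and lemmatization are left out of this sketch because they need add-on packages.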
Finally, text pre-processing is intended to divide the text into tokens for manipulation and analysis. The text we just created in Word contains:
Uppercase and lowercase letters.
Punctuation, question and exclamation marks.
Words in bold, or italics.
Special characters, such as #, *, @, -, ( ), /, among others.
Emoticons of all kinds.
Blank spaces between words.
Typos
Stop words.
Additionally, in languages other than English, we will find other elements that we will also have to clean, for example: accent marks, tildes and diaeresis.
Likewise, tokenization in other languages can make an adequate word count difficult. Stemming and lemmatization, for example, can be more complicated in agglutinative languages, whose words cannot be easily reduced to a root because they depend on several morphemes to carry meaning. Highly inflected languages can also be problematic: the root alone may not convey the basic meaning of the word, or the rest of the word may change that meaning.
The Spanish language shares similarities with English in how lexemes and morphemes separate. For example, in English we can easily separate the noun “work” from its derivations: work-er, work-ing, work-ed, etc. In Spanish, the noun “trabajo” (work) and its derivations trabaj-ador(a)(es), trabaj-ando and trabaj-ó (third person past tense) separate just as well. However, Spanish marks gender and number in word endings, which means that in a frequency count these forms are counted as different words. Depending on the type of analysis, keeping them separate may actually be convenient.
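The effect of gender and number endings on a frequency count can be seen in a small base R sketch. The word list is invented for illustration, and grouping by the first six letters is only a crude stand-in for real stemming.

```r
# Invented list of tokens (hypothetical, not from the corpus)
forms <- c("trabajador", "trabajadora", "trabajadores", "trabajador", "trabajo")

# A plain frequency count treats every ending as a different word
counts <- table(forms)
print(sort(counts, decreasing = TRUE))

# Crude grouping by the shared root "trabaj" (first six letters),
# only to show how the counts change when the forms are merged
grouped <- table(substr(forms, 1, 6))
print(grouped)
```

Whether the merged or the separate count is more useful depends on your research question.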
Let’s begin to clean the text
Cleaning texts can be a boring task and can make you lose your patience. However, some people find it relaxing and a useful way to get to know the information they will be working with, plus you can do it while listening to music. Text cleaning takes the most time in the process of text analysis, and it is essential to a successful analysis. Sometimes you will feel that the text is already clean, but when you run the analysis you will see that some typos and stop words still come up and you have to clean a little more. On the other hand, you may find that they are not relevant to your research and you don't have to get rid of them. For this reason, you will sometimes need to go back and clean up as needed until the results become more accurate.
Word
· Accent marks, tildes and diaeresis
· Emoticons
· Typos
The first thing we will do is work in Word to review the text and look for possible typos. In Spanish, many typos are due to missing accents, so, to avoid major problems, what I have done is remove all the accents, tildes and diaereses.
In your Word document go to “Home” and on the tool bar click on “Replace”
Next, in the box that opens, write the vowel á (with the accent mark). In the second box write the “a” without the accent mark. Then click on “replace all”.
That will replace all the accent marks on the letter “a” in all the words, except capital letters. When the replacement has finished, click on Yes. Place the cursor at the beginning of the Word document before you start, so that all the words or letters are replaced.
Click on “Accept”. This pop-up box says how many letters or words have been replaced.
Continue this process and do the same with Á, replacing it with “A” or “a”. Do not worry about the capital letters, since we will lowercase all of them later. Continue cleaning the text with the rest of the accented vowels (é, í, ó, ú) and their capitals (É, Í, Ó, Ú). Next, replace “ñ” with “n” or, where the meaning could become confusing, with “ni”: for example, niños becomes ninos and años becomes anios. Finally, replace the diaeresis (ü and Ü) with plain “u” and “U”.
NOTE: When you type the replacement letter in the box, make sure there is no white space before or after it. Otherwise the letter will be replaced with the new letter plus the white space, which will split the word, e.g. “allá” becomes “all a”. That makes it harder to detect the changes in the document and to undo the error.
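For comparison, the same replacements can be sketched in base R with chartr(), which swaps characters one for one. This is only an illustration of the idea, not the method used in this lesson; the two example posts are made up, and the sketch assumes your R session uses UTF-8 encoding.

```r
# Two made-up posts (hypothetical examples)
posts <- c("Allá están los niños", "El pingüino cumplió dos años")

# Special case first: the lesson's "ni" option for años
posts <- gsub("años", "anios", posts, fixed = TRUE)

# One-for-one replacement: each character in 'accented' is swapped
# for the character at the same position in 'plain'
accented <- "áéíóúÁÉÍÓÚüÜñÑ"
plain    <- "aeiouAEIOUuUnN"
cleaned  <- chartr(accented, plain, posts)

print(cleaned)
```

Because chartr() replaces single characters only, there is no risk of adding a white space by accident, but the two strings must list the characters in the same order.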
Now, if you pay attention to the text, you can see that in some words, the ones that are in bold, the accent marks did not go away. Also, if you try to change the font of these words, you will see that it is not possible. For some reason these words are in the Cambria Math font, which is used for mathematical and scientific texts, particularly to write equations. RStudio won't like these words, because they are not working as characters but as equations, so they are made of special characters that won't be readable. So let's try to get rid of their math structure and change them to regular words. Select the sentence and go to Edit. Then go to your right and click on Equations.
That will open a menu. On the bottom you will find: "Insert new equation". Click on it.
This will put the sentence in a box. Select the word again.
And right click so you can cut the sentence.
Go to Paste and then, click on "Paste special".
Select Unformatted unicode text, and click "Accept".
This will copy the unformatted text.
You can select each sentence and get rid of the math structure one at a time, but that sounds like too much work. Instead, you can select the whole page and follow the same procedure. The "Insert new equation" function will select the whole page and put it in a box; then you can cut the page and paste it again as Unformatted Unicode Text, and that's it. If you don't want to do this page by page, you can do the whole text at once, but it is going to take a while and Word will display a "Not responding" message. After a couple of minutes it will select the whole text and put it in a box; then you select it, cut it, and paste it again as Unformatted Unicode Text.
Now, let's try another thing. In RStudio we can remove the extra white spaces between words, but also, we can do that in Word.
Go to Home, then Replace. In the "Find" box type this: ( ){2,} and in the "Replace with" box type: \1 Click on the "More" button and it will open more options below. In these options, check the box that says "Use wildcards", and finally click on the "Replace all" button.
Now, the text has only one space between words.
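As mentioned above, the extra white spaces can also be removed in RStudio. A minimal base R sketch of the same find-and-replace idea, using a made-up sentence:

```r
# Made-up text with extra spaces between words (hypothetical example)
text <- "  Hola   a  todos    los dreamers "

# Replace every run of two or more spaces with a single space,
# the same idea as the ( ){2,} pattern used in Word
single_spaced <- gsub(" {2,}", " ", text)

# Also remove the spaces left at the beginning and at the end
single_spaced <- trimws(single_spaced)

print(single_spaced)
```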
Finally, to delete emoticons there are some commands in the tm library and in some new libraries in RStudio, but when I tried them they removed only some. There are so many types of emoticons that I have found it easier to delete them manually. I will continue to look for an easy way to do it and will update this notebook as soon as I can. We can follow the same replace process in Word or in Notepad, but instead of typing a letter or word, we copy the emoticon from the text into the box. We can also use the list of emoticons in “cleaning key.doc”, which you may find in the course folder, and copy them one by one. For emoticons I prefer Notepad, because it seems to replace them a little faster and because we can also delete them directly very quickly.
So, let's copy the whole text and paste it to Notepad. We can “select all” the text from Word. Or you can do it easily by going to Save as, and save it as a .txt file. That will do the same.
Click on Copy.
And open a new Notepad document and paste it.
Next, click on “View” and then on “Word wrap”. Your text will now remain inside the window only.
Then click on “Edit”, then “Replace”. Copy the emoticon into the first box and leave the second box empty. Click on “Replace all”.
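If the manual route becomes too slow, one blunt option to try later in RStudio is sketched below. It is an assumption on my part and not part of this lesson's workflow: base R's iconv() drops every character that is not plain ASCII, so it removes emoticons but would also remove any accented letters that are still in the text. Use it only after the accent marks have been replaced.

```r
# Made-up post containing one emoticon, written as a Unicode escape
post <- "hola \U0001F600 mundo"

# Convert to plain ASCII and drop whatever cannot be converted
no_emoticons <- iconv(post, from = "UTF-8", to = "ASCII", sub = "")

print(no_emoticons)
```

The gap left behind by the emoticon is two spaces wide, so the extra white space still has to be cleaned afterwards.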
Now we have cleaned a lot of the text: typos, accent marks, tildes, diaereses, and emoticons.
Look at the text. Have you seen other words that need to be normalized?
Some of these words belong to what is now called “inclusive language”, so you may see “nosotros”, “nosotras”, “nosotres” and “nosotrxs”, and you will also find the English equivalent, “we”. This means that five different words carry the same meaning. What should we do? Do we replace them with only one? What would happen if we left them as they are? If you decide to replace them, which one would you select and why?
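If you do decide to replace them, the idea can be sketched in base R as follows. The token list is invented, and the target form is only a placeholder stored in a variable so that you can change it; choosing it is exactly the decision the questions above ask you to think about.

```r
# Invented tokens (hypothetical example)
tokens <- c("nosotros", "nosotras", "nosotres", "nosotrxs", "we", "somos", "we")

# The five forms discussed above
variants <- c("nosotros", "nosotras", "nosotres", "nosotrxs", "we")

# Placeholder choice: change this to the form you decide to keep
target <- "nosotrxs"

# Replace every variant with the chosen form and leave other words alone
normalized <- ifelse(tokens %in% variants, target, tokens)

print(table(tokens))       # counts before normalizing
print(table(normalized))   # counts after normalizing
```

Comparing the two tables shows what is gained (one larger count) and what is lost (the information about which form each author chose).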
Getting to know R and RStudio
R is a free software environment for statistical computing and RStudio IDE is an interface with a set of tools designed to help you with R. It consists of a console, a code editor, a window for plotting and other functionalities and a fourth window with the information on your data and the history of actions.
R console
Open the console by double-clicking on the R icon on your desktop.
There is a greater-than symbol (>) in red. That means that we can start typing commands. We can perform a few arithmetic operations to see how it works. Type: 10 + 5 and hit “Enter”. The solution appears and the greater-than symbol appears again, which means that R is ready for the next command.
We won’t be working in R directly, but we need to have it installed since it is the brain of what we will be doing in the RStudio IDE. So, now click on “File” and then “Exit” in the drop-down menu.
Or you can type q() after the > symbol in the command line to exit. It will ask you if you want to save the workspace image; click “No”.
Now, open RStudio by double-clicking the icon in case you put it in your tool bar, or you can go to the Start button and type “RStudio” in the search box. Click on the RStudio app to open it.
Now you will see the RStudio IDE. When you open the interface, you will see three quadrants, but if you open an RStudio file, the IDE will open with a 4th quadrant that contains the script editor. Here in this screenshot, you see 3 windows.
The window on your left is the console, the same R console where we were performing some arithmetic. Look at the greater-than symbol, which is now blue. Let’s do the same arithmetic: type 10+5 and hit “Enter”.
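A few more lines you can try in the console, as a small sketch of how it behaves. The name "result" is just an example name.

```r
# Plain arithmetic: the console prints the answer right away
10 + 5

# Store the answer in an object with the arrow (<-) to reuse it later
result <- 10 + 5

# Typing the name of the object prints its value
result

# q() quits R; it is commented out here so the session stays open
# q()
```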
The window on the top right shows the history of your actions. Click on the History tab and you will see the arithmetic you typed. In the Environment tab, you will see the values and data you create while writing code. In the Connections tab you can see the connections you may have to databases, for example to a SQL database. And the Tutorial tab shows you how to run some tutorials.
On the bottom right window, in the Files tab, you will see your computer files displayed in a list. In the Plots tab you will see the resulting plot of your analysis which can be exported and saved as an image, as a pdf file or you can copy it to your clipboard and paste it where you need. The Packages tab displays the system library where you can see the default packages that are loaded at the moment. You can select other packages by clicking on Install. A small window will pop-up. There, write the name of the package you need and click “Install” (you don’t have to do this right now). The Help tab provides you with links to resources, manuals, community forums and information on diverse topics. And the Viewer Tab can be used to view local web content. For example, web graphics generated using packages like googleVis, htmlwidgets, and rCharts, (we won’t use this tab).
Now, let’s create a new script to open the fourth quadrant. Go to File, New file, and click on R Script.
This will open the fourth quadrant.
Here you can see the small tab of your script, which is still untitled. We are going to save this file by going to File, then "Save as". Write “example” in the file name box and click on “Save”. That will save the script as a .r file. Also, below the File tab, you can see the little floppy disk icon, which saves changes to the script.
In this script editor, we will be writing some commands to perform some pre-process activities and some basic text analysis.
To end a session, save the changes to your script by clicking on the little floppy disk, and save and close any other scripts you may have open. Then go to Session and click on Quit Session.
Now, we will set a working directory. So go to Session, go to Set Working Directory and click on "Choose Directory".
That will open a window where you are going to select the folder. Select the “tapiwebscraping” folder, or the one you have with the .txt file to be analyzed. Now look at the console and see how the working directory command will appear.
You can set the working directory in the console as well. Type setwd, an opening parenthesis and quotation marks, and paste the folder address between the quotation marks.
To get the folder address, open the folder and right-click on the name of the folder. Click on “Copy address” and paste it in the console. Or you can click next to the name of the folder and that will show you the complete address. Copy it and paste it in the console between the quotation marks. Put the cursor at the end of the line and run the code.
Note: If you use Windows and copy the address, you will have to change each backslash (\) to a forward slash (/), because otherwise the command won’t work.
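The steps above can be sketched like this. The folder address is a made-up example (replace "yourname" and the rest with your own address), and the sketch only changes the working directory if that folder really exists on your computer.

```r
# Made-up Windows address (hypothetical). Inside R quotation marks a
# single backslash has to be typed twice, which is one more reason
# to switch to forward slashes.
windows_path <- "C:\\Users\\yourname\\Desktop\\tapiwebscraping"

# Change every backslash into a forward slash
r_path <- gsub("\\", "/", windows_path, fixed = TRUE)
print(r_path)

# Set the working directory only if the folder exists
if (dir.exists(r_path)) {
  setwd(r_path)
}

# Show the current working directory
getwd()
```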
Now, let's start typing in the script editor.
#When you type a hash at the beginning of a line, you can write text that has no effect on the code. If
#you want to continue on a new line so that everything fits in the window, you have to type the hash again.
#Now let's load some packages that will help us perform the pre-processing and text analysis.
To load these packages, we are going to call the library function by typing “library”. As you can see, a box appears that tries to help you with information about what you need to complete the command.
After “library”, open a parenthesis. That will open a message box that shows the list of installed packages and a brief explanation of each. You can click on the package name in the drop-down menu, or you can type it directly between the parentheses in the script.
We are going to load the “tm” text mining package. After typing, run the code. As you can see after running the code, we got the message "Loading required package: RColorBrewer". So, let's load it as well. If R reports instead that there is no package with that name, the package is not installed yet: install it first from the Packages tab by clicking on Install.
Now let’s run the tm line again so the package is loaded.
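The difference between installing (downloading) a package and loading it can be sketched as follows. The helper name "is_installed" is made up for this example. The lines that download and load the packages are commented out so that nothing is installed unless you choose to run them.

```r
# Packages mentioned so far in this lesson
wanted <- c("tm", "RColorBrewer")

# Hypothetical helper: TRUE if a package is already installed.
# system.file() returns an empty string when the package is missing.
is_installed <- function(pkg) nzchar(system.file(package = pkg))

# Which of the wanted packages still have to be installed?
not_installed <- wanted[!vapply(wanted, is_installed, logical(1))]
print(not_installed)

# Installing downloads the package (needed only once per computer):
# install.packages(not_installed)

# Loading makes the package available (needed in every new session):
# library(tm)
# library(RColorBrewer)
```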
After loading the ggplot2 or dplyr packages, a message appears: Attaching package... When you see a message like this, my suggestion is to copy the text into your browser and see what others have done about it. This is what I got: “This message is a warning, letting you know that dplyr has a few objects (functions in that case) that share the same name as base-r and R's stats library." Other users then suggested some code to get rid of the warning. It seems that nothing went wrong here, so let's continue.
Now we have loaded all the packages. Let’s continue with another command. We are going to create an object called “oda_raw” that will hold the .txt file so RStudio can read it. After the name, we will type a less-than symbol and a dash (<-), one space, and “read_lines” followed by parentheses, where we are going to type the name of the .txt file in quotation marks. We won’t close the parentheses yet, so you can see how this editor helps us find and correct possible typos in the code. Look at the left side of the line of code and you will see a small red circle with an X in the middle. That means that something is wrong in the code. If you run the command like this, you will get an error message. Type the closing parenthesis and the small circle will go away.
Run the command again and the greater than symbol will appear in the console, which means that the computer understood the command and is waiting for your next instruction.
We are going to type “oda_raw” again and run it. Look what we have now. In the console you can read the text, all the posts and see the Environment tab, there you will find the information about the “oda_raw” value you have created. For the image I used a .txt with a little more than a thousand posts, so it says that it is an object of the type character with 1,059 elements (posts).
At the end of the posts, we can see the message below. It means that the console is showing only the first 1,000 posts, which is the maximum RStudio prints by default; the rest are there but not printed (not visible).
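Here is a self-contained sketch of reading a text file line by line and inspecting the result. To keep it runnable without any package or file from the course, it uses base R's readLines() instead of read_lines (which comes from the readr package), and it first writes three made-up posts to a temporary file that stands in for your .txt file.

```r
# Stand-in for the lesson's .txt file: three made-up posts
# written to a temporary file (hypothetical data)
tmp_file <- tempfile(fileext = ".txt")
writeLines(c("primer post", "second post", "tercer post"), tmp_file)

# Read the file: each line becomes one element of the object
oda_raw <- readLines(tmp_file, encoding = "UTF-8")

# Inspect the object without printing everything
length(oda_raw)    # how many posts were read
class(oda_raw)     # the type of the object
head(oda_raw, 2)   # only the first two posts

# The printing limit of the console is stored in this option
getOption("max.print")
```

With a large corpus, length() and head() are a quick way to check that all the posts were read even when the console does not print them all.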
This is it for today. In lesson 3 we will continue. Now, let’s save what we have done so far by going to File and then Save as. Save the document in the same folder “tapiwebscraping”. Next, quit session by going to Session, and then Quit Session.
Congratulations!!! You have practiced cleaning data, you got to know some important things about R and RStudio, and you were able to create a .r file with your first command to read your extracted posts in the RStudio interface. WOW, way to go, guys!
From the .csv file you got from exercise 1, copy the posts to Word or Notepad and try to clean that text the best you can using the tools we learned in this lesson. Try to remove the following:
Accent marks
Tildes
Diaeresis
Bold and italics
Question and exclamation marks
Typos
Cambria Math font
Extra white spaces between words
Emoticons
Make sure to save the final document as a .txt file. We will use it for Exercise 3.