Photo by Markus Spiske on Unsplash
Welcome to the course Web Scraping and Text Analysis in Bilingual Social Media.
This course is designed for students, researchers, and professionals trained in the humanities or social sciences who want to analyze the narrative of, and reactions to, posts on public Facebook pages, such as those of newspapers, organizations, or famous people. In designing these lessons I also had in mind people who have no experience with technological tools, as well as those who might feel some resistance to learning more complex skills, such as programming, that take longer to acquire. The course is intended as an easier way to accomplish a complete task, and as an introduction for those who are preparing to learn more.
For this reason, this basic course covers three main lessons: Web Scraping Using Facepager; Cleaning Data Using Word, Notepad and RStudio; and Text Analysis Using RStudio. In lesson one we will become familiar with the Facepager software and use it to extract information from Facebook, so that by the end of the lesson we will have a csv document with data from the Facebook page of an association that writes in Spanish, English and Spanglish. In lesson two we will go back to the csv document, extract the text messages from it, and start cleaning them and preparing them for analysis. We will use very simple tools, such as Word and Notepad, and we will reflect on the difficulties that can arise when working with bilingual texts. In this lesson we will also become familiar with RStudio so that, in lesson three, we can finish cleaning and preparing the text for analysis. In lesson three we will perform a word frequency analysis, produce bar and wordle graphs of the most frequent words, and finally analyze the association between the most frequent words and other words in the text to learn about their context.
For this course, we will use the Facebook page of the organization Otros Dreams en Acción (Other Dreams in Action), because it publishes in both Spanish and English, and occasionally in Spanglish. In most of the activities we will use a database that we will create during class. You will also have access to files (see the Required Data section) that contain additional material for each lesson. In lesson one we will create a .csv file, which is the product of that lesson. In lesson two we will convert our documents to .doc and .txt formats and create a .r file. Finally, in lesson three we will continue working on the .r file, as well as on other .txt and .r files that contain cleaned data and the code needed to clean and prepare the text and to perform the textual analysis.
The course will be practical. I will explain each step, and we will all practice it together until we have achieved the product of each lesson.
This course is for beginners, so the activities will be very straightforward. For the coding in RStudio, the code will already be written, so we will only go through a brief explanation of it. In other words, the lessons are not meant to teach coding in RStudio, which would take more than one lesson, but to help you become familiar with its interface. This will later allow students to move forward and make their own attempts to extract data, pre-process it, and perform a text analysis based on their interests.
This notebook contains much of the information that we will cover during the live sessions, so you are welcome to read it and complement it with the class videos, which will be recorded.
Web Scraping Using Facepager
This lesson is intended to teach you how to fetch data from Facebook and export it to your computer as a csv file. For that purpose we will get to know some functions of the Facepager software and practice, step by step, the extraction of texts and images from public pages.
Cleaning Data Using Word, Notepad and RStudio
This lesson is intended to teach you the basic tools for cleaning data and some components of the RStudio interface that will also help us complete the data pre-processing. We will also reflect on, and propose strategies for, cleaning bilingual data.
Text Analysis Using RStudio
This lesson is intended to teach you some basic RStudio concepts, uses, and commands that can be used to prepare the text and run the analysis of the corpus. We will perform a word frequency analysis and then graph our findings. We will also look for the words associated with the most frequent concepts.
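To give a rough idea of what a word frequency analysis involves, here is a minimal sketch. It is written in Python purely for illustration; in class we will work in RStudio with code that is already prepared, and this is not that code. The sample posts and the function name `word_frequencies` are invented for this example.

```python
import re
from collections import Counter

# Hypothetical sample posts, invented for illustration only.
posts = [
    "Bienvenidos a la comunidad. Welcome to the community!",
    "La comunidad is growing. Gracias a todos.",
]


def word_frequencies(texts):
    """Count how often each lowercase word appears across all texts."""
    counts = Counter()
    for text in texts:
        # \w+ matches runs of letters and digits, including accented letters.
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


frequencies = word_frequencies(posts)

# Show the three most frequent words with their counts.
print(frequencies.most_common(3))
```

Notice that even in this tiny example the most frequent words include function words such as "a" and "la". This is one reason why cleaning the text comes before the analysis, and why bilingual text needs extra care: common words have to be handled in both languages.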
This notebook is free for educational reuse under a Creative Commons CC BY License.
Created by Rubria Rocha de Luna for the 2022 Text Analysis Pedagogy Institute, with support from the National Endowment for the Humanities, JSTOR Labs, and University of Arizona Libraries.
For questions/comments/improvements, email: rubria@gmail.com