Lesson 1

Web Scraping Using Facepager

This is lesson 1 of 3 in the educational series on Web Scraping and Text Analysis in Bilingual Social Media. This lesson teaches you how to fetch data from Facebook and export it to your computer as a .csv file. To do so, we will get to know some functions of the Facepager software and practice, step by step, the extraction of texts and images from public pages.

Audience: Learners

Use case: Tutorial (Learning-oriented)

A carefully constructed example that takes the user by the hand through a series of steps to learn how a process works. Tutorials often use "toy" (or at least carefully constrained) examples that give reliable, accurate, and repeatable results every time.

(https://constellate.org/docs/documentation-categories)

Difficulty: Beginner

Beginner assumes users are new to Facepager and RStudio. The user is helped step by step with explanatory text and examples. If you do not know how or where to begin web scraping, and you have no experience in cleaning data or in coding for text analysis, this is a course for you. You will find step-by-step instructions and the simple code you need to run a text analysis based on word frequencies.

Completion time: 90 minutes

Knowledge Required:

* You should know basic concepts related to Facebook pages, such as: posts, comments, replies, date of publication, etc.

* Note: you must have a Facebook account so that you can fetch data in Facepager.

Knowledge Recommended:

· About Facepager:

Till Keyling’s Blog September 2015 http://tillkeyling.com/facepager-what-it-is-what-its-not.html

· About APIs:

FreeCodeCamp December 2019 https://www.freecodecamp.org/news/what-is-an-api-in-english-please-b880a3214a82/

Learning Objectives: After this lesson, learners will be able to:

1. Extract text posts, comments, replies and images from Facebook using Facepager software.

2. Export data in a csv format.

3. Be familiar with the use of the Facepager interface for the extraction of data from other social networks.

Introduction

This lesson is intended for people who are new to extracting data from web pages, particularly from Facebook. It is for those with a background in the humanities or social studies who do not have experience in using technology to approach texts from social media. Many research questions in these areas of knowledge involve content that cannot be easily analyzed because it first needs to be extracted from web pages. This course shows how, without any programming knowledge, you can extract data from Facebook quickly, easily and flexibly in a very friendly interface (Facepager).

By becoming familiar with the Facepager interface, it is possible to obtain the necessary information so that researchers may study the narrative, rhetoric or discourse that is presented on Facebook pages of newspapers, organizations, certain political figures or celebrities. In addition, researchers can perform social or psychological research by extracting comments, responses to comments, images and many more types of data with Facepager.

Web scraping is used in many fields, including media, business, politics and social psychology, and above all in advertising and marketing. Facepager gives the user a lot of freedom to define the parameters of the information they need, but it has its limitations. For this reason, its developers define Facepager as a tool for scientific and research use rather than for advertising and marketing.

To learn web scraping with Facepager, you first need to install the software by following the steps described in the Facepager Installation Instructions (.pdf document). It is also important to be patient, because some steps may require troubleshooting. Although Facepager is a relatively simple tool, it sometimes shows an error message because of a single missing step, and we may need to repeat the steps two or more times until we find it. In the live session I will try to help you solve some problems that may arise, but remember that there are also resources that can help you find a solution, such as the Facepager user group and its Wiki. You can also simply type the error message into your browser's search bar, where you may find a solution that other people have found.

This lesson is divided into four parts: 1) Before Starting, which contains information about Facepager, APIs, and what we need before we start using Facepager; 2) Getting to know Facepager IDE, where we will review the main components of the interface, its windows and its functionalities; 3) Extracting Facebook Data, where we will review step by step the actions needed to extract the posts from the page of a migrant association; 4) Exploring other parameters, where we will see other parameters that can be set up to extract more information, such as comments, replies and images.

That said, in this lesson we will not go over extracting data from other platforms, or extracting other types of content, such as videos or reactions. We will only make a couple of changes in the Presets and Parameters box so that participants can later continue their exploration of other functionalities according to their own interests.

In this lesson you will find the explanation of each of the four sections and the links to the files you may need.

Required Software

Facepager for performing web scraping from Facebook

Note: You need to have a Facebook account to fetch data in Facepager.

Facepager Installation Instructions

Facepager Installation Instructions.pdf

Required Data

At the beginning of this lesson, you don't need any data. We are going to be working with a Facebook page to extract posts and then, at the end of this lesson, we will create a .csv file to save it in a folder that you will create (see below).

Data Source:

https://www.facebook.com/OtrosDreams This is the page we are going to extract data from.

Data Description:

This lesson uses data in .csv format from a Facebook page. The database consists of posts from a Facebook page that belongs to an association of returned migrants in Mexico. Notably, this association publishes its texts in English, Spanish and Spanglish, sometimes within the same post.

Download Required Data

You must create a new folder on your desktop. Save it as: tapiwebscraping. Next, download the files below and save them in the tapiwebscraping folder.

At the end of Lesson 1 you will create a .csv file. You must save it in the tapiwebscraping folder.

Lesson 1

1.1 Before starting

Facepager is software developed by Jakob Jünger and Till Keyling. The Facepager Integrated Development Environment (IDE) is a tool for collecting data from APIs without using any programming language.
An API is the part of a server, at Facebook or another site, that receives requests and sends responses. Facepager communicates with the API to request access to specific data, and whether the API responds favorably depends on the site's privacy policies at the time. An API is like the front desk person in a company, and Facepager is the person who asks for information.
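
For readers who are curious about what the "response" in the front desk analogy looks like, here is a minimal, optional sketch in Python. You do not need it to use Facepager. The JSON text is invented for illustration (it is not real data from Facebook), and the keys message and created_time are borrowed from the fields used later in this lesson.

```python
import json

# Invented example of a response body; a real one is sent back by the API server.
response_text = """
{
  "data": [
    {"id": "111_1", "message": "Hola, welcome back!", "created_time": "2022-01-05T16:00:00+0000"},
    {"id": "111_2", "message": "Second post", "created_time": "2022-01-07T09:30:00+0000"}
  ]
}
"""

# Turn the JSON text into Python dictionaries and lists.
response = json.loads(response_text)

# Each item under "data" is one post, stored as key/value pairs.
messages = [post["message"] for post in response["data"]]
for post in response["data"]:
    print(post["created_time"], "->", post["message"])
```

Facepager does this reading for you and shows the same key/value pairs in its own windows.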

o Facepager is a very friendly tool for those of us who do not have experience in programming but are eager to start extracting and analyzing data. To get started, we need the following: the URL of the Facebook page. In this lesson we will use the Facebook page of an association called Otros Dreams en Acción. Its URL is: https://www.facebook.com/OtrosDreams

o Facebook allows Facepager to extract information from public Facebook pages, for example those of associations, organizations, celebrities, politicians, etc. However, it does not allow you to fetch data from private Facebook groups or private personal accounts.

o You must have a Facebook account, because Facepager asks you to log in before fetching data.

o You need the name ID or numeric ID of the Facebook page you plan to fetch. To get the numeric ID, paste the URL of the page into the box provided by one of several sites, such as: https://smallseotools.com/find-facebook-id/ There, you will get a numeric code. Keep that number so you can use it when Facepager requests it.

o It is important to define the parameters of the information you want to get. For instance: Do you just need the text from the posts? Or do you want to extract images as well? Do you want to extract the comments on a particular post? Do you want to get the replies to comments? Do you need the posts from a specific date or period?

o For this lesson, we will fetch the text of the posts created by the association (meaning that the posts are written by the association and not a repost of someone else’s post).

We will practice fetching data from a Facebook page that uses English, Spanish and Spanglish sometimes in the same post. This will imply certain difficulties in the following two lessons.
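
As an optional aside, the sketch below shows how the name ID relates to the URL of a page. It assumes that the name ID is simply the first part of the address after facebook.com/, and the function name page_name_from_url is made up for this example. It does not look up the numeric ID, for which you still need a site like the one mentioned above.

```python
from urllib.parse import urlparse


def page_name_from_url(url: str) -> str:
    """Return the first path segment of a page URL (hypothetical helper)."""
    path = urlparse(url).path  # for the lesson's page this is "/OtrosDreams"
    segments = [part for part in path.split("/") if part]
    return segments[0] if segments else ""


name_id = page_name_from_url("https://www.facebook.com/OtrosDreams")
print(name_id)  # OtrosDreams
```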

Jünger, Jakob / Keyling, Till (2020). Facepager. An application for automated data retrieval on the web. Source code and releases available at https://github.com/strohne/Facepager/.

1.2 Getting to know Facepager IDE

New Database -> Create a new database.
Open Database -> Opens a database you have saved before.
Add Nodes -> This will open a pop-up window requesting the name ID or numeric ID of the Facebook page(s) you want to fetch. Enter one name or numeric ID per line.
Delete Nodes -> If the result of the fetch did not come out as you wanted, you can select the nodes and delete them.
Presets -> This section helps you with the parameters. It contains the information on the type of data you need. For example, if you want to fetch images and the default parameters do not include that option, you can load and apply a preset so that the option appears in the Parameters Box.
APIs -> This section shows you the type of data you can fetch according to the API you communicate with. The Base path (Parameters Box) shows the name of the API you are contacting. You can switch to another API by clicking on the APIs button and choosing the one you prefer according to the data it allows you to fetch.
Export Data -> After fetching data you can export it to your computer using this function.
Help -> This provides you with some resources to learn more about Facepager, and places where you can interact with other people for troubleshooting.
Expand Nodes -> After fetching data, you can click on Expand nodes and you will see all the data you just extracted. Also, you can see a drop-down arrow on the left of the node. Click on the arrow and it will display all the extracted data.
Collapse Nodes -> Select a node and click on “Collapse nodes” and its data will be hidden again. The same will happen if you click on the drop-down arrow again.
Find Nodes -> It displays a box where you can write the name of the object id, object type or query information to find a particular node. It is useful when you have fetched large amounts of data, and from different nodes.

Object Window -> This is the window where the nodes are displayed. There you will be able to see all the messages (posts), the object type, the date of the post, the link to the image, etc. in columns and lines.
JSON Window -> When you select a node, this window shows the key (the type of information you requested, which matches the names of the columns in the Object Window) and the value, which is the extracted content.
Column Window -> In the Column Window you can see the titles of the columns that are displayed in the Object Window. By selecting a key in the JSON Window and then clicking on the Add Column button, you can display that column in the Object Window.
Parameters Box -> This box contains the Base path (which is the API link); the Resource (which is the type of information you want to fetch, for example, posts, comments, likes, reactions, images, etc.); and the Parameters, which give you a few more options to select, for example, since and until to set up a specific period for the data you want to extract.
Settings -> This box is mainly used when you fetch large amounts of data. It gives you the option to define the level of nodes (when you have a parent node and then child nodes, for example), the number of requests per minute, whether you want to create header nodes, etc. In this lesson we won’t use this box.
Fetch Data button -> After setting up the parameters, you can click on Fetch Data to extract the information from Facebook.
Status Log -> It reports every action you perform. It also gives you feedback in case of an error. For instance, when a name ID is not correct, it displays a message explaining that, so you can fix the error.
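
To see how the pieces of the Parameters Box fit together, here is an optional sketch in Python that combines a Base path, a Resource and some Parameters into one request address. The values are assumptions for illustration: the version number in the base path and the limit of 25 are made up, so check your own Parameters Box for the real ones. A real request also needs you to be logged in, which this sketch leaves out.

```python
from urllib.parse import urlencode

# Assumed values for illustration; your Parameters Box may show different ones.
base_path = "https://graph.facebook.com/v3.4"  # the version number is an assumption
resource = "/<page-id>/posts"
page_id = "OtrosDreams"  # the Object ID of the node
parameters = {
    "fields": "message,created_time",
    "limit": 25,
}

# Replace the placeholder with the Object ID, then append the parameters.
path = resource.replace("<page-id>", page_id)
request_url = base_path + path + "?" + urlencode(parameters)
print(request_url)
```

The comma between the field names is written as %2C in the printed address. That is the standard way of encoding a comma inside a web address.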

1.3 Extracting Facebook Data

Open Facepager IDE
Create a New Database and save it as “example” on your desktop.
Add Nodes. You will be asked to enter the Object ID of the page or pages you want to fetch. Paste the numeric ID and click OK. The Facebook page is https://www.facebook.com/OtrosDreams and you can find its numeric ID at https://smallseotools.com/find-facebook-id/

Now you should see a node in the Object Window.

· Log in to Facebook by clicking on the button from the Parameters Box.

· Now we are going to set up our parameters:

o Click on the dropdown box from Resources.

o Select /<page-id>/posts. *If you don’t have that option, you must go to Presets and select Facebook. That will open some options for you to load and apply to your Facepager.

o After that, proceed to go back to Resources and select /<page-id>/posts.

Click on the Fetch Data button

· After the process is complete, click on Expand nodes and you will see the data you extracted.

· Now explore the data and decide whether you want to keep all the columns, or whether you want to add more information. To do so, click on “data” on one of the lines of the fetched data, and all the information that was fetched will appear in the JSON Window.

· In the Column Window you will see the names of the columns that are displayed in the Object Window. If you want to add another column, click on the JSON line that contains the name of the information you want to include and click on Add Column (right below the window). That will add that category to the columns displayed in the Object Window.

· The information displayed in the Object Window is what will be exported to the csv file.

· To export the data to a csv file, click on “Export Data”. That will open a window where you have to give the file a name. We will use “example” again. Then, under “options”, change the separator from “;” (semicolon) to “,” (comma) and click on “save”. That will create the CSV file and save it on your computer's desktop.

· You have successfully extracted Facebook posts!!!
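
Why does the separator matter? The optional sketch below reads the same line of text twice in Python, once with each separator. The sample text and its column names are invented for illustration. The point is only that the program that opens the file must use the same separator that was chosen at export time.

```python
import csv
import io

# Invented sample of an exported file that still uses ";" as its separator.
exported_text = (
    "object_id;message;created_time\n"
    "111_1;Hola, welcome back!;2022-01-05T16:00:00+0000\n"
)

# Reading with the wrong separator splits the message at its comma.
wrong = list(csv.reader(io.StringIO(exported_text), delimiter=","))
# Reading with the separator the file actually uses keeps the three columns.
right = list(csv.reader(io.StringIO(exported_text), delimiter=";"))

print(len(wrong[1]), "fields when read with ','")
print(len(right[1]), "fields when read with ';'")
```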

1.4 Exploring other parameters

In this section we are going to repeat what we did but this time we will explore other parameters so you can extract more specific data.

· Open Facepager IDE

· Create a New Database and save it as “example” on your desktop.

· Add Nodes. It will request you to write the Object ID of the page or pages you want to fetch. Copy the numeric ID and click OK.

· Now you should see a node in the Object Window.

· Log in to Facebook by clicking on the button from the Parameters Box.

· Now we are going to set up our parameters:

o Click on the dropdown box from Resources.

o Select /<page-id>/posts. *If you don’t have that option, you must go to Presets and select Facebook. That will open some options for you to load and apply to your Facepager. After that, proceed to go back to Resources and select /<page-id>/posts.

o Posts from a specific period of time

Now we are going to set up the dates:

In the box below “limit”, write “since”. A default date will appear in the box on the right. Next, write “until” in the box below “since”, which will also get a default date. You can change the dates according to your needs.

o Since you may need the posts from a long period of time, change the “limit” to any number no larger than 100, and change the number in the Maximum pages box to more than 20.

· Click on the Fetch Data button

· After the process is complete, click on Expand nodes and you will see the data you extracted. Take a look at the “created_time” column and you should have the posts from the period of time you specified.
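
The optional sketch below shows the arithmetic behind these settings in Python. It assumes that each page of results holds at most “limit” posts, so the limit multiplied by Maximum pages gives an upper bound on the posts you can fetch per node. It also checks whether a “created_time” value falls inside a period. The timestamp format and the choice to treat “until” as the first moment outside the period are assumptions of this sketch, and the function name in_period is made up.

```python
from datetime import datetime, timezone

limit = 100         # posts requested per page (the upper bound used in this lesson)
maximum_pages = 25  # any value above 20, chosen here for illustration
max_posts = limit * maximum_pages
print("Upper bound on posts fetched:", max_posts)


def in_period(created_time: str, since: str, until: str) -> bool:
    """Check that a timestamp falls between two YYYY-MM-DD dates (hypothetical helper)."""
    # Assumed timestamp format, as in "2022-01-05T16:00:00+0000".
    created = datetime.strptime(created_time, "%Y-%m-%dT%H:%M:%S%z")
    start = datetime.strptime(since, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    end = datetime.strptime(until, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    # Assumption: "until" itself is treated as outside the period.
    return start <= created < end


print(in_period("2022-01-05T16:00:00+0000", "2022-01-01", "2022-01-30"))
```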

· Collecting comments from posts

Now we will continue using the data we have already collected. Click on Presets, then Facebook, then Get Comments, and click on “Apply”.

· In the Parameters box, you should now see the new parameters (time, parent, comment_count, etc.).

· Select only one of the posts (one line of the data) and click on “Fetch Data”.

· You will see the comments to that post right below the line of the post:

· Sometimes the result is “empty” and “offcut”, which means that this specific post does not have any comments. You can try selecting several posts or the complete list of posts, depending on how many comments you think they will have, and click on “Fetch Data” again.

· You should get the list of comments per post:

· To get the replies to the comments, select the comments and click on “Fetch Data”. That will extract the replies to the selected comments and add another level to the database. In the image below you can see level 0, which is the Object (the Facebook page); level 1, the comments; and level 2, the replies to comments.
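
The idea of levels can be sketched in a few lines of Python. This optional example is only an illustration: the rows are invented, and the names id, parent_id and level are assumed labels for "which node is this", "which node is it attached to" and "how deep is it". Each fetch adds children one level below the nodes you selected.

```python
from collections import defaultdict

# Invented rows for illustration; id, parent_id and level are assumed names.
rows = [
    {"id": 1, "parent_id": None, "level": 0, "text": "Page (level 0)"},
    {"id": 2, "parent_id": 1, "level": 1, "text": "Child A (level 1)"},
    {"id": 3, "parent_id": 2, "level": 2, "text": "Grandchild of A (level 2)"},
    {"id": 4, "parent_id": 1, "level": 1, "text": "Child B (level 1)"},
]

# Group every row under the node it is attached to.
children = defaultdict(list)
for row in rows:
    children[row["parent_id"]].append(row)


def outline(parent_id=None):
    """Return the nodes below parent_id as indented lines, parents before children."""
    lines = []
    for row in children[parent_id]:
        lines.append("  " * row["level"] + row["text"])
        lines.extend(outline(row["id"]))
    return lines


tree = outline()
print("\n".join(tree))
```

The printed outline is indented by level, much like the expanded nodes in the Object Window.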

Collecting pictures from posts

· Some Facebook pages have privacy settings that do not allow you to collect their pictures, but many other pages do. So, we are going to practice downloading pictures from a public Facebook page: https://www.facebook.com/dreaminmexico.

· First, we will Add nodes (and add the name or numeric id of the new page), and then we are going to set up the parameters to collect the pictures. Click on the small button next to the fields.

A small pop-up window will appear. Add a comma after the last field in the box and then add the word picture. Click on “OK”.

Now you should see the field “picture” in the Parameters box.

Then, select the Object you will fetch (in this case, level 0) and click on “Fetch Data”. You should see the “picture” column in the Object Window. If it is not there, click on data to open the JSON information, click on the picture key, and then click on “Add Column”. That will display the “picture” column in the Object Window. The data in the window does not show you the images, but rather the links to the images. If you want to download the images to your desktop, you must create a folder. Then, open the folder, right-click on the name of the folder and select Copy address.

Then, go to Facepager and paste the address into the Download box in the Generic tab of the Parameters box:

Check whether the Base path, which is the first box of the Parameters box, contains the word <picture>. If not, type it in. Next, click on “Fetch Data”. That will save the images in your folder.

· Now, explore the data and decide whether you want to keep all the columns, or whether you want to add more information. To do so, click on “data” on one of the lines of the fetched data, and all the information that was fetched will appear in the JSON Window.

In the Column Window you will see the names of the columns that are displayed in the Object Window. If you want to add another column, click on the JSON line that contains the name of the information you want to include and click on Add Column (right below the window). That will add that category to the columns displayed in the Object Window.

· The information displayed in the Object Window is what will be exported to the csv file.

· To export the data to a csv file, click on “Export Data”. That will open a window where you must give the file a name. We will use the name “example1”. Then, under “options”, change the separator from “;” (semicolon) to “,” (comma) and click on “save”. That will create the CSV file and save it on your computer's desktop.

· We can also create a database in Excel, which can make the nodes we extracted (parent and child, levels 0 and 1, for example) more intuitive to explore. To do so, open an Excel sheet:

· Click on “Data”:

Then click on From text/CSV:

That will open a pop-up window to select the file. Select the one you exported from Facepager and click “Import”.

Then, another window will ask you to select between two options (Upload or Upload in…). Select the second one.

Another window will ask you about the type of table you prefer. Select the default one, which is the first option (Table), and click “Accept” or “OK”.

Now you have created a Database in Excel that allows you to explore your data by level (node levels). You can select and unselect the information you want to explore. Remember that level 0 is the object (the first line in Facepager); level 1 will display the information of the Facebook page posts; and the level 2 will display the comments to posts.
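
If you prefer to explore the levels outside Excel, the optional sketch below does the same kind of filtering in Python. The text stands in for an exported file and is invented for illustration. The column names level, object_type and message are assumptions, so compare them with the first line of your own csv file.

```python
import csv
import io
from collections import Counter

# Invented stand-in for the exported file; the column names are assumptions.
exported_text = """level,object_type,message
0,seed,
1,data,First post
2,data,A comment on the first post
1,data,Second post
"""

rows = list(csv.DictReader(io.StringIO(exported_text)))

# Values read from a csv file are text, so the levels are "0", "1", "2".
per_level = Counter(row["level"] for row in rows)
posts = [row["message"] for row in rows if row["level"] == "1"]

print(dict(per_level))
print(posts)
```

To try this with your own export, you could replace io.StringIO(exported_text) with a file opened through open("example1.csv", newline="", encoding="utf-8"), assuming the file sits in the folder you run the code from.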

Congratulations! You have completed Lesson 1! BRAVO!!!

Now, you may proceed to Lesson 2, or you can practice some of what you learned today by going to Exercise 1.

Exercise

Extract the posts from the following Facebook page: https://www.facebook.com/NewComienzos
Follow these parameters: Only posts (not comments, not replies, not pictures, only text)
Posts from 01-01-2022 to 01-30-2022
Set the limit to 30 and the Maximum pages to 20.
Fields to extract: message, from, created_time, and updated_time.
Create a simple csv file and save it as “exercise1”

Questions for you:

How many posts did the association publish?
What is the text of the second post this association published in January and what was the date of publication?
What is the text of the last post they published in January?

Solution

How many posts did the association publish? R= 48 posts
What is the text of the second post this association published in January and what was the date of publication?

R= Happy Nerd Day! #NewComienzos

What is the text of the last post they published in January?

R= ¿Sabías qué...? #NewComienzos #WeAreCommunity #IsraelConcha #EnMéxicoTambiénHaySueños #BornReady
