The total length of the videos in this section is approximately 18 minutes. Feel free to do this in multiple sittings! There is also a lot of text to read in this tutorial.
You can also view all the videos in this section at the YouTube playlist linked here.
This lecture was created by Katharine Liang '17 and edited by Fridah Ntika '25.
This tutorial mentions the package plotly, which is great for making interactive graphics. Our tutorial on plotly is not posted at the moment, because we are updating it, but feel free to explore plotly on your own.
Web scraping is a technique in computer science to extract information from websites. Before writing code to scrape a website, it's helpful to have a basic understanding of how webpages work. The HowStuffWorks website has a great page on this. However, you do not need to know how to create a webpage of your own in order to scrape data. You may find it helpful to read and refer back to the following definitions.
Markup Language: A markup language is used to process and annotate an electronic document with a set of tags. Examples include LaTeX, XML (Extensible Markup Language), GML (Generalized Markup Language), SGML (Standard Generalized Markup Language), and HTML (HyperText Markup Language). More information here.
HTML stands for Hypertext Markup Language and is a language used to write webpages. HTML uses various tags to format the content of the webpage. Having a basic idea of the structure and tags will help you understand how scraping works. Here is an overview of HTML.
CSS stands for Cascading Style Sheets and describes how HTML is displayed on a page. You can revise the graphic design of a document by editing a CSS file rather than changing the HTML. By separating the document content and the presentation, you can have more flexibility when trying to format the page.
More about CSS if you would like: http://www.htmlhelp.com/reference/css/quick-tutorial.html
CSS Selectors (IMPORTANT). CSS selectors are patterns used to select elements of the page you want to design. Here is a reference guide to some different CSS selectors.
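To give a flavor of what these patterns look like, here are the selector forms you will use most often (the class and id names below are made up for illustration):

```css
/* Tag selector: every <p> element on the page */
p { }

/* Class selector: every element with class="pie-chart" (hypothetical class name) */
.pie-chart { }

/* ID selector: the single element with id="summary" (hypothetical id) */
#summary { }

/* Descendant selector: every <a> element inside a <table> element */
table a { }
```

When scraping, you use these same patterns not to style elements but to pick out which elements to extract.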
XML stands for Extensible Markup Language and is another markup language that is used for storing and transporting data. XML is just information wrapped in tags.
XML and HTML
XML acts as a complement to HTML. On a website, the HTML tags control how the elements on a page are displayed, while XML stores and structures the data on the page. These pieces of data are called nodes.
XML tags, unlike HTML tags, are not predetermined. HTML has tags like <p>, <h1>, and <table> that are standard. In XML, the author defines the tags and the document structure.
XML Tree Structure
XML documents form a tree that starts at a root and branches out to leaves.
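As a sketch, here is a tiny XML document with author-defined tags: `<library>` is the root, `<book>` is a branch, and `<title>` and `<author>` hold the leaf values.

```xml
<library>
  <book>
    <title>R for Data Science</title>
    <author>Hadley Wickham</author>
  </book>
</library>
```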
This page shows examples of an XML tree structure in an XML document.
DOM
DOM stands for Document Object Model and is a cross-platform convention for accessing documents in HTML and XML. The document is organized in a tree structure called the DOM tree.
There are three levels:
Core DOM - for any structured document
XML DOM - for XML documents
HTML DOM - for HTML documents
Here is the reference page for the DOM
Here is a page about XML DOM nodes
How the browser displays content
The browser converts the markup language and CSS into the DOM, where the content of the document and the style of the document are combined.
The browser displays the content of the DOM.
R is usually not the first language people reach for when scraping a website. But the advantage of using R is that once you have extracted and formatted your data, you can go straight into analyzing it with ease.
Things To Consider Before You Begin
Ethics: Do you have permission to scrape that website? Remember that the content on a page was written by someone and that it is their intellectual property. Website owners use a /robots.txt file to tell visiting robots which parts of the site they are allowed to access. You can still scrape pages that robots.txt disallows, but it is not a courteous thing to do. Some other websites, like databases that require a login, have strong security, and in that case scraping isn't even possible.
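You can read a site's robots.txt file directly from R before you scrape. Here is a minimal sketch using base R; the Wikipedia URL is just an illustration:

```r
# Inspect a site's robots.txt before scraping (base R, no extra packages needed)
robots <- readLines("https://en.wikipedia.org/robots.txt")

# Look at the first few rules to see which paths robots may visit
head(robots, 20)
```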
Static v. Dynamic: Is your website static or dynamic? A static website is usually written in plain HTML. A dynamic site is usually written using a server-side language like JSP (JavaServer Pages) or PHP. Generally, you can tell a site is dynamic if it is interactive or animated. This makes a huge difference in the method and difficulty of scraping the website.
Packages
If you have tried reading about web scraping in R, you may have encountered a list of different packages, including RCurl, XML, rvest, RSelenium, and scrapeit.core. We are going to talk about how to use rvest and RSelenium. XML is also fairly easy to learn, and you can find materials for it here. There is also a blog post tutorial about XML found here.
rvest is a package created by Hadley Wickham and designed to scrape static web pages.
Rvest package functions
html_nodes(): select nodes using a CSS or XPath selector. The input is the node you want to select (e.g., if I want a table, the input is "table")
html_attrs(): get the attributes of an HTML node (html_attr() for a single attribute)
html_text(): get the text content of an HTML node
html_name(): get the tag name of an HTML node
html_table(): coerce a table node into a data frame
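Putting these together, a typical rvest workflow looks something like the sketch below. It assumes rvest is installed; the page is the Wikipedia population list used in the exercise later in this tutorial, and the selectors are plain tag selectors:

```r
# A minimal sketch of the rvest workflow (assumes the rvest package is installed)
library(rvest)

# Read the page's HTML into R
page <- read_html("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population")

# Select all <table> nodes using a CSS selector
tables <- html_nodes(page, "table")

# Coerce the first table into a data frame for analysis
pop <- html_table(tables[[1]], fill = TRUE)

# Inspect tag names, attributes, and text of selected nodes
html_name(tables)
html_attrs(tables[[1]])
html_text(html_nodes(page, "h1"))
```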
How to select nodes for html_nodes()
I find it easier to read CSS selectors than XPath selectors.
You can use a Google Chrome extension called SelectorGadget, which allows you to generate the CSS selectors for scraping. Click this link to download it, and watch the video on the page to learn how to use it. Make sure to also read the description below the video. The following video assumes you know how to use it.
Download the code file used in the video and follow along.
Question 1: I am trying to find the css selector for the object containing the pie chart and the summarized Population distribution by country in June-July 2025 on Wikipedia's List of countries and dependencies by population webpage.
https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population
Using the selector gadget, what is the CSS selector for the pie chart object?
A) .mw-default-size
B) .pie25
C) tr
D) .smooth-pie-container
Explanations:
A) .mw-default-size - selects the cartogram of the world's population in 2018
B) .pie25 - selects 25% of the pie chart
C) tr - selects all the rows in the table "List of countries and territories by total population."
D) .smooth-pie-container - correctly selects the object with the summarized population distribution and pie chart
In Python, scraping a static website can be done using the Beautiful Soup library. Read more about it here: beautiful-soup-4.readthedocs.io/en/latest/
Instead of using Chromote, people also use a package called RSelenium. There are great resources out there to learn how to scrape with Selenium. If you look at the cheatsheet used in the last tutorial to find XPath selectors, you will see there are columns for Selenium.
Here is a blog post by Andrew Brooks about scraping with Selenium. You can also read this page by Computerworld that talks about RSelenium; it includes a reference table, a webinar video, and a nice introduction.
You're done!