Web scraping

The total length of the videos in this section is approximately 18 minutes. Feel free to do this in multiple sittings! There is also a lot of text to read in this tutorial.

You can also view all the videos in this section at the YouTube playlist linked here.

This lecture was created by Katharine Liang '17, so that's whose voice you will hear. Some of these videos have made her famous on YouTube, in our own narrowly focused way.

This tutorial mentions the package plotly, which is great for making interactive graphics. Our tutorial on plotly is not posted at the moment, because we are updating it, but feel free to explore plotly on your own.

Introduction

Web scraping is a technique in computer science to extract information from websites. Before understanding how to write code to scrape a website and how that code works, it's helpful to have a basic understanding of how webpages work. The howstuffworks website has a great page on this. However, you do not need to know how to create a webpage of your own in order to scrape data. You may find it helpful to read and refer back to the following definitions.

Webpage Background Definitions

Markup Language: A markup language is used to process and annotate an electronic document with a set of tags. Examples include LaTeX, XML (Extensible Markup Language), GML (Generalized Markup Language), SGML (Standard Generalized Markup Language), and HTML (HyperText Markup Language). More information here.

HTML stands for Hypertext Markup Language and is a language used to write webpages. HTML uses various tags to format the content of the webpage. Having a basic idea of the structure and tags will help you understand how scraping works. Here is an overview of HTML.

CSS stands for Cascading Style Sheets and describes how HTML is displayed on a page. You can revise the graphic design of a document by editing a CSS file rather than changing the HTML. By separating the document content and the presentation you can have more flexibility when trying to format the page.

More about CSS if you would like: http://www.htmlhelp.com/reference/css/quick-tutorial.html

CSS Selectors (IMPORTANT). CSS selectors are patterns used to select elements of the page you want to design. Here is a reference guide to some different CSS selectors.
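To make the patterns concrete, here is a small illustration of the selector types you will use most often, shown with rvest's html_nodes() function (introduced later in this tutorial). The tiny HTML document is made up for the example:

    # Common CSS selector patterns, demonstrated on a made-up HTML snippet.
    # read_html() can parse a string of HTML directly.
    library(rvest)

    page <- read_html("<html><body>
      <p id='intro' class='lead'>Hello</p>
      <p>World</p>
      <div class='lead'>Bye</div>
    </body></html>")

    html_nodes(page, "p")      # by tag name: both <p> elements
    html_nodes(page, ".lead")  # by class: the first <p> and the <div>
    html_nodes(page, "#intro") # by id: just the first <p>
    html_nodes(page, "p.lead") # tag AND class: just the first <p>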

XML stands for Extensible Markup Language and is another markup language that is used for storing and transporting data. XML is just information wrapped in tags.

XML and HTML

  • XML acts as a complement to HTML. On a website, the HTML tags control how the elements on a page are displayed, and XML controls the elements of the data on the page. These elements of data are called nodes.

  • XML tags, unlike HTML tags, are not predetermined. HTML has tags like <p>, <h1>, <table> that are standard whenever you code in HTML. In XML, the author defines the tags and the document structure.

XML Tree Structure

XML documents form a tree that starts at a root and branches out to leaves.

This page shows examples of an XML tree structure in an XML document.
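You can also explore a tree like this from R. Here is a small sketch using the xml2 package (which rvest is built on); the document itself is made up for the example:

    # An XML document is a tree: <bookstore> is the root, each <book> is a
    # branch, and <title>/<author> are leaves. Parsed with the xml2 package.
    library(xml2)

    doc <- read_xml('
    <bookstore>
      <book category="data">
        <title>R for Data Science</title>
        <author>Wickham</author>
      </book>
      <book category="web">
        <title>XML Basics</title>
        <author>Unknown</author>
      </book>
    </bookstore>')

    xml_name(doc)                            # "bookstore" -- the root
    xml_name(xml_children(doc))              # "book" "book" -- the branches
    xml_text(xml_find_all(doc, ".//title"))  # the text inside the leaves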

DOM

DOM stands for Document Object Model and is a cross-platform convention for accessing documents in HTML and XML. The document is organized in a tree structure called the DOM tree.

There are three levels:

  1. Core DOM - for structured documents

  2. XML DOM - for XML documents

  3. HTML DOM - for HTML documents

Here is the reference page for DOM.

Here is a page about XML DOM nodes.

How the browser displays content

  1. The markup language and CSS are converted into the DOM; the content of the document and the style of the document are combined.

  2. The browser displays the content of the DOM.


Ready to start scraping with R?

R is usually not the first language people go to when attempting to scrape a website. But the advantage of using R is that once you have extracted and formatted your data, you can go straight into analyzing the information with ease.

Things To Consider Before You Begin

  1. Ethics: Do you have permission to scrape that website? Remember that the content on a page was written by someone, and it is their intellectual property. Website owners use a /robots.txt file to tell robots visiting their website which parts of the site they are and are not allowed to access. You can technically still scrape a disallowed page, but it is not a courteous thing to do. Other websites, like databases that require a login, have strong security, and in that case scraping isn't even possible. (A quick way to check a site's robots.txt from R is sketched just after this list.)

  2. Static v. Dynamic: Is your website static or dynamic? A static website is usually written in plain HTML. A dynamic site is usually written using a server-side language like JSP (JavaServer Pages) or PHP. Generally, you can tell a site is dynamic if it is interactive or animated. This makes a huge difference in the method and difficulty of scraping the website.
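As promised, here is a sketch of the robots.txt courtesy check. It assumes the robotstxt package (which is not used elsewhere in this tutorial) and uses Wikipedia purely as an example domain:

    # A sketch of checking scraping permissions with the robotstxt package
    # (install.packages("robotstxt")). Wikipedia is just an example domain.
    library(robotstxt)

    # Fetch and inspect the site's robots.txt rules
    rt <- robotstxt(domain = "en.wikipedia.org")
    rt$permissions

    # TRUE means the path is allowed for a generic bot ("*")
    paths_allowed("https://en.wikipedia.org/wiki/Web_scraping")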

Packages

If you have tried reading about web scraping in R, you may have encountered a list of different packages, including RCurl, XML, rvest, RSelenium, and scrapeit.core. We are going to talk about how to use rvest and RSelenium. XML is also fairly easy to learn, and you can find materials for it here. There is also a blog post tutorial about XML found here.

rvest: Scraping static websites

Rvest is a package created by Hadley Wickham designed to scrape static web pages.

Rvest package functions

  • html_nodes(): select nodes using a CSS or XPath selector; the input is the selector for what you want (e.g., if I want a table, the input is "table")

  • html_attrs(): get the attributes of an HTML node (html_attr() for a single attribute)

  • html_text(): get the text content of an HTML node

  • html_name(): get the tag name of an HTML node

  • html_table(): coerce an HTML table into a data frame
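Putting these functions together, here is a minimal sketch of the typical rvest workflow. The Wikipedia URL is just a stand-in; any static page with a table would do:

    # A minimal rvest workflow: download a page, select nodes, extract data.
    library(rvest)

    page <- read_html("https://en.wikipedia.org/wiki/List_of_sovereign_states")

    # Select all <table> nodes, then coerce the first one into a data frame
    tables <- page %>% html_nodes("table")
    df <- html_table(tables[[1]], fill = TRUE)

    # Pull the text out of every link on the page
    page %>% html_nodes("a") %>% html_text() %>% head()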

How to select nodes for html_nodes()

I find it easier to read CSS selectors rather than XPath selectors.

You can use a Google Chrome extension called SelectorGadget, which will allow you to generate the CSS selectors for scraping. Click this link to download it and watch the video on the page to learn how to use it. Please get SelectorGadget before watching the tutorial, and make sure you watch the video that explains how to use it. The following video assumes you know how to use it.


Please download the code file used in the video and follow along.

Webscraping.1.Tutorial.mp4

Question 1: I am trying to find the CSS selector for the table containing all the titles, rankings, ratings, and numbers of reviews for all the movies on the Rotten Tomatoes Top 100 Documentary Movies list.

(July 10, 2022: just learned that the link below disappeared - ignore this question for now! We're figuring out a good substitute.)

https://www.rottentomatoes.com/top/bestofrt/top_100_documentary_movies/

Using SelectorGadget, what is the CSS selector for the table of the Top 100 documentary movies?

  • .allow-overflow , .panel-heading

  • td

  • .articleLink

  • th , #top_movies_main td

Show answer

th , #top_movies_main td

Explanations:

A) .allow-overflow , .panel-heading is the CSS selector for the whole panel that the table is inside of.

B) td selects all the tables on the page, including the Certified Fresh In Theaters table.

C) .articleLink just selects the list of titles.

D) th , #top_movies_main td correctly selects just the table with the Top 100 documentary movies.
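For reference, the winning selector would have been used like this. Since the page has disappeared, treat this as a sketch rather than something you can run today:

    # How the quiz selector would have been passed to rvest (page now gone)
    library(rvest)

    url <- "https://www.rottentomatoes.com/top/bestofrt/top_100_documentary_movies/"
    top_docs <- read_html(url) %>%
      html_nodes("th , #top_movies_main td") %>%
      html_text()
    head(top_docs)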

Scraping dynamic websites

The last tutorial covered web scraping static websites using rvest. This next one is going to show you how to scrape a dynamic webpage using rvest together with something called phantomJS. You can download phantomJS here. Once you unzip the download, you need to save phantomjs.exe in your working directory. If you have trouble, or want to know what phantomJS is, you can read about it here.

The R file for the next video is found here: Scraping_Java_based_pages_function.R. The cheatsheet for the selectors is found here; I used the second chart on the page for the tutorial.

Warning: This is not meant to be an easy tutorial, and the R file is not meant to teach JavaScript or how phantomJS really works. You can read more about phantomJS here if the code inside the writeLines function bothers you.



Please download the code file used in the video and follow along. The site scraped in the video may have changed, so don't worry if you can't reproduce all of the steps.

Webscraping.2.PhantomJS.mp4
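If you just want the general shape of the phantomJS approach, here is a minimal sketch. The URL, the file names, and the 2.5-second wait are all placeholder choices, not the exact code from the video:

    # Sketch: render a dynamic page with phantomJS, then scrape it with rvest.
    library(rvest)

    # Write a small phantomJS (JavaScript) script that loads the page, waits
    # for its scripts to run, and saves the rendered HTML to a file.
    writeLines("
    var page = require('webpage').create();
    page.open('http://example.com', function (status) {
      setTimeout(function () {
        var fs = require('fs');
        fs.write('rendered.html', page.content, 'w');
        phantom.exit();
      }, 2500);  // wait 2.5 seconds for the page's JavaScript to finish
    });
    ", "scrape.js")

    # Run phantomJS from R, then scrape the rendered file as usual
    system("./phantomjs scrape.js")  # on Windows: system('phantomjs.exe scrape.js')
    page <- read_html("rendered.html")
    page %>% html_nodes("table") %>% html_table()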

Instead of using phantomJS, people have traditionally used a package called RSelenium. There are great resources out there for learning how to scrape with Selenium. If you look at the cheatsheet used in the last tutorial to find XPath selectors, you will see there are columns for Selenium.

Here is a blog post by Andrew Brooks about scraping with Selenium. You can also read this page by Computerworld that talks about RSelenium; it includes a reference table, a webinar video, and a nice introduction.


You're done!