Unit II: Scraping Twitter Data [Reading Time: 15 minutes]
What is web scraping?
Web scraping, which dates back to the 1990s, is the process of extracting unstructured data from web pages using bots or crawlers. Search engines like Google rely on it to find content that matches search queries. The practice can quickly become unethical if you ignore a website's terms and conditions. To avoid this problem, read the terms and conditions, check whether the website offers its own API, or simply ask the website owner for permission. If you are interested in learning more about ethical web scraping, check out this link here.
Before we move on, let's define what an API is and why it is important for scraping Twitter data.
What is an API?
API stands for Application Programming Interface. An API mediates connections between computers and/or computer programs. Unlike user interfaces, which mediate interactions between humans and computers, APIs help computers or software programs communicate with each other.
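To make the idea of programs talking to each other more concrete, here is a minimal Python sketch of what one side of an API exchange can look like. It is not Twitter's API: the endpoint `https://api.example.com/search`, the parameter names `q` and `count`, and the shape of the sample response are all hypothetical and chosen only for illustration. The sketch builds a structured request and reads a structured (JSON) reply, without making any network call.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration (not a real Twitter URL).
BASE_URL = "https://api.example.com/search"


def build_request_url(query, count=10):
    """Encode the search parameters into a URL a program could request."""
    params = {"q": query, "count": count}
    return f"{BASE_URL}?{urlencode(params)}"


# A canned reply standing in for what a server might send back.
# Real APIs define their own field names; these are assumptions.
SAMPLE_RESPONSE = json.dumps(
    {
        "results": [
            {"user": "alice", "text": "Hello #dataviz"},
            {"user": "bob", "text": "Learning about APIs"},
        ]
    }
)


def extract_texts(response_body):
    """Parse the JSON reply and pull out the text of each result."""
    data = json.loads(response_body)
    return [item["text"] for item in data["results"]]


print(build_request_url("#dataviz", count=2))
print(extract_texts(SAMPLE_RESPONSE))
```

The point of the sketch is that both the request and the reply follow a fixed, machine-readable format, which is what lets two programs communicate without a human reading a web page in between.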
Twitter and its APIs
Twitter offers a number of APIs to companies, developers, and its users. For the purposes of our tutorial, we are only interested in the standard API, which provides tweets matching a specific search query from the past seven days. If you're interested in getting access to Twitter's full-archive search, I suggest you apply for the Academic Research product track. To learn more about this process, check out this link here.
Let's scrape some Twitter Data!
There are various ways to scrape Twitter data. Some people write code in Python; others hire a company to scrape the data for them. This tutorial is indebted to the instructions provided by Martin Hawksey, creator of TAGS, and Zach Francis' helpful guide on TAGS. In this tutorial, I am going to show you how to scrape Twitter data with zero coding.
Brief Background:
I first came across TAGS in 2019 when I was attempting to gather live Twitter data for a research project on Indian nationalism. As someone with no prior coding experience, I found TAGS instrumental in my data collection process. It is a Twitter archiving Google Sheet created by Martin Hawksey, a learning technology advisor and an "EdTech Explorer". I used the TAGS Google Sheet to scrape tweets from hashtags I was interested in.
The TAGS Google Sheet accesses the standard Twitter API via your Twitter account. For folks who are new to coding or have never coded before, this Google Sheet has an easy setup that simply requires you to connect your Twitter account to the sheet. The only caveat is that it can only retrieve tweets up to seven days old, and there are limits to how much one can retrieve per hour. Since TAGS operates as an application, it is able to make 1500 requests per 15 minutes. To learn more about how rate limits work for Twitter, check out this link here. However, with some additional tweaking of the TAGS Google Sheet, users are able to retrieve the maximum number of queries, which can surpass the 1500 limit. Check out this blog post for more information.
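For readers curious about what the rate limit means in practice, the arithmetic can be sketched in a few lines of Python. The figures of 1500 requests and 15 minutes come from the paragraph above; the number of tweets returned per request (`TWEETS_PER_REQUEST`) is an assumption made only for illustration, so treat the tweet totals as a rough upper bound rather than a promise.

```python
import math

# Figures quoted in the tutorial text.
REQUESTS_PER_WINDOW = 1500
WINDOW_MINUTES = 15

# Assumption for illustration only: tweets returned by a single request.
TWEETS_PER_REQUEST = 100


def requests_per_hour(per_window=REQUESTS_PER_WINDOW, window_minutes=WINDOW_MINUTES):
    """How many requests fit into one hour at the given rate limit."""
    windows_per_hour = 60 // window_minutes
    return per_window * windows_per_hour


def windows_needed(total_requests, per_window=REQUESTS_PER_WINDOW):
    """How many rate-limit windows it takes to make total_requests requests."""
    return math.ceil(total_requests / per_window)


hourly_requests = requests_per_hour()
print(hourly_requests)                       # 1500 requests x 4 windows = 6000
print(hourly_requests * TWEETS_PER_REQUEST)  # upper bound on tweets per hour
print(windows_needed(4000))                  # 4000 requests span 3 windows
```

In other words, an archive that needs more than 1500 requests simply has to spread them over several 15-minute windows.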
Instructions for setting up TAGS Google Sheet
Prerequisites: A Google/Gmail account and a Twitter account
Type the following web address into your internet browser: https://tags.hawksey.info/get-tags/
Click on "TAGS v6.1" as it has an easy setup. Use the image below as a reference.