Unit II: Scraping Twitter Data [Reading Time: 15 minutes]

What is web scraping?

Web scraping is the process of extracting unstructured data from web pages using bots or crawlers. Introduced in the 1990s, it is the primary way search engines like Google find the content they return for search queries. The practice can quickly become unethical if one ignores the terms and conditions of the web page being scraped. To avoid this problem, read the terms and conditions, check whether the website has its own API, or simply ask the website owner for permission. If you are interested in learning more about ethical web scraping, check out this link here.

However, before we move on, let's define what an API is and why it is important for scraping Twitter data.

What is an API?

API stands for Application Programming Interface. An API mediates connections between computers and/or computer programs. Unlike user interfaces, which mediate interactions between humans and computers, APIs help computers or software programs communicate with each other.

Twitter and its APIs

Twitter offers a number of APIs to companies, developers, and its users. For the purposes of this tutorial, we are only interested in the standard API, which returns tweets matching a specific search query from the past seven days. If you're interested in getting access to Twitter's full-archive search, I suggest you apply for the Academic Research product track. To learn more about this process, check out this link here.
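To make the idea of a search query against the standard API more concrete, here is a minimal Python sketch of how such a request could be addressed. It only builds the request URL and never contacts Twitter; a real call also needs authentication, which TAGS handles for you. The endpoint is the v1.1 standard search endpoint as I understand it, so check Twitter's current documentation before relying on it. The helper name build_search_url and the example hashtag are made up for illustration.

```python
from urllib.parse import urlencode

# Assumption: the v1.1 standard search endpoint; verify against Twitter's docs.
SEARCH_ENDPOINT = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_url(query, count=100):
    """Return the URL for one search request (hypothetical helper)."""
    # urlencode escapes characters such as '#' so they are safe inside a URL.
    params = {"q": query, "count": count, "result_type": "recent"}
    return SEARCH_ENDPOINT + "?" + urlencode(params)


url = build_search_url("#dataviz")
print(url)
```

Notice that the program, not a person, is the one reading and writing this address. That is the sense in which an API lets software talk to software.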

Let's scrape some Twitter Data!

There are various ways to scrape Twitter data. Some people write code in Python, while others hire a company to scrape the data for them. In this tutorial, I am going to show you how to scrape Twitter data with zero coding. The tutorial is indebted to the instructions provided by Martin Hawksey, creator of TAGS, and to Zach Francis' helpful guide on TAGS.

Brief Background:

I first came across TAGS in 2019 when I was attempting to gather live Twitter data for a research project on Indian nationalism. As someone with no prior coding experience, I found TAGS instrumental in my data collection process. It is a Twitter archiving Google Sheet created by Martin Hawksey, a learning technology advisor and "EdTech Explorer". I used the TAGS Google Sheet to scrape tweets from hashtags I was interested in.

The TAGS Google Sheet accesses the standard Twitter API via your Twitter account. For folks who are new to coding, the sheet has an easy setup that simply requires you to connect your Twitter account to it. The caveats are that it can only retrieve tweets from the past seven days and that there are limits on how much you can retrieve per hour. Since TAGS operates as an application, it is able to make 1500 requests per 15 minutes. To learn more about how rate limits work for Twitter, check out this link here. However, with some additional tweaking of the TAGS Google Sheet, users can retrieve more than the 1500 limit allows. Check out this blog post for more information.
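The arithmetic behind a rate limit can be sketched in a few lines of Python. The 1500 requests per 15 minutes figure comes from the paragraph above. The 100 tweets per request is my assumption about the standard search API, not something TAGS guarantees, so treat the result as a theoretical ceiling rather than what you will actually collect. The function names are hypothetical.

```python
def requests_per_hour(requests_per_window=1500, window_minutes=15):
    """How many requests fit into one hour under a windowed rate limit."""
    windows_per_hour = 60 // window_minutes
    return requests_per_window * windows_per_hour


def max_tweets_per_hour(tweets_per_request=100):
    """Upper bound on tweets per hour (assumes 100 tweets per request)."""
    return requests_per_hour() * tweets_per_request


print(requests_per_hour())
print(max_tweets_per_hour())
```

In practice a search rarely fills every request, so the number of tweets you collect in an hour will usually be far below this ceiling.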

Instructions for setting up TAGS Google Sheet

Prerequisites: A Google (Gmail) account and a Twitter account

1. Type the following web address in your internet browser: https://tags.hawksey.info/get-tags/

2. Click on "TAGS v6.1" as it has an easy setup. Use the image below as a reference.

Screenshot of TAGS homepage which shows a twitter background and google spreadsheet. Below this are two options. One that says TAGS v6.0 and the other says TAGS v6.1. The TAGS v6.1 is circled in red.

3. It will redirect you to a Google Sheets page and ask whether you would like to make a copy of TAGS v6.1.9.1. Click on "make a copy".

4. You will be taken to a Google sheet that looks something like the image below.

A TAGS google sheet and the TAGS option on the top is circled in red

5. Click on TAGS and select "Setup Twitter access".

The image shows that the TAGS drop down menu is selected and then the cursor is hovering over setup twitter access option.

6. A popup will appear asking you to authorize the TAGS script. Press continue.

7. Upon pressing continue, you will be redirected to a page where Google will ask you to choose the Google account you want TAGS to access. Select the Google account you have created, and a popup box will tell you that the app is not verified. [TAGS is currently working to get the app verified by Google, but in case you are worried, I suggest checking out Martin Hawksey's detailed about page on TAGS and its gradual evolution since 2010.]

8. In this popup box, select "Advanced" and click on "Go to TAGS v6.1 Client (unsafe)".

9. This will take you to another page that will ask you to authorize the TAGS app. Click allow. See the image below for reference.

10. Once you click allow, another popup box will appear asking you to authorize Twitter. Select the easy setup option.

11. Google may once again ask you to authorize TAGS to interact with your Google account. If this happens, click review permission and then allow again.

12. Once you are done authorizing, you should be taken to a page that asks you to log in to your Twitter account. See the image below for reference.

13. Once you are signed in, the TAGS Google Sheet should be connected to your Twitter account. If you click on TAGS once again in your Google Sheet, the menu should now offer the option to disconnect Twitter access. This is a good way to check whether you have successfully linked the sheet to your Twitter account.

The TAGS dropdown menu on Google Sheets. It has a variety of options such as Setup Twitter access, Disconnect Twitter Access, Run Now! and many more.

14. In the ninth row of the google sheet, click on the Search tab and enter any search term you are interested in getting tweets about. I suggest checking out the advanced settings tabs if you want to change the default settings. If you get stumped, check out the support forum.

The 9th row of the TAGS google sheet where users are prompted to enter their search terms into the text box

15. Then go back to TAGS and select "Run Now!".

The TAGS dropdown menu where the Run Now option is selected

16. The Google Sheets workbook will now start collecting tweets that match your search term in the Archive sheet. If you are interested in continually scraping tweets over a particular period of time, I suggest selecting the "update archive every hour" option as well.

17. To export the tweets you have collected, download the Google Sheet as a .csv or Excel spreadsheet onto your local computer. You can now experiment with this data by loading it into computational text analysis software such as Voyant Tools, or perform simple content analysis in the Google Sheet itself. However, I recommend that you first clean up your scraped Twitter data in OpenRefine so that it is easier to work with moving forward.
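If you later want to try a little code on your exported file, the simple content analysis mentioned in step 17 could look something like the Python sketch below, which counts hashtags. It runs on a tiny made-up sample instead of a real export. I am assuming the tweet text sits in a column named "text" and the author in "from_user"; open your own .csv and check the header row first. The function name count_hashtags is hypothetical.

```python
import csv
import io
import re
from collections import Counter

# Made-up stand-in for a TAGS export; the column names are an assumption.
SAMPLE_CSV = """from_user,text
alice,Learning to scrape with #TAGS and #OpenRefine
bob,RT @alice: Learning to scrape with #TAGS and #OpenRefine
carol,"Cleaning data today, thanks #openrefine"
"""


def count_hashtags(csv_file, text_column="text"):
    """Count hashtags (case-insensitively) across all rows of a CSV file."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        for tag in re.findall(r"#\w+", row[text_column]):
            counts[tag.lower()] += 1
    return counts


counts = count_hashtags(io.StringIO(SAMPLE_CSV))
print(counts.most_common(2))
```

To run it on a real export, replace the io.StringIO(SAMPLE_CSV) argument with a file you have opened yourself, for example open("archive.csv", newline="", encoding="utf-8"), where archive.csv is a placeholder for whatever you named your download.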

Other Resources

If you are not interested in scraping Twitter data and want to work with existing Twitter datasets, I recommend you check out this fantastic tutorial by the Programming Historian: Beginner's Guide to Twitter Data

If you are feeling limited by the TAGS easy setup and want more control over your Twitter scraping, I recommend checking out Python libraries such as Tweepy that assist with Twitter scraping. Here is a guide that shows you how. Keep in mind that this route requires a developer account.
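For a rough sense of what the Tweepy route involves, here is a minimal sketch, assuming Tweepy version 4 and a bearer token from your developer account. The helper names build_query and fetch_recent_tweets are my own, not part of Tweepy; the Tweepy calls themselves are tweepy.Client and its search_recent_tweets method. Only the query-building helper runs here, since the fetch needs credentials and network access.

```python
def build_query(hashtag, exclude_retweets=True):
    """Build a search query string for one hashtag (hypothetical helper)."""
    query = "#" + hashtag.lstrip("#")
    if exclude_retweets:
        # "-is:retweet" is a search operator that filters out retweets.
        query += " -is:retweet"
    return query


def fetch_recent_tweets(bearer_token, query, max_results=10):
    """Return the text of recent tweets matching the query.

    Needs network access and a valid bearer token, so it is not called below.
    """
    import tweepy  # third-party library: pip install tweepy

    client = tweepy.Client(bearer_token=bearer_token)
    # max_results must be between 10 and 100 for this endpoint.
    response = client.search_recent_tweets(query=query, max_results=max_results)
    # response.data is None when nothing matches, so fall back to an empty list.
    return [tweet.text for tweet in (response.data or [])]


print(build_query("dataviz"))
```

With a token in hand, a call such as fetch_recent_tweets(your_token, build_query("dataviz")) would return a list of tweet texts, where your_token is a placeholder for your own credential.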

Lastly, if you are new to data cleaning and are feeling overwhelmed by the sheer amount of information you have scraped using TAGS, check out this tutorial by the Programming Historian on OpenRefine: Cleaning Data with OpenRefine

Review [Runtime 5 mins]

Spend five minutes perusing the following site: https://socialmediadata.org/

Ethical Research Conduct