Cultural Analytics [3 CFU]

Prof. Carlos Castillo, Dr. Giorgio Barnabò

May 2022

Wikipedia beyond its encyclopedic value

9 May 2022, 10:30-13:30 (room B2); 10 May 2022, 09:00-13:00 (Aula Magna)
Both online and in person (DIAG department, Sapienza University)

Prof. Diego Saez-Trumper (Wikimedia)

Abstract
In this lecture, we will take Wikipedia (and its sister projects) both as an object of study and as a large data repository. We will learn the basics of how Wikipedia works and of the data that is produced and shared as a result of Wikipedians' contributions and interactions. We will review a set of tools for consuming and processing this data, discuss some problems that can be solved using Wikipedia data, and touch on some open questions in this field. Some of the datasets and tools we will cover during the lecture are the following:

  • Static Dumps: Full Wikipedia dumps, where to get them, and how to parse them.

  • MediaWiki Utilities: The Python packages for processing MediaWiki data.

  • Wikimedia API: The Wikimedia API for programmatic access to content and metadata.

  • Pageviews API: How to retrieve detailed pageview counts for any Wikipedia page.

  • Quarry: The web interface for running SQL queries against Wikimedia's database replicas.

  • Clicks: Explanation of the clickstream dataset (navigation paths within Wikipedia).

  • Event Stream: Explanation of the (live) EventStreams feed.

  • Wikidata: How to interact with this (semantic) knowledge base.

  • ORES: The public machine-learning-based quality-control service.

We will conclude the lecture with hands-on work interacting with some of the datasets described above; a few minimal sketches at the end of this entry illustrate this kind of programmatic access.

Slides of the lectures: link.
In order to learn how to use the tools discussed in class, Prof. Diego Saez-Trumper has proposed an easy homework assignment that you are advised to complete. A notebook with a practical example of how to perform most of the tasks required in the homework can be found here.
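
For instance, a minimal Python sketch along the following lines queries the public Pageviews REST API for the daily view counts of a single article; the article title, date range, and User-Agent string are illustrative placeholders rather than anything prescribed by the lecture or the homework.

    import requests

    # Pageviews REST API: daily views for one article (illustrative parameters).
    # Endpoint pattern:
    # /metrics/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}
    URL = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        "en.wikipedia/all-access/user/Rome/daily/20220501/20220531"
    )

    # Wikimedia asks clients to identify themselves with a descriptive User-Agent.
    headers = {"User-Agent": "cultural-analytics-course-example (student@example.org)"}

    response = requests.get(URL, headers=headers, timeout=30)
    response.raise_for_status()

    for item in response.json()["items"]:
        # Each item carries the article, a timestamp (YYYYMMDDHH), and the view count.
        print(item["timestamp"], item["views"])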
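
In the same spirit, the live event stream can be sampled by reading the Server-Sent Events feed of recent changes; the choice of stream and the five-event cut-off below are just demo choices.

    import json
    import requests

    # EventStreams exposes Wikimedia events as Server-Sent Events (SSE).
    STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"
    headers = {"User-Agent": "cultural-analytics-course-example (student@example.org)"}

    with requests.get(STREAM_URL, headers=headers, stream=True, timeout=60) as response:
        seen = 0
        for raw_line in response.iter_lines():
            # SSE payload lines start with "data: " and contain one JSON event each.
            if raw_line.startswith(b"data: "):
                event = json.loads(raw_line[len(b"data: "):])
                print(event.get("wiki"), event.get("title"), event.get("type"))
                seen += 1
                if seen >= 5:  # stop after a handful of events for the demo
                    break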
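
Finally, Wikidata can be queried programmatically through its public SPARQL endpoint; the toy query below (instances of "house cat", wd:Q146 via wdt:P31) is the standard introductory example and is not tied to the homework.

    import requests

    # Wikidata Query Service SPARQL endpoint.
    ENDPOINT = "https://query.wikidata.org/sparql"

    # Toy query: a few items that are instances of (P31) "house cat" (Q146).
    QUERY = """
    SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31 wd:Q146 .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 5
    """

    headers = {"User-Agent": "cultural-analytics-course-example (student@example.org)"}
    response = requests.get(
        ENDPOINT, params={"query": QUERY, "format": "json"}, headers=headers, timeout=60
    )
    response.raise_for_status()

    for row in response.json()["results"]["bindings"]:
        print(row["item"]["value"], row["itemLabel"]["value"])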

Measuring the Happiness, Health, & Stories of Society through the Sociotechnical Dynamics of Social Media and Fiction

24 May 2022, 15:00-17:00
Online only.

Prof. Chris Danforth (University of Vermont), Prof. Peter Dodds (University of Vermont)

Abstract
This talk will describe a suite of physically inspired instruments we've developed to enable the exploration of large-scale text data, illuminate collective behavioral patterns, and develop a science of stories. Along with our flagship efforts at http://hedonometer.org and https://storywrangling.org, we show how Instagram photos reveal markers of depression prior to formal diagnosis, and how Twitter topic dynamics ranked Trump as more popular than God throughout his presidency. Finally, we present evidence in support of a hypothesis posed by author Kurt Vonnegut, namely that there are only a few emotional arcs (or modes) exhibited by the vast majority of works of fiction.

Slides of the lectures: link.
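
To make the idea of such an instrument concrete: the hedonometer builds on a lexicon of crowd-rated word happiness scores, and a text's score is, in essence, a frequency-weighted average over that lexicon. The sketch below uses a tiny invented lexicon and sentence purely to illustrate the computation; the real instrument relies on the much larger labMT word list and additional filtering.

    from collections import Counter

    # Tiny illustrative lexicon of word-happiness scores on a 1-9 scale
    # (invented values; the real instrument uses the crowd-rated labMT list).
    happiness = {"love": 8.4, "happy": 8.3, "rain": 5.0, "war": 1.8, "lost": 2.8}

    def average_happiness(text: str) -> float:
        """Frequency-weighted average happiness over words found in the lexicon."""
        counts = Counter(word.strip(".,!?").lower() for word in text.split())
        total = sum(n for w, n in counts.items() if w in happiness)
        if total == 0:
            return float("nan")
        return sum(happiness[w] * n for w, n in counts.items() if w in happiness) / total

    print(average_happiness("We lost the war, but love and rain kept us happy."))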

Goodreads: A Computational Study

27 May 2022, 16:00-18:00
Online only.

Prof. Melanie Walsh (University of Washington)

Abstract
This lecture examines how Goodreads users define, discuss, and debate "classic" literature by computationally analyzing and close reading more than 120,000 user reviews. We begin by exploring how crowdsourced tagging systems like those found on Goodreads have influenced the evolution of genre among readers and amateur critics, and we highlight the contemporary value of the "classics" in particular. We identify the most commonly tagged "classic" literary works and find that Goodreads users have curated a vision of literature that is less diverse, in terms of the race and ethnicity of authors, than many U.S. high school and college syllabi. Drawing on computational methods such as topic modeling, we point to some of the forces that influence readers' perceptions, such as schooling and what we call the classic industry: industries that benefit from the reinforcement of works as classics in other media and domains, such as film, television, publishing, and e-commerce (e.g., Goodreads and Amazon). We also highlight themes that users commonly discuss in their reviews (e.g., boring characters) and writing styles that often stand out in them (e.g., conversational and slangy language). Throughout, we make the case that computational methods and internet data, when combined, can help literary critics capture the creative explosion of reader responses and critique algorithmic culture's effects on literary history.
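
As a rough illustration of the kind of computational method mentioned in the abstract (and emphatically not the authors' actual pipeline), the sketch below fits a tiny topic model to a few invented review snippets using scikit-learn.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Invented review snippets standing in for real Goodreads reviews.
    reviews = [
        "Read this classic in high school, the characters felt so boring.",
        "A boring school assignment, I only finished it for class.",
        "The prose is beautiful and the story is timeless, a true classic.",
        "Timeless themes and beautiful writing, loved every page.",
    ]

    # Bag-of-words representation, then a two-topic LDA model.
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(reviews)
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(X)

    # Print the top words per topic.
    vocab = vectorizer.get_feature_names_out()
    for topic_idx, weights in enumerate(lda.components_):
        top_words = [vocab[i] for i in weights.argsort()[-5:][::-1]]
        print(f"topic {topic_idx}: {', '.join(top_words)}")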

Media Content Analysis and Culturomics

30 May 2022, 09:30-13:00 (Aula Magna)
31 May 2022, 09:00-13:00 (room B2)
Both online and in person (DIAG department, Sapienza University)

Prof. Nello Cristianini (University of Bristol)

Abstract
We will review case studies in which various types of textual content have been used to reveal insights about cultural aspects of society, as well as the origins of this method. These include studies based on social media and on contemporary and historical newspapers, the latter from the UK, the US, Italy, and Slovenia. Some attention will also be devoted to the analysis of Wikipedia access data, product sales, and book content. The general techniques will be based on simple statistics, but some of the work will involve NLP tools, such as parsers, in order to generate network data. We will also review the cultural roots of the methodology as it is applied today. Time permitting, we will address studies of which cultural biases a machine can absorb from data, giving a new perspective on an old problem in archival science: the problem of bias in archival content. The problems facing computer scientists involved in cultural analytics are not those traditionally seen in other parts of computer science, and are better understood in the humanities.
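
To give a sense of how NLP tools can turn raw text into network data (a deliberately simplified stand-in for the parser-based pipelines discussed in the lecture, not a reproduction of them), the sketch below links named entities that co-occur in the same sentence, using spaCy and networkx; the text snippet is invented.

    from itertools import combinations

    import networkx as nx
    import spacy  # requires the small English model: python -m spacy download en_core_web_sm

    nlp = spacy.load("en_core_web_sm")

    # Invented snippet standing in for a newspaper article.
    text = (
        "The European Commission met in Brussels on Monday. "
        "Italy and Slovenia discussed a joint cultural programme with the Commission."
    )

    # Build a co-occurrence network: entities mentioned in the same sentence get an edge.
    graph = nx.Graph()
    for sentence in nlp(text).sents:
        entities = {ent.text for ent in sentence.ents}
        for a, b in combinations(sorted(entities), 2):
            weight = graph.get_edge_data(a, b, {}).get("weight", 0)
            graph.add_edge(a, b, weight=weight + 1)

    for a, b, data in graph.edges(data=True):
        print(f"{a} -- {b} (weight {data['weight']})")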