PARLASPEECH TUTORIAL @cs2Italy

When. Wednesday, 15 January 2025

Where. Department of Sociology and Social Research, Via Giuseppe Verdi 26, Trento (Italy)

 

Motivation

Analyzing parliamentary debates is important across many research domains. Beyond their relevance to political science, datasets of this kind offer valuable insights into how a language and its associated culture have evolved over time. Over the past two centuries in particular, societies around the world have undergone profound transformations. From the shift from absolute monarchies to democracies through two world wars, humanity has witnessed a series of pivotal historical events. Most of these milestones, along with the broader spectrum of political and social life, are chronicled in parliamentary records.

Many research groups worldwide have developed and shared datasets of political debates in different languages, spanning diverse areas of study such as religion [3], gender [11], multilinguality [2], and more.

Two notable datasets, GerParCor [1] and IPSA [6], comprise German and Italian parliamentary records, respectively, spanning three centuries and five nations. Similarly, siParl [10], DutchParl [8], and the Polish Parliamentary Corpus [9] are collections of political debates in Slovenian, Dutch, and Polish, respectively. Since the establishment of the European Union, the debates of the European Parliament have been accessible in multiple languages, offering a valuable resource for machine translation [7].

Furthermore, most of the recent data already available in electronic format is gathered in ParlaMint [4], a large collection of comparable corpora of parliamentary debates from 29 European countries and autonomous regions, covering at least the period from 2015 to 2022 and containing over 1 billion words. The CLARIN research infrastructure, which maintains and updates ParlaMint, also organises the ParlaCLARIN workshop [5], which is usually co-located with the International Conference on Language Resources and Evaluation (LREC).

 

The tutorial

This tutorial will first give an overview of what parliamentary debates contain in terms of documents, and of how they have been used in linguistics and social science studies. Then, we will list and analyse the existing parliamentary corpora available around the world and how they were collected, in terms of both methodology and formats. Finally, we will focus on the extraction of the corpus that covers the longest time span, IPSA, which contains the debates of the Italian Parliament from 1848 to 2022. Using examples from the data and tools, we will show how the original documents are parsed using OCR, as well as the techniques adopted to clean up the resulting texts and to associate each speech with the corresponding politician, using the power of Linked Open Data.

  

Alessio Palmero Aprosio

Alessio Palmero Aprosio is an associate professor at the University of Trento, Italy. Until June 2024 he was a senior technologist at Fondazione Bruno Kessler in Trento (where he is now a fellow researcher), as part of the Digital Humanities unit. He studied Mathematics at the University of Pavia, and in 2014 he received his PhD in Information Technology from the University of Milan. He is currently working on text simplification, hate speech recognition, and the analysis and classification of legal documents.

  

 References

  More information: a.palmeroaprosio@unitn.it