PARLASPEECH TUTORIAL @cs2Italy
When. Wednesday, 15 January, 2025
Where. Department of Sociology and Social Research, Via Giuseppe Verdi 26, Trento (Italy)
Motivation
Analyzing parliamentary debates is important across many research domains. Beyond their relevance to political science, datasets of this kind offer valuable insights into how a language and its associated culture have evolved over time. Over the past two centuries in particular, societies worldwide have undergone profound transformations. From the shift from absolute monarchies to democracies through two world wars, humanity has witnessed a series of pivotal historical events. Most of these milestones, as well as the broader spectrum of political and social life, are chronicled in parliamentary records.
Many research groups worldwide have developed and shared datasets of political debates in different languages, spanning diverse areas of study such as religion [3], gender [11], multilinguality [2], and more.
Two notable datasets, GerParCor [1] and IPSA [6], comprise German and Italian parliamentary records, respectively, spanning three centuries and five nations. Similarly, siParl [10], DutchParl [8], and the Polish Parliamentary Corpus [9] collect political debates in Slovenian, Dutch, and Polish, respectively. Since the establishment of the European Union, the political debates of the European Parliament have been accessible in multiple languages, offering a valuable resource for machine translation [7].
Furthermore, most of the recent data already available in electronic format is gathered in ParlaMint [4], a large collection of comparable corpora of parliamentary debates from 29 European countries and autonomous regions, covering at least the period from 2015 to 2022 and containing over 1 billion words. The CLARIN research infrastructure, which maintains and updates ParlaMint, also organises the ParlaCLARIN workshop [5], which is usually co-located with the International Conference on Language Resources and Evaluation (LREC).
The tutorial
This tutorial will first give an overview of what parliamentary debates contain in terms of documents, and of how they have been used in linguistics and social science studies in the past. Then, we will list and analyse the existing parliamentary corpora available around the world and how they are collected, in terms of both methodology and formats. Finally, we will focus on the extraction of the corpus that covers the longest time span, IPSA, which contains the debates of the Italian Parliament from 1848 to 2022. Using examples from the data and tools, we will show how the original documents are parsed using OCR, and the techniques adopted to clean up the resulting texts and to associate each speech with the corresponding politician using Linked Open Data.
Alessio Palmero Aprosio
Alessio Palmero Aprosio is an associate professor at the University of Trento, Italy. Until June 2024 he was a senior technologist at Fondazione Bruno Kessler in Trento (where he is now a fellow researcher), in the Digital Humanities unit. He studied Mathematics at the University of Pavia, and in 2014 he received his PhD in Information Technology from the University of Milan. He is currently working on text simplification, hate speech recognition, and the analysis and classification of legal documents.
References
[1] Giuseppe Abrami, Mevlüt Bagci, Leon Hammerla, and Alexander Mehler. German Parliamentary Corpus (GerParCor). In Proceedings of the Language Resources and Evaluation Conference, pages 1900–1906, Marseille, France, June 2022. European Language Resources Association.
[2] Paul Bayley. Cross-cultural perspectives on parliamentary discourse. Cross-Cultural Perspectives on Parliamentary Discourse, pages 1–390, 2004.
[3] Jennifer E Cheng. Islamophobia, Muslimophobia or racism? Parliamentary discourses on Islam and Muslims in debates on the minaret ban in Switzerland. Discourse & Society, 26(5):562–586, 2015.
[4] Tomaž Erjavec et al. Multilingual comparable corpora of parliamentary debates ParlaMint 4.0, 2023. Slovenian language resource repository CLARIN.SI.
[5] Darja Fišer, Maria Eskevich, and David Bordon. Proceedings of the IV workshop on creating, analysing, and increasing accessibility of parliamentary corpora (ParlaCLARIN) @ LREC-COLING 2024. In Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024, 2024.
[6] Valentino Frasnelli and Alessio Palmero Aprosio. There’s something new about the Italian parliament: The IPSA corpus. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 16037–16046, Torino, Italia, May 2024. ELRA and ICCL.
[7] Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79–86, Phuket, Thailand, September 13-15 2005.
[8] Maarten Marx and Anne Schuth. DutchParl. The parliamentary documents in Dutch. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta, May 2010. European Language Resources Association (ELRA).
[9] Maciej Ogrodniczuk and Bartłomiej Nitoń. New developments in the Polish parliamentary corpus. In Proceedings of the Second ParlaCLARIN Workshop, pages 1–4, Marseille, France, May 2020. European Language Resources Association.
[10] Andrej Pančur and Tomaž Erjavec. The siParl corpus of Slovene parliamentary proceedings. In Proceedings of the Second ParlaCLARIN Workshop, pages 28–34, Marseille, France, May 2020. European Language Resources Association.
[11] Aglaia Paoletti. La presenza femminile nelle assemblee parlamentari: Per un’analisi comparata. Il Politico, 56(1 (157)):77–96, 1991.
More information: a.palmeroaprosio@unitn.it