
Revealing the Hidden Treasures of Parliamentary Proceedings with NLP

Seminar
Nikola Ljubešić
Wednesday, September 11, 2024, 1:00 pm – 2:00 pm

ABSTRACT / Parliaments serve as the cornerstone of democracy, ensuring the political representation of citizens. Despite their empirical relevance, parliamentary studies have often limited their scope to a single parliamentary body or a small group of parliaments analyzed in a comparative perspective. Two recent developments offer the opportunity to change this. The first is the availability of open, comparable parliamentary data through the ParlaMint project, which covers transcripts of 26 European national parliaments, comprising over 7 million speeches given in more than 20 languages. The second is progress in natural language processing, which allows for automatic and consistent enrichment of vast textual data across languages, as well as significantly improved processing and enrichment of large quantities of speech recordings. In this talk, I will present two follow-up projects to ParlaMint that exploit this new research landscape.

The first project, ParlaCAP, focuses on enriching each of the 7 million parliamentary speeches, regardless of language, via multilingual language models. Each text will be automatically labeled with the topic discussed and the sentiment expressed, which will help elucidate differences in agenda setting across the 26 European national parliaments, revisiting the theory of “core issues” receiving prioritized attention in European democracies. This confirmatory work will be extended by asking to what extent tone, primarily negativity in legislative debates, differs between countries, and how this variation in communication relates to agenda diversity.

The second project, ParlaSpeech, focuses on ensuring the availability of the original, spoken modality of parliamentary debates. By aligning the recordings with the textual transcripts, we ensure the availability of large quantities of aligned spoken and textual material from the public domain. I will showcase the usefulness of this data, combined with recent improvements in speech processing, by applying a pre-trained speech model to automatically identify disfluencies in speech that are not part of the official parliamentary transcripts. This will enable us to revisit an information-theoretic take on the function of disfluencies in spoken communication.

BIO / Nikola Ljubešić is a senior researcher from the Jožef Stefan Institute in Ljubljana. He is also affiliated with the Faculty of Computer and Information Science of the University of Ljubljana, and the Institute of Contemporary History in Ljubljana. His research interests lie in the areas of natural language processing, computational linguistics and computational social science, with a strong focus on the South-Slavic linguistic and cultural area. His current research foci are benchmarking of large language models, their adaptation to language variation, and their exploitation in answering research questions on large data collections.