1st Workshop on Validating Social Text-as-Data Methods (ValiSTAD)
23 June 2025 @ ICWSM (Copenhagen)
💡 Keynote: Valerie Hase
Note the important changes!
🚨 Workshop time: 13:00 to 17:00. The joint workshop with R2CSS was not possible for logistical reasons :(
📍 Location: Room 2.3.015, Building A (AAU), A. C. Meyers Vænge 15, 2450 København, Denmark
New Schedule:
13:00-13:10 Welcome
13:10-13:50 Invited Talk by Valerie Hase
13:50-14:20 Talk by Paramita Ray
14:20-14:50 Talk by Joachim Baumann
14:50-15:10 Coffee break ☕️
15:10-16:10 Invited Session by Taimoor Khan, Lorraine Saju, and Arnim Bleier [LINK]
16:10-16:55 Open Session on the Open Challenges of Validation in TADA methods
16:55-17:00 Closing
Invited Talk: Valerie Hase (LMU Munich)
At the Intersection of NLP and Social Science: Moving forward with Text-as-Data Methods
Computational methods – such as web scraping for collecting digital traces and natural language processing (NLP) for their classification – offer great promise. Yet, recent work has raised important concerns about quality issues in computational social science (CSS), for example regarding reproducibility, replicability, and validity as key dimensions of quality. In this talk, I discuss these concerns with a focus on NLP methods. I derive recommendations for defining, assessing, and improving the quality of CSS – including what we know from existing research, which methods and tools are available, and where gaps remain.
Invited Session: Taimoor Khan, Lorraine Saju, and Arnim Bleier (GESIS)
Online Computational Reproducibility
Reproducibility is the bedrock of credible and transparent research in the social sciences and beyond. In this one-hour, interactive talk and live demo, we will unpack the key concepts, common pitfalls, and hands-on practices of computational reproducibility. You’ll tour the GESIS Methods Hub—a curated repository of executable research workflows—and see how its analyses can be launched seamlessly on the cloud via Jupyter4NFDI and MyBinder.org. Whether you’re new to reproducible research or looking to streamline your existing workflows, this session will equip you with practical tools and real-world examples to make your code and data openly reusable.
Motivation
The rapid growth of text data from web and social media platforms has transformed research across disciplines and has accelerated the field of Computational Social Science. Yet, the reliability and validity of text-as-data methods remain critical challenges. To explore frameworks, techniques, and best practices for validating the tools and models used to analyze social text data, we are hosting the 1st Workshop on Validating Social Text-as-Data Methods (ValiSTAD).
This hands-on workshop brings together scholars and practitioners to discuss a range of topics, including quantitative, qualitative, and mixed-methods validation techniques, addressing bias and fairness, and evaluating methods across diverse contexts and languages. Through interactive sessions, case studies, and panel discussions, the workshop aims to advance rigorous, ethical, and reproducible approaches to social text analysis. Its main goal is to bring together researchers and ideas from the computational linguistics/Natural Language Processing (NLP) and text-as-data communities and the social science community, fostering collaboration and catalyzing further interdisciplinary validation efforts between these communities.
Call for papers
We solicit summaries of already published papers and, especially, work-in-progress and early-stage research in this non-archival workshop. Each submission will receive feedback from 2-3 members of the program committee who are experts in Computational Social Science.
Submission format
We solicit extended abstracts of up to 2 pages in length (2 columns, with one extra page for figures and tables, plus unlimited references) using the ICWSM template (Overleaf Link). We recommend that papers include some preliminary results, though other types of contributions, such as position and perspective papers, are also welcome. Submissions must be anonymized: do not put author names or affiliations at the start of the paper, and do not include funding or other acknowledgments in papers submitted for review. References to the authors' own prior relevant work should be included, but should not be identified as the authors' own work.
Workshop Themes. The main theme of this workshop is the validation practices employed by the CSS community when using Text-as-Data methods. However, we wish to facilitate discussion that goes beyond the methods researchers use and also reflects on the broader context these methods are embedded in, particularly the data they use and are applied to. We aim to solicit papers on, and to discuss, the following topics and questions:
Data: How do we incentivize and facilitate the study of marginalized and hard-to-reach populations? Is web and social media data helpful in accessing such populations? How do we validate synthetic data, especially LLM-generated data?
Platforms: In which contexts or use cases is research with social media data appropriate, and when does it augment traditional social data sources? How much can we trust social media data obtained from sources like X, Facebook, etc.? When measuring variables like political stance, can we really control the sampling? How can we build provenance estimates of potentially biased social media samples?
Models: When using computational methods in CSS, e.g., automated text classification techniques using recent Large Language Models (LLMs), how do we account for sociodemographic biases? How do we audit non-open-source LLMs, including open-weight LLMs like LLaMA, which do not divulge their full training data?
Reproducibility: Given the ephemeral nature of social media data, how do we ensure that our datasets are reusable? How do we ensure reproducibility when using stochastic LLMs?
Substantively, the workshop is interested in, but not limited to, the following topics:
Studying political communication with NLP and Text-as-data methods (e.g. topic classification, position measurement)
Modeling and validating complex social constructs (e.g. populism, polarization, identity) with NLP methods
Auditing social biases in Text-as-data methods
Validation guidelines and checklists for CSS+NLP
Important dates
🔜 Submission deadline: May 5, 2025
Notification of acceptance: May 21, 2025
Final papers due: June 6, 2025
Workshop: June 23, 2025
Submission site:
https://openreview.net/group?id=icwsm.org/ValiSTAD/2025/Conference
Organizers
(contact) Christopher Klamm is an interdisciplinary PhD student at the University of Mannheim and a researcher at the Cologne Center for Comparative Politics. His research interests lie in NLP and Computational Political Science, with a focus on automated rhetoric and framing analysis.
Indira Sen is a junior professor at the University of Mannheim. Her research is about understanding and characterizing the measurement quality of social science constructs like political attitudes and abusive content from digital traces, combining NLP and social science measurement theory.
Gabriella Lapesa is a junior professor for Responsible Data Science and Machine Learning at Heinrich Heine University Düsseldorf and a team lead for Data Science Methods in the Department of Computational Social Science at the Leibniz Institute for the Social Sciences (GESIS) in Cologne.
Simone Paolo Ponzetto holds the chair of Information Systems III (Enterprise Data Analysis) at the University of Mannheim, where he leads the Natural Language Processing and Information Retrieval group. His research interests include text understanding and its interdisciplinary application in the Social Sciences and Humanities.
Ines Rehbein is a postdoctoral researcher in the Data and Web Science Group at the University of Mannheim, working with Prof. Dr. Simone Paolo Ponzetto on topics related to the development and application of Natural Language Processing methods to research questions in the Computational Social Sciences.
Dennis Assenmacher is a postdoctoral researcher in the Department of Computational Social Science at the Leibniz Institute for the Social Sciences (GESIS). His research focuses on harmful communication in online media (hate speech, abusive language, dehumanization, social bots), more specifically on developing state-of-the-art computational NLP methods to detect this harmful content.
Arnav Arora is an interdisciplinary PhD researcher at the University of Copenhagen. His research focuses on online safety, leveraging NLP to analyze misinformation, media framing, LLM interpretability, and AI ethics.
Sponsorship