CounseLLMe: Investigating Human-Large Language Model conversations under the lens of cognitive science and complex systems
By bringing together experts from diverse fields represented at HHAI2025, the "CounseLLMe" workshop aims to foster interdisciplinary collaboration and advance our understanding of human-LLM interactions.
Scientific Abstract: The "CounseLLMe" workshop aims to explore the dynamics of interactions between humans and Large Language Models (LLMs) by employing methodologies from Natural Language Processing (NLP), text analysis, and complex network theory. As LLMs become increasingly integrated into various applications, understanding the nuances of these interactions is crucial for enhancing communication efficacy, ensuring ethical standards, and improving user experience. This half-day workshop at HHAI2025 will serve as a platform for researchers and practitioners to present findings, share methodologies, and discuss challenges related to analyzing human-LLM dialogues.
Attendance of this workshop requires registration for the HHAI2025 conference.
Organizers:
Prof. Massimo Stella: Professor (recruited by direct call from abroad) and Senior Researcher at the Department of Psychology and Cognitive Science, University of Trento. PI of CogNosco Lab; his research focuses on cognitive data science, AI psychometrics, and mathematical psychology.
Edith Haim: PhD Candidate at CogNosco Lab, University of Trento. Her work applies complex network techniques and language modelling to capture creativity in humans and LLMs.
Edoardo De Duro: PhD Candidate at CogNosco Lab, University of Trento. He specializes in LLMs and mental health, working on graphical user interfaces, network psychometrics and text analysis.
CounseLLMe's Topics and Issues:
Analyzing conversational structures between humans and LLMs using NLP techniques.
Applying text analysis methods to assess the quality and coherence of LLM-generated responses.
Utilizing complex network theory to model and understand the flow of information in human-LLM dialogues.
Identifying ethical considerations and biases in human-LLM interactions.
Developing metrics for evaluating the effectiveness of human-LLM communication.
Identifying crucial psychological outcomes for human-LLM conversations.
Intended Duration: Half-day event, 10 June, 9:00-13:00.
Keynote Speakers
Prof. Alessio Palmero Aprosio,
Associate Professor, DIPSCO, University of Trento.
Big Data in the era of LLMs: The example of parliamentary debates
Analyzing parliamentary debates holds considerable value across numerous research disciplines. Beyond their evident importance for political science, these datasets offer profound insights into the historical evolution of languages and their related cultural contexts. Over the past two centuries, global society has experienced significant transformations, including the shift from absolute monarchies to democratic governments and the upheavals resulting from two world wars. Parliamentary records meticulously document these critical historical events and capture the broader landscape of political and social developments.
What types of documents do parliamentary debates include? In what ways have these resources been utilized in linguistic and social science studies? The talk provides a comprehensive overview of parliamentary debates, describing their characteristics as documents and highlighting their scholarly relevance. Existing parliamentary corpora from around the world will be surveyed, with an exploration of the methodologies and formats involved in their collection and organization. Special attention will be dedicated to the IPSA corpus, which comprises the parliamentary debates of the Italian Parliament from 1848 to 2022. Through practical examples drawn from the data and associated computational tools, the presentation will illustrate the digitization of original documents using Optical Character Recognition (OCR). Additionally, the talk will discuss techniques that employ LLMs to clean and refine these digitized texts, ensuring that each parliamentary speech is accurately linked to the corresponding politician by leveraging advances in Linked Open Data technology.
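As a rough illustration of the kind of pipeline discussed in this keynote, the sketch below shows how an LLM could be prompted to clean OCR output and how a speaker name could be matched to a Linked Open Data entity on Wikidata. The model name, prompt and lookup strategy are illustrative assumptions, not the actual IPSA tooling.

# Illustrative sketch only: not the actual IPSA pipeline.
# Assumes the `openai` and `requests` packages; the model name is a placeholder.
import requests
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def clean_ocr_text(raw_page: str) -> str:
    """Ask an LLM to fix OCR noise without rephrasing the speech."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Fix OCR errors (broken words, stray characters) in this "
                        "Italian parliamentary speech. Do not rephrase or summarize."},
            {"role": "user", "content": raw_page},
        ],
    )
    return response.choices[0].message.content

def link_speaker_to_wikidata(speaker_name: str) -> str | None:
    """Look up a politician's Wikidata identifier (Linked Open Data)."""
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbsearchentities", "search": speaker_name,
                "language": "it", "format": "json"},
        timeout=10,
    )
    hits = resp.json().get("search", [])
    return hits[0]["id"] if hits else None  # a Wikidata Q-identifier, or None

if __name__ == "__main__":
    print(link_speaker_to_wikidata("Camillo Benso di Cavour"))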
Edith Haim,
PhD candidate, CogNosco Lab, DIPSCO, University of Trento.
Forma mentis networks predict creativity ratings of short texts via interpretable artificial intelligence in human and GPT-simulated raters
Creativity is a fundamental skill of human cognition. We use textual forma mentis networks (TFMNs) to extract network features (semantic/syntactic associations) and emotional features from approximately one thousand human- and GPT-3.5-generated stories. Using Explainable Artificial Intelligence (XAI), we test whether features related to Mednick’s associative theory of creativity can explain the creativity ratings assigned by humans and by GPT-3.5. Using XGBoost, we examine three scenarios: (i) human ratings of human stories, (ii) GPT-3.5 ratings of human stories, and (iii) GPT-3.5 ratings of GPT-generated stories. Our findings reveal that GPT-3.5 ratings differ significantly from human ratings, not only in terms of correlations but also in the feature patterns identified with XAI methods. GPT-3.5 favours “its own” stories and rates human stories differently from humans. Feature importance analysis with SHAP scores shows that: (i) network features are the stronger predictors both of human creativity ratings and of GPT-3.5’s ratings of human stories; (ii) emotional features play a greater role than semantic/syntactic network structure when GPT-3.5 rates its own stories. These quantitative results underscore key limitations in GPT-3.5’s ability to align with human assessments of creativity. We emphasise the need for caution when using GPT-3.5 to assess and generate creative content, as it does not yet capture the nuanced complexity that characterises human creativity.
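A minimal sketch of the XGBoost-plus-SHAP analysis described above follows; the feature names and the synthetic data are placeholders standing in for TFMN-derived features and real creativity ratings, not the study's data.

# Hedged sketch of the XGBoost + SHAP feature-importance analysis.
# Feature names and synthetic data are placeholders, not the study's data.
import numpy as np
import pandas as pd
import shap
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
features = pd.DataFrame({
    # network features from a textual forma mentis network (placeholders)
    "degree_centrality": rng.random(1000),
    "clustering": rng.random(1000),
    "shortest_path_length": rng.random(1000),
    # emotional features (placeholders)
    "positive_valence": rng.random(1000),
    "anxiety": rng.random(1000),
})
ratings = rng.uniform(1, 5, size=1000)  # creativity ratings on a 1-5 scale

model = XGBRegressor(n_estimators=200, max_depth=3)
model.fit(features, ratings)

# SHAP scores quantify each feature's contribution to the predicted rating
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(features)
mean_abs_shap = np.abs(shap_values).mean(axis=0)
print(dict(zip(features.columns, mean_abs_shap.round(3))))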
Edoardo De Duro, PhD candidate, CogNosco Lab, DIPSCO, University of Trento.
Introducing CounseLLMe: A dataset of simulated mental health dialogues for comparing LLMs like Haiku, LLaMAntino and ChatGPT against humans
We introduce CounseLLMe, a multilingual, multi-model dataset of 400 simulated mental health counselling dialogues between two state-of-the-art Large Language Models (LLMs). These conversations - of 20 quips each - were generated either in English (using OpenAI’s GPT-3.5 and Claude 3 Haiku) or in Italian (with Claude 3 Haiku and LLaMAntino), with prompts tuned with the help of a professional psychotherapist. We investigate the resulting conversations by comparing them against human mental health conversations on the same topic of depression. To compare linguistic features, knowledge structure and emotional content between LLMs and humans, we employ textual forma mentis networks, i.e. cognitive networks where nodes represent concepts and links indicate syntactic or semantic relationships between concepts in the dialogues’ quips. We find that the emotional structure of LLM-LLM English conversations matches that of humans in terms of patient-therapist trust exchanges, i.e. 1 in 5 LLM-LLM quips contain trust along 10 conversational turns versus the 24 rate found in humans. ChatGPT’s and Haiku’s simulated English patients can also reproduce human feelings of conflict and pessimism. However, human patients display non-negligible levels of anger/frustration that are missing in LLMs. Italian LLM conversations are worse at reproducing human patterns. All LLM-LLM conversations reproduce the human syntactic patterns of increased absolutist pronoun usage in patients and of second-person, trust-inducing pronoun usage in therapists. Our results indicate that LLMs can realistically reproduce several aspects of human patient-therapist conversations, and we thus release CounseLLMe as a public dataset for novel data-informed opportunities in mental health and machine psychology.
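As an illustration of the per-quip emotional analysis mentioned above, the sketch below counts how many quips in a dialogue contain at least one trust-related word. The tiny trust lexicon and the example dialogue are toy placeholders; the study itself relies on textual forma mentis networks and a full emotion lexicon rather than this word list.

# Minimal sketch: fraction of quips containing a trust-related word.
# The lexicon and the dialogue are toy placeholders, not CounseLLMe data.
TRUST_WORDS = {"trust", "safe", "support", "help", "understand", "together"}

def quip_contains_trust(quip: str) -> bool:
    tokens = {t.strip(".,!?;:").lower() for t in quip.split()}
    return bool(tokens & TRUST_WORDS)

def trust_rate(dialogue: list[str]) -> float:
    """Fraction of quips in a dialogue that contain at least one trust word."""
    flagged = sum(quip_contains_trust(q) for q in dialogue)
    return flagged / len(dialogue)

example_dialogue = [
    "I feel like everything is falling apart.",
    "I hear you, and I want you to know this is a safe space.",
    "Maybe. I just don't know who to trust anymore.",
    "We can work through this together, one step at a time.",
]
print(trust_rate(example_dialogue))  # 0.75 for this toy example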
Accepted Contributed Talks
Katherine Abramski, PhD Candidate, KDDLab - ISTI-CNR and University of Pisa.
The "LLM World of Words" English free association norms generated by large language models
Free associations have been extensively used in psychology and linguistics for studying how conceptual knowledge is organized. Recently, the potential of applying a similar approach for investigating the knowledge encoded in LLMs has emerged, specifically as a method for investigating LLM biases. However, the absence of large-scale LLM-generated free association norms that are comparable with human-generated norms is an obstacle to this research direction. To address this, we create a new dataset of LLM-generated free association norms modeled after the "Small World of Words" (SWOW) human-generated norms with nearly 12,000 cue words. We prompt three LLMs (Mistral, Llama3, and Haiku) with the same cues as those in SWOW to generate three novel comparable datasets, the "LLM World of Words" (LWOW). From the datasets, we construct network models of semantic memory that represent the conceptual knowledge possessed by humans and LLMs. We demonstrate how these datasets can be used for investigating implicit biases in humans and LLMs, such as the harmful gender stereotypes that are prevalent both in society and in LLM outputs.
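A minimal sketch, assuming toy cue-response pairs rather than the actual LWOW or SWOW norms, of how free associations can be turned into a semantic network with networkx:

# Sketch of turning free-association norms into a semantic network.
# The cue-response pairs are toy examples, not LWOW or SWOW data.
import networkx as nx

# each cue elicits a few associated responses, as in the free association task
norms = {
    "doctor": ["nurse", "hospital", "medicine"],
    "nurse": ["hospital", "care", "doctor"],
    "engineer": ["machine", "math", "builder"],
}

G = nx.Graph()
for cue, responses in norms.items():
    for response in responses:
        # an edge records that the response was given for the cue;
        # repeated cue-response pairs increment the edge weight
        if G.has_edge(cue, response):
            G[cue][response]["weight"] += 1
        else:
            G.add_edge(cue, response, weight=1)

# simple measures over the resulting model of semantic memory
print(nx.degree_centrality(G))
if nx.has_path(G, "doctor", "engineer"):
    print(nx.shortest_path(G, "doctor", "engineer"))
else:
    print("no path between 'doctor' and 'engineer' in this toy network")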
Erica Cau, PhD Candidate, KDDLab - ISTI-CNR and University of Pisa.
Language-Driven Opinion Dynamics in Agent-Based Simulations with LLMs
Humans can form opinions based on their inner emotions, beliefs, and ideas; as inherently social beings, they tend to share and discuss these opinions with others. Online Social Networks (OSNs) have fostered this process, offering a virtual square for debates involving individuals worldwide. However, interactions in OSNs can be affected by human and algorithmic biases, turning them into fertile grounds for polarization and radicalization. The evolution of opinions has been studied in recent decades in the field of computational social science by creating agent-based models (ABMs) whose agents interact and change opinions according to different criteria. Despite the many insights gained, these models still suffer from the limitations inherent in the mathematical approach. Moreover, they do not consider language, a critical yet underexplored component in opinion evolution. Recently, LLM-based agent models have been used in opinion dynamics simulations, with promising results in replicating human behaviour: LLM agents have simulated echo chamber formation on specific network topologies and opinion fragmentation when prompted to act with a strong confirmation bias. We propose a novel ABM framework in which agents are connected in a network and engage in multiple rounds of discussion. Each agent holds an opinion on a given input statement, quantified on a Likert scale ranging from 0 (strongly disagree) to 6 (strongly agree). Agents engage in rounds of pairwise discussion in which one agent acts as the Discussant while the other takes the role of Opponent, producing an argument that attempts to persuade the Discussant. After the interaction, the Discussant determines whether to adjust its opinion by ±1 or to maintain its stance, without relying on predefined update rules. We simulated a mean-field scenario with 140 agents and examined three initial opinion distributions: uniform, polarized, and unbalanced, with agents holding only negative opinions. Experiments leveraged Mistral-7B and Llama3, prompted to simulate discussions on the Theseus’ Ship paradox, a thought experiment that forced agents into well-reasoned discussions, reducing the LLMs’ bias toward scientific truth. The discussion topic was framed both positively and negatively, i.e., the boat is the same / is different. Our results show that LLM agents exhibit agreement biases and sycophancy. Moreover, they frequently employ logical fallacies in argumentation. Simulations exhibit a consistent trend: the agents align with the framing of the presented statements and of their interacting partners, rapidly converging toward consensus, mostly around the agreement position. While Mistral agents show selective acceptance, favoring agreement with the initial statement, Llama agents demonstrate greater openness to opposing views, resulting in slower but inevitable agreement. Especially with a negatively skewed opinion distribution, agents tend to remain biased toward the negative side without reaching a positive stance. Linguistic analysis further confirms that LLM agents generate and are influenced by logical fallacies. These fallacies significantly impact opinion shifts, with Llama agents proving more susceptible than Mistral. Our findings align with prior work, reinforcing concerns that LLMs can be persuaded by flawed reasoning and may perpetuate misinformation in human-AI interactions.
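The sketch below illustrates the Discussant/Opponent interaction loop described above, with opinions on a 0-6 Likert scale and ±1 updates. The LLM calls are stubbed out with placeholder functions, and the sampling and update logic are simplified assumptions rather than the authors' exact implementation; plugging in Mistral-7B or Llama3 is left to the reader.

# Hedged sketch of the LLM-agent opinion dynamics loop (placeholder LLM calls).
import random

STATEMENT = "The ship of Theseus, rebuilt plank by plank, is still the same ship."

def llm_argue(opponent_opinion: int) -> str:
    """Placeholder for the Opponent generating a persuasive argument via an LLM."""
    return f"On the claim '{STATEMENT}', rating it {opponent_opinion}/6, I argue that ..."

def llm_decide(discussant_opinion: int, argument: str) -> int:
    """Placeholder for the Discussant deciding to shift its opinion by -1, 0 or +1.
    A real implementation would prompt the LLM with the argument and its current opinion."""
    return random.choice([-1, 0, 1])

# 140 agents with a uniform initial opinion distribution on the 0-6 Likert scale
opinions = [random.randint(0, 6) for _ in range(140)]

for _ in range(1000):  # rounds of pairwise discussion (mean-field: any pair can meet)
    discussant, opponent = random.sample(range(len(opinions)), 2)
    argument = llm_argue(opinions[opponent])
    shift = llm_decide(opinions[discussant], argument)
    opinions[discussant] = min(6, max(0, opinions[discussant] + shift))

print("mean opinion after the simulation:", sum(opinions) / len(opinions))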
PROGRAMME:
Meet-n-greet - 10 min on-site registration
20 mins + 10 min Q/A - In-person keynote by an expert in computational linguistics. Total time: 0.5 h.
Social coffee break (self-organized). Total time: 0.5 h.
10 mins + 5 min Q/A each - Live presentations focusing on theoretical/applied conversational frameworks and methodologies, including prompt engineering, text analysis, bias detection in LLMs, cognitive performance in LLMs and ethical considerations of human-LLM interactions. We can accommodate a maximum of 8 live presentations. Total time: 2h.
20 mins + 10 min Q/A - Remote synchronous keynote by an expert in human-LLM interactions. Total time: 0.5 h.
30 mins of Panel Discussion with the audience about ethics, limitations, future developments and exciting potential applications of LLMs in human conversations. Total time: 0.5 h.
Abstract Sifting: The organizers will sift scientific abstracts and manage acceptance/rejection communications. No posters are envisaged for this event. In case of conflicts of interest, the organizers will seek the external help of a selected scientific committee.
Target Audience Size: Approximately 20-40 participants, including researchers, practitioners, and students interested in NLP, complex networks, and human-AI interactions.
Acknowledgements:
CogNosco Lab - Check our research on AI, LLMs and cognitive science on our main website.
CALCOLO - Project funded by Fondazione VRT - CARITRO. Main website here.
Supporting References:
Stella, M., Hills, T. T., & Kenett, Y. N. (2023). Using cognitive psychology to understand GPT-like models needs to extend beyond human biases. Proceedings of the National Academy of Sciences, 120(43), e2312911120.
De Duro, E. S., Improta, R., & Stella, M. (2024). Introducing CounseLLMe: A dataset of simulated mental health dialogues for comparing LLMs like Haiku, LLaMAntino and ChatGPT against humans. Accepted in Emerging Trends in Drugs, Addictions and Health, Springer.