Manuscript submitted for review
Abstract: Large Language Models (LLMs) enable new possibilities for qualitative research at scale, including coding and data annotation. While multi-agent systems (MAS) can emulate human coding workflows, their benefits over single-agent coding remain poorly understood. We conducted an experimental study of how agent persona and temperature shape consensus-building and coding accuracy of dialog segments based on a codebook with 8 codes. Our open-source MAS mirrors deductive human coding through structured agent discussion and consensus arbitration. Using six open-source LLMs (with 3 to 32 billion parameters) and 18 experimental configurations, we analyze over 77,000 coding decisions against a gold-standard dataset of human-annotated transcripts from online math tutoring sessions. Temperature significantly impacted whether and when consensus was reached across all six LLMs. MAS with multiple personas (including neutral, assertive, or empathetic) significantly delayed consensus in four out of six LLMs compared to uniform personas. In three of those LLMs, higher temperatures significantly diminished the effects of multiple personas on consensus. However, neither temperature nor persona pairing led to robust improvements in coding accuracy. Single agents matched or outperformed the MAS consensus in most conditions. Only one model (OpenHermesV2:7B) and code category showed above-chance gains from MAS deliberation when the temperature was 0.5 or lower, and especially when the agents included at least one assertive persona. Qualitative analysis of MAS collaboration for these configurations suggests that MAS may nonetheless aid in narrowing ambiguous code applications, which could improve codebooks and human-AI coding. We contribute new insight into the limits of LLM-based qualitative methods, challenging the notion that diverse MAS personas lead to better outcomes. We open-source our MAS and experimentation code.
Link: http://arxiv.org/abs/2507.11198
Project Link: https://github.com/conradborchers/llm-ta-consensus
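The consensus workflow described above (persona-conditioned agents proposing codes over discussion rounds, with arbitration if they fail to agree) can be sketched in miniature. This is a toy illustration, not the paper's actual implementation: the codebook labels, persona prompts, and the `propose_code` stub standing in for an LLM call are all assumptions for demonstration.

```python
import random

# Illustrative 8-code codebook (the paper's actual codes may differ)
CODEBOOK = ["praise", "hint", "question", "explanation",
            "correction", "prompt", "feedback", "other"]

# Hypothetical persona prompts used to condition each agent
PERSONAS = {
    "neutral": "You weigh evidence impartially.",
    "assertive": "You defend your initial code unless clearly refuted.",
    "empathetic": "You seek common ground with your partner.",
}

def propose_code(segment, persona, temperature, rng):
    """Stand-in for an LLM call: pick a code for a dialog segment.
    Higher temperature adds randomness, loosely mimicking sampling."""
    base = CODEBOOK[sum(map(ord, segment)) % len(CODEBOOK)]
    if rng.random() < temperature * 0.5:
        return rng.choice(CODEBOOK)
    return base

def mas_consensus(segment, personas=("neutral", "assertive"),
                  temperature=0.5, max_rounds=3, seed=0):
    """Run a toy agent discussion: stop when all votes match, or
    arbitrate by majority proposal after max_rounds."""
    rng = random.Random(seed)
    history = []
    for round_no in range(1, max_rounds + 1):
        votes = [propose_code(segment, p, temperature, rng) for p in personas]
        history.append(votes)
        if len(set(votes)) == 1:
            return votes[0], round_no, history  # consensus reached
    # Arbitration: fall back to the most frequent proposal overall
    flat = [v for votes in history for v in votes]
    return max(set(flat), key=flat.count), max_rounds, history

code, rounds, _ = mas_consensus("Can you try the next step?", temperature=0.2)
```

With real LLM backends, `propose_code` would pass the persona prompt and the discussion history to the model at the given sampling temperature; the experiment then varies `temperature` and the `personas` pairing across configurations.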
Long Paper at AIED conference 2025, ML track
Abstract: Large Language Models (LLMs) have demonstrated fluency in text generation and reasoning tasks. With recent advances in chain-of-thought and agent-based systems enabling complex reasoning and task execution, AIED research has probed the ability of LLMs to automate qualitative analysis, including thematic analysis, previously achievable only through human reasoning. Complex, manual, and time-consuming variants, such as inductive thematic analysis (iTA), also seem to be within reach of automation through LLMs, though studies using LLMs for iTA have yielded mixed results so far. Previous work especially lacks methodological standards for assessing the reliability and validity of LLM-derived iTA themes and outcomes. Therefore, in this paper, we propose a method for assessing the quality of automated iTA systems based on consistency with human coding and contribute a benchmark dataset for such an evaluation. We employ an expert blind-review approach to compare two iTA outputs: one conducted by domain experts and another fully automated with an agent-based system built on the Claude 3.5 Sonnet LLM. We discuss the implications of this method. Results indicate consistency between the automated system's output and manual iTA, as rated by a team of four expert researchers on a highly domain-specific dataset of CSCL definitions. Our findings contribute evidence that LLMs can enhance or partially automate labor-intensive iTA tasks common in AIED research and beyond.
Link: https://doi.org/10.35542/osf.io/ez8wc_v1
OSF Project (data & analysis): https://doi.org/10.17605/OSF.IO/FBW3N
Long Paper at ISLS 2025, CSCL track
Abstract: The maturity of scientific communities can be measured by their degree of alignment on a common conceptual vision to guide research efforts. Recently, scholars in the CSCL community have called for increased efforts to establish shared theoretical frameworks to accelerate progress in the field of CSCL. The purpose of this study is to investigate whether the CSCL community demonstrates alignment on key concepts, such as CSCL and collaborative learning. We conducted a survey with a sample of 50 CSCL scholars, prompting them to respond to open-ended questions related to key concepts in the field and future directions for the community. Findings revealed that while broad agreement exists on the importance of collaborative processes, definitions and interpretations of these key concepts diverge substantially, highlighting conceptual fragmentation. By identifying the extent of conceptual alignment and contention, this study offers a foundation for building robust theoretical frameworks to advance collective progress in CSCL research.
Link: https://doi.org/10.35542/osf.io/pxrdf_v2
OSF Project (data & analysis): https://doi.org/10.17605/OSF.IO/6EF3J
Preregistration: https://doi.org/10.17605/OSF.IO/ZMFQ6
Workshop Paper at LAK 2025, From Data to Discovery Workshop
Abstract: Thematic analysis (TA) is a method used to identify, examine, and present themes within data. TA is often a manual, multistep, and time-intensive process requiring collaboration among multiple researchers. TA's iterative subtasks, including coding data, identifying themes, and resolving inter-coder disagreements, are especially laborious for large data sets. Given recent advances in natural language processing, Large Language Models (LLMs) offer the potential for automation at scale. Recent literature has explored the automation of isolated steps of the TA process, tightly coupled with researcher involvement at each step. Research using such hybrid approaches has reported issues in LLM generations, such as hallucination, inconsistent output, and technical limitations (e.g., token limits). This paper proposes a multi-agent system that differs from previous systems in that an orchestrator LLM agent spins off multiple LLM sub-agents for each step of the TA process, mirroring all of the steps previously done manually. Beyond more accurate analysis results, this agent-based iterative coding process is also expected to increase the transparency of the analysis, as analytical stages are documented step-by-step. We study the extent to which such a system can perform a full TA without human supervision. Preliminary results indicate human-quality codes and themes, based on their alignment with human-derived codes. Nevertheless, we still observe differences in coding complexity and thematic depth. Despite these differences, the system provides critical insights on the path to TA automation while maintaining consistency, efficiency, and transparency in future qualitative data analysis, enabled by our open-source datasets, coding results, and analysis code.
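The orchestrator pattern described above (one coordinating agent delegating each TA step to a sub-agent and logging every stage for transparency) can be sketched as follows. The step names and the `run_subagent` stub are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative TA stages, loosely following common thematic-analysis phases
TA_STEPS = ["familiarize", "code", "generate_themes",
            "review_themes", "define_themes", "report"]

def run_subagent(step, payload):
    """Stand-in for delegating one TA step to an LLM sub-agent.
    A real system would prompt a model here; this stub just records
    what it was asked to do and emits a tagged result."""
    return {"step": step, "input_size": len(payload), "output": f"{step}:done"}

def orchestrate(documents):
    """Orchestrator: run each TA step in sequence, keeping a log of
    every stage so the full analysis trail stays transparent."""
    log = []
    payload = " ".join(documents)
    for step in TA_STEPS:
        result = run_subagent(step, payload)
        log.append(result)
        payload = result["output"]  # each step consumes the previous output
    return log

trail = orchestrate(["transcript A", "transcript B"])
```

The step-by-step log is what gives the pipeline its auditability: each sub-agent's input and output are recorded, so a researcher can inspect where codes or themes originated rather than receiving only the final report.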