Objective of the study
This study addresses recent advancements in Artificial Intelligence by investigating the reliability and validity of ChatGPT, a state-of-the-art large language model (LLM), as an autonomous rater of pragmatic competence. To this end, it applies the Assessment Battery for Communication (ABaCo), a clinically validated instrument for assessing pragmatic ability across multiple facets of human communication.
Rationale for using ChatGPT
A key motivation for selecting ChatGPT is its widespread accessibility and adoption among diverse user groups, in contrast to more specialized systems such as Gemini. This distinction is methodologically relevant, as ChatGPT’s open-ended and flexible conversational architecture aligns well with exploratory research aims and facilitates broader generalization of findings. Although this study focuses primarily on ChatGPT, future research may incorporate direct comparisons with alternative AI systems.
Rationale for using ABaCo
The choice of ABaCo is motivated by its binary scoring system and its targeted focus on pragmatic ability, which make it particularly well suited to automated scoring by current LLMs. Unlike more complex assessment protocols, ABaCo's structure is readily adaptable to the operational capacities of ChatGPT, enabling precise, dimension-specific coding.
Materials
We use a dataset of responses collected from 21 older adult participants who completed the full ABaCo battery. Each response was independently evaluated by expert human raters and is now coded by ChatGPT through a dedicated set of prompts, with both human and model raters adhering strictly to the official clinical coding criteria.
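The study does not specify the technical channel through which the prompts are delivered to ChatGPT. As a minimal sketch, assuming delivery via the OpenAI Python API, each ABaCo item and participant response could be submitted together with the coding instructions and parsed into a binary score; the model name, prompt wording, and code_response helper below are illustrative assumptions, not the study's actual materials.

```python
# Minimal sketch of how each ABaCo response could be submitted to ChatGPT
# for binary coding. Model name, prompt text, and helper are illustrative
# assumptions; the study's actual prompts follow the ABaCo manual.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CODING_INSTRUCTIONS = (
    "You are rating pragmatic competence using the ABaCo coding criteria. "
    "For the item and participant response below, answer with a single digit: "
    "1 if the response meets the criterion, 0 if it does not."
)

def code_response(item_description: str, participant_response: str) -> int:
    """Ask the model for a binary ABaCo score (0 = incorrect, 1 = correct)."""
    completion = client.chat.completions.create(
        model="gpt-4o",  # assumed model version, not fixed by the study description
        messages=[
            {"role": "system", "content": CODING_INSTRUCTIONS},
            {"role": "user", "content": f"Item: {item_description}\nResponse: {participant_response}"},
        ],
        temperature=0,  # keep scoring as deterministic as possible
    )
    answer = completion.choices[0].message.content.strip()
    return 1 if answer.startswith("1") else 0
```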
Procedures
For each ABaCo item, both human experts and ChatGPT independently assign binary scores (0 = incorrect, 1 = correct) in accordance with the ABaCo manual, with each relevant dimension coded separately. Inter-rater agreement between human coders and ChatGPT is quantified using both direct percentage agreement and Cohen’s kappa coefficient, thereby adjusting for chance agreement. Additionally, a qualitative analysis is conducted on discrepant cases to systematically identify recurring patterns and underlying causes of disagreement.
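As a concrete illustration of the agreement statistics, the sketch below computes both measures from two parallel arrays of binary codes. The array contents are placeholders, and scikit-learn's cohen_kappa_score is used as one standard implementation of Cohen's kappa, defined as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance.

```python
# Illustrative computation of inter-rater agreement between human raters and
# ChatGPT, assuming the binary codes are stored as two parallel arrays of 0/1.
import numpy as np
from sklearn.metrics import cohen_kappa_score

human_codes = np.array([1, 1, 0, 1, 0, 1, 1, 0])    # placeholder data
chatgpt_codes = np.array([1, 1, 0, 0, 0, 1, 1, 1])  # placeholder data

# Direct percentage agreement: proportion of items coded identically.
percent_agreement = np.mean(human_codes == chatgpt_codes)

# Cohen's kappa adjusts observed agreement for the agreement expected
# from each rater's marginal frequencies.
kappa = cohen_kappa_score(human_codes, chatgpt_codes)

print(f"Percentage agreement: {percent_agreement:.2%}")
print(f"Cohen's kappa: {kappa:.3f}")

# Discrepant cases can then be extracted for the qualitative analysis.
discrepant_items = np.where(human_codes != chatgpt_codes)[0]
```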
Expected contributions
The findings contribute to the theoretical understanding of how conversational AI systems process and interpret complex, multi-dimensional communicative acts. The results clarify the boundaries of current LLMs' inferential, social, and pragmatic capabilities, directly informing debates in psycholinguistics, cognitive science, and human–AI interaction. The study also sets a methodological precedent for integrating clinical assessment frameworks with advanced AI systems, encouraging further exploration into the cognitive plausibility and limitations of LLM-based reasoning.