Understanding Social Reasoning
in Language Models
with Language Models

Kanishk Gandhi*, Jan-Philipp Franken*, Tobias Gerstenberg, Noah D. Goodman

ABSTRACT

As Large Language Models (LLMs) become increasingly integrated into our everyday lives, understanding their ability to comprehend human mental states becomes critical for ensuring effective interactions. However, despite recent attempts to assess the Theory-of-Mind (ToM) reasoning capabilities of LLMs, the degree to which these models align with human ToM remains a nuanced topic of exploration. This is primarily due to two distinct challenges: (1) inconsistent results from previous evaluations, and (2) concerns about the validity of existing evaluation methodologies. To address these challenges, we present a novel framework for procedurally generating evaluations with LLMs by populating causal templates. Using our framework, we create a new social reasoning benchmark (BigToM) for LLMs, consisting of 25 controls and 5,000 model-written evaluations. We find that human participants rate the quality of our benchmark higher than previous crowd-sourced evaluations and comparable to expert-written evaluations. Using BigToM, we evaluate the social reasoning capabilities of a variety of LLMs and compare model performance with human performance. Our results suggest that GPT-4 has ToM capabilities that mirror human inference patterns, though less reliably, while other LLMs struggle.

Framework for procedural generation

Figure 1: Illustration of our template-based Theory-of-Mind (ToM) scenarios. [a] The causal template and an example scenario, including the agent's prior desires, beliefs, and actions, as well as a causal event that changes the state of the environment. [b] Testing Forward Belief inference by manipulating an agent's percepts. TB = True Belief. FB = False Belief. [c] Forward Action inference from an agent's percepts, which requires additional inferences over unknown beliefs. [d] Backward Belief inference, which requires joint inferences over unknown percepts and beliefs from an agent's observed actions. Error bars for human performance represent 95% bootstrapped confidence intervals of the mean.
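To make the structure in panel [a] concrete, the following is a minimal sketch of how one filled causal template and the belief manipulation in panel [b] could be represented; this is not the authors' released code, and all class, field, and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ToMScenario:
    """One filled causal template (the variables from panel [a])."""
    agent: str         # protagonist of the story
    desire: str        # what the agent wants
    percept: str       # what the agent initially observes
    belief: str        # belief the agent forms from that percept
    causal_event: str  # event that changes the state of the environment
    action: str        # action the agent takes, given their belief

def forward_belief_conditions(s: ToMScenario) -> dict:
    """Panel [b]: manipulate whether the agent perceives the causal event.
    TB = the agent notices the change, so their belief tracks the new state.
    FB = the agent misses the change, so their belief is now outdated."""
    return {
        "TB": f"{s.causal_event} {s.agent} notices this.",
        "FB": f"{s.causal_event} {s.agent} does not notice this.",
    }
```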

Method for generating evaluations

Figure 2: [a] Three-stage method for generating evaluations: Building a causal template for the domain (left). Creating a prompt template from the causal graph and populating template variables using a language model (middle). Composing test items by combining template variables (right). [b] Crowdworker ratings of our model-generated Theory-of-Mind (ToM) evaluations compared to crowd-sourced ToM evaluations and expert-written ToM evaluations. Error bars represent 95% bootstrapped confidence intervals of the mean.
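The sketch below illustrates the three stages in panel [a] with a generic text-completion call; the prompt wording, the `llm` callable, and the helper names are assumptions for illustration, not the paper's released pipeline.

```python
# Stage 1: a causal template whose slots mirror the causal graph.
TEMPLATE = (
    "{agent} wants {desire}. {agent} sees {percept} and believes {belief}. "
    "Then {causal_event}."
)

FILL_PROMPT = (
    "Fill in a consistent everyday scenario for these variables, one per line "
    "as 'name: value': agent, desire, percept, belief, causal_event"
)

def fill_template(llm) -> dict:
    """Stage 2: populate the template variables with a language model.
    `llm` is assumed to map a prompt string to a completion string."""
    completion = llm(FILL_PROMPT)
    return dict(line.split(": ", 1) for line in completion.strip().splitlines())

def compose_items(variables: dict) -> list[dict]:
    """Stage 3: compose test items by recombining the same variables,
    e.g. a true-belief and a false-belief variant of one story."""
    story = TEMPLATE.format(**variables)
    agent = variables["agent"]
    return [
        {"condition": "true_belief", "story": f"{story} {agent} notices this."},
        {"condition": "false_belief", "story": f"{story} {agent} does not notice this."},
    ]
```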

Model Performance

Figure 3: Model performance (0-shot) across conditions. [a] Forward Belief inferences from percepts to beliefs. TB = True Belief. FB = False Belief. [b] Forward Action inferences from an agent’s percepts which require additional inferences over unknown beliefs. [c] Backward Belief inferences over unknown percepts and beliefs from an agent’s observed actions. Error bars for humans represent 95% bootstrapped confidence intervals of the mean.
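For reference, a 0-shot evaluation over such items can be as simple as the loop below; the item fields, the two-option multiple-choice format, and the answer-parsing rule are assumptions rather than the paper's exact scoring procedure.

```python
def evaluate_zero_shot(llm, items: list[dict]) -> float:
    """Accuracy of a model on two-option ToM questions, with no examples in the prompt."""
    correct = 0
    for item in items:
        prompt = (
            f"{item['story']}\n"
            f"Question: {item['question']}\n"
            f"(a) {item['options'][0]}\n"
            f"(b) {item['options'][1]}\n"
            "Answer with (a) or (b):"
        )
        reply = llm(prompt).strip().lower()  # assumed: llm(prompt) -> string
        choice = item["options"][0] if reply.startswith(("(a)", "a")) else item["options"][1]
        correct += int(choice == item["answer"])
    return correct / len(items)
```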

Table of Results

Table 3: Model performance for each method. TB = True Belief. FB = False Belief. † = without initial belief. ‡ = with initial belief.