Sally-Anne “false belief” test - Simon Baron-Cohen (cousin of Sacha Baron Cohen, the actor who plays Borat)
“Sally has a basket. Anne has a box. Sally has a marble. She puts the marble into her basket. Sally goes out for a walk. Anne takes the marble out of the basket and puts it into the box. Now Sally comes back. She wants to play with her marble. Where will Sally look for the marble?” - Most children do not pass this test until around the age of four.
“Theory of mind refers to the capacity to understand other individuals by ascribing mental states to them. A theory of mind includes the understanding that others' beliefs, desires, intentions, emotions, and thoughts may be different from one's own.” - Wikipedia
Mechanistic Empathy: Decoding by Contrasting Layers (DoLa) as a Functional Analogue to Inhibitory Control in Theory of Mind
https://github.com/goldspruce/DoLa
Large Language Models (LLMs) frequently hallucinate. This phenomenon occurs when a model prioritizes high-probability statistical associations over factual truth or the specific context. Decoding by Contrasting Layers (DoLa), a decoding strategy, has been reported to mitigate this issue by contrasting the next-token predictions of later transformer layers against those of earlier layers, in effect subtracting the early-layer signal from the late-layer one (Chuang et al., 2024). This paper proposes that DoLa is not merely a noise-reduction technique. Instead, it functions as an analogue to the cognitive mechanism of inhibitory control required for Theory of Mind (ToM) in human psychology.
In cognitive science, successful ToM performance requires the inhibition of the egocentric or reality-centric perspective to allow the representation of another's mental state to emerge. We argue that the early layers of an LLM function as the reflexive cognitive substrate. These layers encode strong statistical priors that mirror a reality bias, such as the most common association with an object. The later layers, conversely, encode context-dependent reasoning but remain polluted by these initial reflexes. By mathematically subtracting the early-layer logits, DoLa effectively decouples the raw statistical reflexes of the model from its higher-order reasoning.
This subtraction operation mirrors the decoupling mechanism in the human brain, where the inhibition of the default mode allows for the simulation of alternative perspectives. We demonstrate that applying contrastive decoding to ToM tasks in LLMs significantly improves performance on False Belief benchmarks. These findings suggest that hallucination in AI and egocentric bias in humans may share a common structural etiology: the failure to inhibit lower-order associations. They also offer a novel framework for Mechanistic Theory of Mind and posit that empathy in artificial systems may be an emergent property of subtractive processing rather than additive complexity.
Keywords: Large Language Models, Theory of Mind, DoLa, Inhibitory Control, Mechanistic Interpretability, False Belief Task.
Chuang, Y.-S., Xie, Y., Luo, H., Kim, Y., Glass, J., & He, P. (2024). DoLa: Decoding by contrasting layers improves factuality in large language models. arXiv. https://arxiv.org/abs/2309.03883
Li, X. L., Holtzman, A., Fried, D., Liang, P., Eisner, J., Hashimoto, T., Zettlemoyer, L., & Lewis, M. (2023). Contrastive decoding: Open-ended text generation as optimization. arXiv. https://arxiv.org/abs/2210.15097
Prelec, D., Seung, H. S., & McCoy, J. (2017). A solution to the single-question crowd wisdom problem. Nature, 541(7638), 532–535. https://doi.org/10.1038/nature21054
SAMPLE CODE
Why Harrisburg Wins: Even though Philadelphia got more votes (65%), it fell short of the share the crowd predicted for it (85%). Harrisburg received fewer votes (35%), but it exceeded its predicted share (10%). The "Surprise" signal, the gap between actual and predicted votes, reveals the hidden expert knowledge.
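The arithmetic behind this result can be sketched in a few lines of Python. This is a simplified illustration rather than Prelec's full procedure: it scores each answer by actual share minus predicted share, using the percentages quoted above. The function name `surprisingly_popular` and the dictionary layout are assumptions made for the example.

```python
# Simplified "Surprisingly Popular" (SP) selection using the vote shares above.
# actual_share:    fraction of respondents who chose each answer.
# predicted_share: the crowd's forecast of each answer's vote share.

def surprisingly_popular(actual_share, predicted_share):
    """Return (winner, surprise), where surprise = actual - predicted per answer."""
    surprise = {
        answer: actual_share[answer] - predicted_share[answer]
        for answer in actual_share
    }
    # The SP answer is the one that beats its own prediction by the widest margin.
    winner = max(surprise, key=surprise.get)
    return winner, surprise

actual = {"Philadelphia": 0.65, "Harrisburg": 0.35}
predicted = {"Philadelphia": 0.85, "Harrisburg": 0.10}

winner, surprise = surprisingly_popular(actual, predicted)
for answer, delta in surprise.items():
    print(f"{answer}: {delta:+.2f}")
print("SP answer:", winner)
```

Philadelphia ends up with a negative surprise and Harrisburg with a positive one, so Harrisburg is selected despite receiving fewer votes.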
Recent research (such as DoLa and Contrastive Decoding) applies analogous logic to Large Language Models (LLMs) to reduce hallucinations.
In this analogy, the Layers of the Transformer act as the "Population."
The "Crowd" = Early Layers (e.g., Layer 2 of 32)
The early layers function like the uninformed majority. They rely on "n-gram probability" and superficial associations. When they see "Capital of Pennsylvania," they reflexively activate "Philadelphia" because those words appear together frequently in the training data.
Analogy: This is the "Predicted Vote" (The Baseline/Prior).
The "Expert" = Late Layers (e.g., Layer 32 of 32)
The late layers function like the informed minority. They have processed the full context and logic of the sentence. They activate "Harrisburg" because they have done the reasoning. However, they are still "polluted" by the signals from the early layers.
Analogy: This is the "Actual Vote" (The Mixture).
The "SP" Calculation in AI
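To find the truth, DoLa performs a mathematical operation analogous to Prelec's algorithm. It subtracts the early-layer score for each candidate token from the late-layer score, just as SP subtracts the predicted vote from the actual vote.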
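A toy version of this contrast is sketched below. It is a minimal illustration under stated assumptions, not the DoLa implementation: the probabilities are invented for the example, the function name `contrast_layers` is hypothetical, and the early layer is fixed by hand, whereas the DoLa paper selects it dynamically. The sketch does include a plausibility cutoff of the kind the paper describes, with `alpha=0.1` as an illustrative value, so that a token the late layers consider very unlikely cannot win on the ratio alone.

```python
import math

def contrast_layers(late_probs, early_probs, alpha=0.1):
    """Score each token by log p_late - log p_early.

    Tokens whose late-layer probability is below alpha * max(late_probs)
    are dropped before scoring (plausibility cutoff).
    """
    cutoff = alpha * max(late_probs.values())
    return {
        token: math.log(p_late) - math.log(early_probs[token])
        for token, p_late in late_probs.items()
        if p_late >= cutoff
    }

# Hypothetical next-token distributions for "The capital of Pennsylvania is".
# Early layers: the reflexive association. Late layers: a mixture that still
# leans toward Philadelphia.
early = {"Philadelphia": 0.899, "Harrisburg": 0.100, "Paris": 0.001}
late = {"Philadelphia": 0.55, "Harrisburg": 0.44, "Paris": 0.01}

scores = contrast_layers(late, early)
best = max(scores, key=scores.get)

for token, score in scores.items():
    print(f"{token}: {score:+.2f}")
print("Contrasted answer:", best)
```

With these made-up numbers the late layers alone would still pick Philadelphia, while the contrast picks Harrisburg. "Paris" is removed by the cutoff; without it, its tenfold rise from a near-zero early probability would give it the highest score.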
Conclusion: Just as Prelec subtracts the "Crowd's Expectation" to find the Expert Truth, DoLa techniques subtract the "Early Layer's Reflex" to find the Model's Reasoning. Both methods work by filtering out the "obvious" (but often wrong) statistical noise.
Here is the analogy mapping the Prelec SP Algorithm and Transformer Layers directly to the Sally-Anne Test from psychology.
This analogy works because the core challenge in all three scenarios is Inhibitory Control: the ability to suppress a strong, obvious signal (Access to Reality Factual Bias) to reveal a subtle, correct signal (Truth/Belief).
Sally puts a marble in the Basket and leaves.
Anne moves the marble to the Box.
Sally returns.
Question: Where will Sally look?
The "Child" Answer (Access to Reality Factual Bias): "The Box" (Because that is where it actually is).
The "Adult" Answer (Theory of Mind): "The Basket" (Because that is where she thinks it is).
In this framework, the Surprisingly Popular (SP) algorithm acts as the cognitive mechanism that allows an AI (or a human) to pass the test.
1. The "Naive" Layer = The Child (Reality Bias)
In Psychology: The child sees the marble in the Box. This signal is overwhelming. They cannot "un-know" reality.
In AI (Early Layers): The model sees the token "Box" associated with the marble's position in the text. The statistical correlation "Marble $\rightarrow$ Box" is extremely high.
In Prelec (The Crowd): The majority sees "Big City $\rightarrow$ Philadelphia." It is the obvious, surface-level answer.
2. The "Expert" Layer = The Confused Adult (Mixed State)
In Psychology: An adult knows the marble is in the Box, but also simulates Sally's mind (Basket). The adult holds both representations.
In AI (Late Layers): The model still knows the "Box" association (it hasn't forgotten the text), but it has computed the "Basket" logic. The probabilities are split (e.g., 60% Basket, 40% Box).
In Prelec (The Expert): The expert knows Harrisburg, but also knows everyone else will pick Philadelphia.
3. The "Subtraction" (SP) = Inhibitory Control
This is the key step. The algorithm subtracts the Naive signal from the Expert signal.
The "Box" Signal cancels out: Since both the Child and the Adult know the marble is in the Box, subtracting them removes the "Access to Reality Factual Bias."
The "Basket" Signal remains: Only the Adult knows (“understands”) the reality about the Basket. Therefore, the "Basket" becomes the Surprisingly Popular answer.
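The three steps above can be put into numbers for the Sally-Anne question. The sketch below is illustrative only: the Expert values reuse the 60% Basket / 40% Box split mentioned earlier, the Naive values are assumed, and `inhibit` is a hypothetical helper name for the SP-style subtraction.

```python
# Signals for the question "Where will Sally look?"
naive = {"Box": 0.90, "Basket": 0.10}   # reality-biased signal (assumed values)
expert = {"Box": 0.40, "Basket": 0.60}  # mixed state (60/40 split from the text)

def inhibit(expert_signal, naive_signal):
    """Subtract the Naive signal from the Expert signal, option by option."""
    return {
        option: expert_signal[option] - naive_signal[option]
        for option in expert_signal
    }

residual = inhibit(expert, naive)
answer = max(residual, key=residual.get)

for option, value in residual.items():
    print(f"{option}: {value:+.2f}")
print("Answer after inhibition:", answer)
```

Under these assumed values the "Box" signal is pushed below zero and the "Basket" signal is the only one left with a positive residual, which matches the Theory of Mind answer.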