Sally-Anne “false belief” test - Simon Baron-Cohen (cousin of Sacha Baron Cohen, of Borat fame)
“Sally has a basket. Anne has a box. Sally has a marble. She puts the marble into her basket. Sally goes out for a walk. Anne takes the marble out of the basket and puts it into the box. Now Sally comes back. She wants to play with her marble. Where will Sally look for the marble?” - Human children cannot pass this test until the age of four.
“Theory of mind refers to the capacity to understand other individuals by ascribing mental states to them. A theory of mind includes the understanding that others' beliefs, desires, intentions, emotions, and thoughts may be different from one's own.” Wikipedia
Mechanistic Empathy: Decoding by Contrasting Layers (DoLa) as a Functional Analogue to Inhibitory Control in Theory of Mind
Large Language Models (LLMs) frequently struggle with hallucination. This phenomenon occurs when a model prioritizes high-probability statistical associations over factual truth or specific context. Recent advancements in interpretability, specifically Decoding by Contrasting Layers (DoLa), have mitigated this issue by subtracting the logits of early transformer layers from those of later layers (Chuang et al., 2024). This paper proposes that DoLa is not merely a noise-reduction technique. Instead, it functions as an analogue to the cognitive mechanism of inhibitory control required for Theory of Mind (ToM) in human psychology.
In cognitive science, successful ToM performance requires the inhibition of the egocentric or reality-centric perspective to allow the representation of another's mental state to emerge. We argue that the early layers of an LLM function as the reflexive cognitive substrate. These layers encode strong statistical priors that mirror a reality bias, such as the most common association with an object. The later layers, conversely, encode context-dependent reasoning but remain polluted by these initial reflexes. By mathematically subtracting the early-layer logits, DoLa effectively decouples the raw statistical reflexes of the model from its higher-order reasoning.
This subtraction operation mirrors the decoupling mechanism in the human brain, where the inhibition of the default mode allows for the simulation of alternative perspectives. We demonstrate that applying contrastive decoding to ToM tasks in LLMs significantly improves performance on False Belief benchmarks. These findings suggest that hallucination in AI and egocentric bias in humans may share a common structural etiology, which is the failure to inhibit lower-order associations. These findings offer a novel framework for Mechanistic Theory of Mind and posit that empathy in artificial systems may be an emergent property of subtractive processing rather than additive complexity.
Keywords: Large Language Models, Theory of Mind, DoLa, Inhibitory Control, Mechanistic Interpretability, False Belief Task.
Chuang, Y.-S., Xie, Y., Luo, H., Kim, Y., Glass, J. R., & He, P. (2024). DoLa: Decoding by contrasting layers improves factuality in large language models. arXiv. https://arxiv.org/abs/2309.03883
Li, X. L., Holtzman, A., Fried, D., Liang, P., Eisner, J., Hashimoto, T., Zettlemoyer, L., & Lewis, M. (2023). Contrastive decoding: Open-ended text generation as optimization. arXiv. https://arxiv.org/abs/2210.15097
Prelec, D., Seung, H. S., & McCoy, J. (2017). A solution to the single-question crowd wisdom problem. Nature, 541(7638), 532–535. https://doi.org/10.1038/nature21054
LLM Theory of Mind Subtraction Test
Introduction
We start from an insight drawn from work on social networks and crowd wisdom described by Prelec et al. (2017). In that study, the authors asked a standard crowdsourced factual question ("Is X true, yes/no?") and also asked a related question that required insight into the thinking of others, i.e., Theory of Mind (ToM).
See also the MIT News summary of Prelec et al. (2017), "Better wisdom from crowds."
Here is the explanation of the Prelec Surprisingly Popular (SP) algorithm using the classic Philadelphia example, followed by the analogy to Transformer models.
The core insight of Dražen Prelec’s algorithm is that the "Truth" is not necessarily the answer with the most votes. Instead, the Truth is the answer that is more popular than the crowd expects it to be.
The Scenario: You ask a large group of people: "What is the capital of Pennsylvania?"
The Crowd (Majority): Incorrectly believes it is Philadelphia because it is a famous city. They also assume everyone else agrees with them.
The Experts (Minority): Correctly identify Harrisburg. Crucially, they possess "Meta-Knowledge": they know the answer is Harrisburg, but they also know that most people will mistakenly guess Philadelphia.
The Mechanism:
The algorithm asks two questions:
What is the answer? (The Vote)
What do you think other people will say? (The Prediction)
The algorithm looks for the answer where Actual Vote > Predicted Vote.
Why Harrisburg Wins: Philadelphia received more actual votes (65%) but underperformed its predicted vote share (85%). Harrisburg received fewer actual votes (35%) but outperformed its predicted share (10%). The "Surprise" signal (actual minus predicted) reveals the hidden expert knowledge.
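The two-question mechanism above can be sketched in a few lines of Python. The vote shares are the toy numbers from the Philadelphia example, and the function name is illustrative:

```python
# Toy sketch of Prelec's Surprisingly Popular rule, using the
# Philadelphia/Harrisburg numbers from the text.
actual_votes = {"Philadelphia": 0.65, "Harrisburg": 0.35}
predicted_votes = {"Philadelphia": 0.85, "Harrisburg": 0.10}

def surprisingly_popular(actual, predicted):
    """Return the answer whose actual vote share most exceeds its predicted share."""
    surprise = {ans: actual[ans] - predicted[ans] for ans in actual}
    return max(surprise, key=surprise.get)

print(surprisingly_popular(actual_votes, predicted_votes))  # Harrisburg
```

Philadelphia's surprise is 0.65 − 0.85 = −0.20, Harrisburg's is 0.35 − 0.10 = +0.25, so the minority answer wins.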
Recent research (such as DoLa and Contrastive Decoding) applies this exact logic to Large Language Models (LLMs) to detect hallucinations.
In this analogy, the Layers of the Transformer act as the "Population."
The "Crowd" = Early Layers (e.g., Layer 2 of 32)
The early layers function like the uninformed majority. They rely on "n-gram probability" and superficial associations. When they see "Capital of Pennsylvania," they reflexively activate "Philadelphia" because those words appear together frequently in the training data.
Analogy: This is the "Predicted Vote" (The Baseline/Prior).
The "Expert" = Late Layers (e.g., Layer 32 of 32)
The late layers function like the informed minority. They have processed the full context and logic of the sentence. They activate "Harrisburg" because they have done the reasoning. However, they are still "polluted" by the signals from the early layers.
Analogy: This is the "Actual Vote" (The Mixture).
The "SP" Calculation in AI
To find the truth, DoLa performs a mathematical operation equivalent to Prelec's algorithm: each candidate token is scored by the contrast between the final layer and an early layer,

$$\text{Score}(x) = \log p_{\text{late}}(x) - \log p_{\text{early}}(x)$$

and the token that maximizes this contrast, rather than the raw late-layer probability, is selected.
Conclusion: Just as Prelec subtracts the "Crowd's Expectation" to find the Expert Truth, DoLa techniques subtract the "Early Layer's Reflex" to find the Model's Reasoning. Both methods work by filtering out the "obvious" (but often wrong) statistical noise.
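This contrast can be illustrated on a toy two-token vocabulary. All logit values below are invented for illustration, and this is only a sketch of the contrast score, not the full DoLa implementation (which also selects the contrast layer dynamically and filters implausible tokens):

```python
import math

def log_softmax(logits):
    """Convert a dict of raw logits into log-probabilities."""
    m = max(logits.values())
    z = math.log(sum(math.exp(v - m) for v in logits.values())) + m
    return {tok: v - z for tok, v in logits.items()}

def dola_contrast(late_logits, early_logits):
    """DoLa-style score: log p_late(token) - log p_early(token)."""
    lp_late, lp_early = log_softmax(late_logits), log_softmax(early_logits)
    return {tok: lp_late[tok] - lp_early[tok] for tok in lp_late}

# Hypothetical layer logits: the early layer reflexively favors "Philadelphia";
# the final layer is still split and even leans slightly the same way.
early = {"Philadelphia": 4.0, "Harrisburg": 1.0}
late = {"Philadelphia": 2.5, "Harrisburg": 2.0}

scores = dola_contrast(late, early)
print(max(scores, key=scores.get))  # Harrisburg
```

Note that the late layer alone would still pick "Philadelphia" here; only the subtraction of the early-layer reflex surfaces "Harrisburg."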
Here is the analogy mapping the Prelec SP Algorithm and Transformer Layers directly to the Sally-Anne Test from psychology.
This analogy works perfectly because the core challenge in all three scenarios is Inhibitory Control: the ability to suppress a strong, obvious signal (Access to Reality Factual Bias) to reveal a subtle, correct signal (Truth/Belief).
Sally puts a marble in the Basket and leaves.
Anne moves the marble to the Box.
Sally returns.
Question: Where will Sally look?
The "Child" Answer (Access to Reality Factual Bias): "The Box" (Because that is where it actually is).
The "Adult" Answer (Theory of Mind): "The Basket" (Because that is where she thinks it is).
In this framework, the Surprisingly Popular (SP) algorithm acts as the cognitive mechanism that allows an AI (or a human) to pass the test.
1. The "Naive" Layer = The Child (Reality Bias)
In Psychology: The child sees the marble in the Box. This signal is overwhelming. They cannot "un-know" reality.
In AI (Early Layers): The model sees the token "Box" associated with the marble's position in the text. The statistical correlation "Marble $\rightarrow$ Box" is extremely high.
In Prelec (The Crowd): The majority sees "Big City $\rightarrow$ Philadelphia." It is the obvious, surface-level answer.
2. The "Expert" Layer = The Confused Adult (Mixed State)
In Psychology: An adult knows the marble is in the Box, but also simulates Sally's mind (Basket). The adult holds both representations.
In AI (Late Layers): The model still knows the "Box" association (it hasn't forgotten the text), but it has computed the "Basket" logic. The probabilities are split (e.g., 60% Basket, 40% Box).
In Prelec (The Expert): The expert knows Harrisburg, but also knows everyone else will pick Philadelphia.
3. The "Subtraction" (SP) = Inhibitory Control
This is the magic step. The algorithm subtracts the Naive signal from the Expert signal:
$$\text{Result} = \text{Expert (Basket + Box)} - \text{Naive (Box)}$$
The "Box" signal cancels out: since both the Child and the Adult register that the marble is in the Box, the subtraction removes the "Access to Reality" factual bias.
The "Basket" signal remains: only the Adult represents ("understands") Sally's belief about the Basket. Therefore, the "Basket" becomes the Surprisingly Popular answer.
DoLa (Decoding by Contrasting Layers) is not just a mathematical trick; it is a mechanical implementation of Theory of Mind.
By subtracting the early layers, the LLM is effectively saying:
"I will ignore what the marble's location suggests (Access to Reality Factual Bias/Box) and focus only on what the context implies about Sally (Belief/Basket)."
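This inhibition can be sketched numerically. The expert split matches the 60/40 example from the text; the naive (early-layer) probabilities are assumed:

```python
import math

# Assumed toy distributions: the "naive"/early signal is dominated by reality,
# the "expert"/late signal holds both representations (60% Basket, 40% Box).
naive = {"Box": 0.9, "Basket": 0.1}
expert = {"Box": 0.4, "Basket": 0.6}

# Subtract the naive log-probability from the expert log-probability.
contrast = {loc: math.log(expert[loc]) - math.log(naive[loc]) for loc in expert}
print(max(contrast, key=contrast.get))  # Basket
```

The shared "Box" evidence largely cancels, while the belief-only "Basket" evidence survives the subtraction.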
Possible Refinements
The convergence of DoLa, Theory of Mind (ToM), and Prelec’s Surprisingly Popular (SP) algorithm suggests a new frontier in AI research: "Epistemic Engineering."
Here are five brainstormed extensions to the DoLa approach. These move beyond simple "logit subtraction" into dynamic, structural, and training-based innovations.
1. Adaptive Layer Contrast
The Concept:
Current DoLa implementations subtract a fixed early layer (e.g., Layer 2 or Layer 16). However, the "Naive Child" (Reality Bias) doesn't live in the same layer for every task. For simple facts, the bias might be in Layer 2; for complex logic, the "trap" might be in Layer 20.
The Extension:
Implement "Adaptive Layer Contrast."
Mechanism: Instead of blindly subtracting Layer 2, you dynamically scan all previous layers to find the one with the Highest Entropy or Strongest Conflicting Signal relative to the final layer.
ToM Analogy: In a Sally-Anne task, you don't just inhibit your "inner child"; you inhibit the specific part of your brain that is screaming "The Box!" essentially locating the source of the Reality Bias before suppressing it.
Prelec Connection: This is equivalent to finding the specific sub-population that is most wrong (the "Super-Crowd") and using them as the baseline to maximize the SP signal.
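A sketch of the layer-selection step, assuming access to per-layer next-token distributions. Jensen-Shannon divergence is used here as one plausible measure of "strongest conflicting signal"; the distributions are toy values:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions over the same tokens."""
    m = {t: 0.5 * (p[t] + q[t]) for t in p}
    def kl(a, b):
        return sum(a[t] * math.log(a[t] / b[t]) for t in a if a[t] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pick_contrast_layer(layer_dists, final_dist):
    """Adaptive Layer Contrast (sketch): pick the early layer whose
    distribution conflicts most with the final layer."""
    divergences = [js_divergence(d, final_dist) for d in layer_dists]
    return max(range(len(divergences)), key=divergences.__getitem__)

# Hypothetical per-layer next-token distributions over {"Box", "Basket"}.
layers = [
    {"Box": 0.55, "Basket": 0.45},  # layer 0: near-uniform, little signal
    {"Box": 0.95, "Basket": 0.05},  # layer 1: strong reality reflex
]
final = {"Box": 0.40, "Basket": 0.60}
print(pick_contrast_layer(layers, final))  # 1
```

Layer 1, the one "screaming Box," is selected as the baseline to subtract.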
2. In-Process Inhibition (Activation Steering)
The Concept:
DoLa is a decoding trick—it happens at the very end. But what if we could perform surgery during the thinking process?
The Extension:
Use Activation Steering to perform "Inhibitory Control" inside the model.
Mechanism:
Run the "Sally-Anne" prompt through the model.
Identify the "Reality Vector" in the early layers (the direction in latent space that encodes "The Marble is in the Box").
Project out (mathematically remove) this vector from the hidden states of the later layers before they generate the final logits.
ToM Analogy: This is true "cognitive control." It’s not just biting your tongue (decoding); it’s actively forcing your brain to stop thinking about the reality so you can focus on the belief.
Research Link: This aligns with "Representation Engineering" (RepE), effectively creating a "Lobotomy for Bias."
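The projection step can be sketched with NumPy. The "reality vector" and hidden state are toy 3-dimensional stand-ins; real hidden states would be read from the model's residual stream:

```python
import numpy as np

def project_out(hidden, direction):
    """Remove the component of `hidden` along `direction` (the hypothetical
    "reality vector"), leaving the rest of the representation intact."""
    d = direction / np.linalg.norm(direction)
    return hidden - np.dot(hidden, d) * d

# Toy 3-d vectors (assumed): the first coordinate plays the role of the
# reality signal; the rest stand in for belief/context information.
reality_vector = np.array([1.0, 0.0, 0.0])
hidden_state = np.array([2.0, 0.5, -1.0])

steered = project_out(hidden_state, reality_vector)
print(steered)  # component along reality_vector is now zero
```

After the projection, `np.dot(steered, reality_vector)` is zero: the "reality" direction has been surgically inhibited while the orthogonal components pass through unchanged.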
3. Contrastive Prelec Loss
The Concept:
Currently, models are trained to minimize the error of the next token. They are not explicitly trained to develop a "Theory of Mind" or to separate "Crowd" from "Expert."
The Extension:
Introduce a "Contrastive Prelec Loss" function during Fine-Tuning.
Mechanism: Train the model with a dual objective:
Late Layers must maximize probability of the True Answer (Harrisburg/Basket).
Early Layers must maximize probability of the Common Misconception (Philadelphia/Box).
Goal: This forces the model to segregate "Hype" into the early layers and "Truth" into the late layers.
Result: It makes the "Expert - Naive" subtraction exponentially more powerful because the model has been structurally organized to store "Bias" and "Truth" in different places.
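One possible form of such a dual objective, shown with plain-Python cross-entropy for clarity. The loss shape and the `alpha` weighting are assumptions, not an established training recipe:

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target under a probability dict."""
    return -math.log(probs[target])

def contrastive_prelec_loss(early_probs, late_probs, misconception, truth, alpha=1.0):
    """Dual objective (sketch): late layers fit the true answer while
    early layers fit the common misconception, weighted by alpha."""
    return cross_entropy(late_probs, truth) + alpha * cross_entropy(early_probs, misconception)

# Toy predictions: early head leans to the misconception, late head to the truth.
early = {"Philadelphia": 0.8, "Harrisburg": 0.2}
late = {"Philadelphia": 0.3, "Harrisburg": 0.7}
print(round(contrastive_prelec_loss(early, late, "Philadelphia", "Harrisburg"), 3))
```

Minimizing this loss pushes "Hype" into the early layers and "Truth" into the late layers, which is exactly the segregation the Expert − Naive subtraction exploits.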
4. Surprisingly Popular Personas
The Concept:
DoLa treats "Layers" as the population. But we can also use "Simulated Agents" as the population to run a robust Prelec algorithm.
The Extension:
Run the SP algorithm across Prompted Personas instead of just layers.
Mechanism:
Agent A (The Child): Prompted: "You are a naive guesser. Rely on your first instinct."
Agent B (The Literal): Prompted: "You are a literal robot. Ignore context."
Agent C (The Expert): Prompted: "You are a careful reasoner."
Algorithm: Calculate the log-probabilities for all three. Use Agent A and B as the "Predicted Vote" (Baseline) and Agent C as the "Actual Vote."
ToM Application: This is useful for Deception Detection. If the "Naive" agent and the "Expert" agent agree, it's a boring fact. If they violently disagree, you have found a "Surprisingly Popular" truth (or a lie the model is trying to hide).
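The persona-level SP calculation might look like this. The persona names and log-probability values are hypothetical; in practice each persona's scores would be elicited from the model under its respective prompt:

```python
def surprisingly_popular_personas(persona_logprobs, baseline_names, expert_name):
    """SP across prompted personas (sketch): contrast the expert persona's
    log-probs against the averaged naive baselines."""
    scores = {}
    for tok in persona_logprobs[expert_name]:
        baseline = sum(persona_logprobs[n][tok] for n in baseline_names) / len(baseline_names)
        scores[tok] = persona_logprobs[expert_name][tok] - baseline
    return max(scores, key=scores.get)

# Hypothetical log-probabilities from three prompted personas on the
# Sally-Anne question.
logprobs = {
    "naive":   {"Box": -0.1, "Basket": -2.3},  # Agent A: first instinct
    "literal": {"Box": -0.2, "Basket": -1.8},  # Agent B: ignores context
    "expert":  {"Box": -0.9, "Basket": -0.5},  # Agent C: careful reasoner
}
print(surprisingly_popular_personas(logprobs, ["naive", "literal"], "expert"))  # Basket
```

"Box" scores well under every persona and cancels; "Basket" is surprisingly popular with the expert alone.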
5. Sycophancy Detection
The Concept:
Commercial LLMs often "mirror" the user's misconceptions (Sycophancy). If the user asks, "Why is Philadelphia the capital?" the model might play along.
The Extension:
Use DoLa to detect when the model acts as a "Yes Man."
Hypothesis:
Early Layers (The Yes Man): Will attend to the user's prompt ("Philadelphia") and copy it (Mimicry).
Late Layers (The Secret Truth): Will likely have a suppressed activation for "Harrisburg" (The Truth).
The Test: If $(\text{Expert} - \text{Naive})$ yields a completely different answer than the model's final output, the model is lying to you to be polite.
Application: A "Truth Verification" badge for AI outputs. "The model said X, but its internal state heavily suggested Y."
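A sketch of the verification check. All log-probabilities are invented; in practice they would be read from early and late layers of the model:

```python
def flag_sycophancy(final_answer, late_logprobs, early_logprobs):
    """'Yes Man' test (sketch): if the (late - early) contrast picks a
    different answer than the model's surface output, flag it."""
    contrast = {t: late_logprobs[t] - early_logprobs[t] for t in late_logprobs}
    internal = max(contrast, key=contrast.get)
    return internal != final_answer, internal

# Hypothetical values: the model echoes the user's "Philadelphia" premise,
# but its internal contrast favors "Harrisburg".
early = {"Philadelphia": -0.05, "Harrisburg": -3.0}  # mimicry of the prompt
late = {"Philadelphia": -0.6, "Harrisburg": -0.8}    # suppressed truth

flagged, internal = flag_sycophancy("Philadelphia", late, early)
print(flagged, internal)  # True Harrisburg
```

A `True` flag is the "Truth Verification" signal: the model said X, but its internal state heavily suggested Y.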
"The Ghost in the Gradients: Mechanistic Theory of Mind via Contrastive Layer Decoding and Prelec Aggregation"