The Decrypto Benchmark
for Multi-Agent Reasoning and Theory of Mind
Andrei Lupu, Timon Willi, Jakob Foerster
Preprint - Under review
Agentic LLMs are increasingly deployed in complex multi-agent scenarios, interacting, cooperating or competing with human users and other agents alike.
This requires multi-agent reasoning skills, and especially theory of mind (ToM) -- the ability to reason about the “mental” states of other agents. Despite this, ToM in LLMs remains poorly understood, with existing benchmarks suffering from narrow scope, confounding factors and a lack of interactivity.
We thus introduce Decrypto, an interactive language-based benchmark for multi-agent reasoning and ToM. Drawing inspiration from cognitive science, computational pragmatics and multi-agent reinforcement learning, Decrypto is designed to be as easy as possible along every other dimension, yet it reveals major shortcomings in the ToM abilities of frontier models.
At its core, Decrypto is:
A pragmatic inference game
A platform for interactive ToM experiments
An interactive multi-agent environment
As "easy" as possible for LLMs, eliminating confounding factors such as:
Embodied scenarios
Symbolic/Mathematical reasoning
Tokenisation
Long contexts
Based on an award-winning boardgame
Setup:
3 Agents, 2 teams:
Alice (Encoder) & Bob (Decoder)
Eve (Interceptor)
Alice and Bob get 4 secret keywords
Step 1: Encryption
Alice gets a random code of 3 non-repeating digits in {1, ..., 4}
She provides 3 hints, each referring to the meaning of the keyword at the corresponding code digit
Step 2: Decryption
Bob and Eve receive the hints and attempt to guess the code independently.
Step 3: Public Reveal and Update
Guesses and the code are publicly revealed
Code and hint histories are updated.
Then: The game continues for up to 8 rounds.
Eve wins if:
She intercepts the code twice or
Alice and Bob miscommunicate twice (i.e. Bob guesses the wrong code)
Alice & Bob win if:
They survive 8 rounds without accumulating 2 intercepts or 2 miscomms
All players have access to the histories at all times, but only Alice and Bob have access to the keywords. Alice's goal is to come up with hints that are easy for Bob to decode but hard for Eve. As the game progresses, the growing hint history makes it easier for Eve to intercept.
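To make the game flow concrete, below is a minimal sketch of one episode under the rules above. It is illustrative only: `alice`, `bob` and `eve` are placeholders for any policy (an LLM or a baseline), and the actual implementation in our codebase may differ in details.

```python
import random

def play_episode(alice, bob, eve, keywords, max_rounds=8):
    """One illustrative Decrypto episode under the rules above."""
    intercepts, miscomms, history = 0, 0, []
    for _ in range(max_rounds):
        code = random.sample([1, 2, 3, 4], 3)      # 3 non-repeating digits
        hints = alice(code, keywords, history)     # Step 1: encryption
        bob_code = bob(hints, keywords, history)   # Step 2: decryption (Bob knows the keywords)
        eve_code = eve(hints, history)             #         (Eve does not)
        intercepts += (eve_code == code)           # Step 3: public reveal and update
        miscomms += (bob_code != code)
        history.append((hints, code))
        if intercepts == 2 or miscomms == 2:
            return "Eve wins"
    return "Alice & Bob win"
```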
Now it's your turn: you are Bob! Given the hints “Two”, “Clone” and “AC/DC”, try to guess the code before reading the answer below.
The code is [4-1-3]:
“Two” refers to “two dimensions”, which is the defining characteristic of a geometric plane.
“Clone” is a hint for “star”, since both “clone” and “star” are common operations performed on a GitHub repository. Failing that, it can also be a reference to the “clone troopers” from Star Wars.
“AC/DC” is a rock band, and one of their most famous songs is “Thunderstruck”.
Intuitively, Alice must reason about both the in-game information and the real-world knowledge that Bob and Eve have. For instance, we assumed that most readers landing on this website would be familiar with GitHub, and we would have chosen a different hint had we known that Bob has no programming experience.
In the paper, we expand on this intuition by formalising Decrypto as a pragmatic inference game in the Rational Speech Act framework. From there, we explicitly derive the need to model other players for optimal play.
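For readers unfamiliar with RSA, the standard recursion [Frank & Goodman, 2012] is sketched below; the paper adapts it to Decrypto, where utterances roughly correspond to hints and meanings to keyword-code pairs, so treat this notation as illustrative rather than the paper's exact formalisation:

$$
\begin{aligned}
L_0(m \mid u) &\propto [\![u]\!](m)\, P(m) && \text{(literal listener)}\\
S_1(u \mid m) &\propto \exp\!\big(\alpha\,[\log L_0(m \mid u) - \mathrm{cost}(u)]\big) && \text{(pragmatic speaker)}\\
L_1(m \mid u) &\propto S_1(u \mid m)\, P(m) && \text{(pragmatic listener)}
\end{aligned}
$$

Here $[\![u]\!](m)$ is the literal truth-value of utterance $u$ under meaning $m$, and $\alpha$ is a rationality parameter. Optimal play for Alice then amounts to choosing hints that a Bob-like listener decodes correctly while an Eve-like listener does not, which is exactly where modelling the other players becomes necessary.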
We construct rule-based baselines using word embeddings (GloVe or Word2Vec) and a bank of >5000 hint words.
Alice: Samples from top K most similar hints to each keyword
Bob: Greedy match to keywords based on similarity
Eve: Greedy match to past hints based on similarity
By tweaking K, we can make the baselines arbitrarily strong, but only if they share the same embeddings (i.e. have the exact same notion of word similarity). Otherwise, they miscommunicate.
For all experiments below, we use K=16
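For concreteness, here is a minimal sketch of these baselines. It is hypothetical code: we assume `emb` maps words to NumPy vectors (e.g. loaded GloVe or Word2Vec embeddings) and `hint_bank` is the list of candidate hint words; the baselines in our repository may differ in details.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

def alice_hints(code, keywords, emb, hint_bank, k=16, rng=None):
    """For each code digit, sample a hint from the K words most similar
    to the corresponding keyword."""
    rng = rng or np.random.default_rng()
    hints = []
    for d in code:  # d in {1, ..., 4}
        kw_vec = emb[keywords[d - 1]]
        top_k = sorted(hint_bank, key=lambda w: -cosine(emb[w], kw_vec))[:k]
        hints.append(str(rng.choice(top_k)))
    return hints

def bob_guess(hints, keywords, emb):
    """Greedily match each hint to the most similar keyword (1-indexed)."""
    return [1 + max(range(len(keywords)),
                    key=lambda i: cosine(emb[h], emb[keywords[i]]))
            for h in hints]

def eve_guess(hints, past_hints, emb, rng=None):
    """Greedily match each hint to the most similar past hint and guess
    that hint's revealed digit; `past_hints` is a list of (hint, digit)
    pairs. With no history (turn 1), guess at random."""
    rng = rng or np.random.default_rng()
    if not past_hints:
        return [int(d) for d in rng.choice([1, 2, 3, 4], size=3, replace=False)]
    return [max(past_hints, key=lambda pd: cosine(emb[h], emb[pd[0]]))[1]
            for h in hints]
```

With a shared `emb`, Alice's top-K hints stay within Bob's notion of word similarity; this is why tuning K trades off interceptability against miscommunication, but only for teams with identical embeddings.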
Setup: We fix Eve and look at different Alice-Bob pairs.
Metrics: proportion of games ending in miscommunication, and average turns per episode
Takeaways:
Larger and more recent LLMs fare better in both roles
LLM-only teams are much weaker (i.e. they lose in fewer turns) than teams composed of two word-embedding baselines.
Setup: We only look at matchups where model_Alice = model_Bob
Metrics: proportion of games ending in interception, and average turns per episode
Takeaways:
DeepSeek-R1-32B is the strongest Alice-Bob team
GPT-4o is the strongest interceptor
Again, LLM teams far underperform word-embedding teams
Beyond game playing, Decrypto provides a platform for conducting interactive ToM experiments inspired by seminal works in cognitive psychology. We conduct two such experiments, evaluating three different ToM abilities in LLMs.
Representational Change (RC):
Prompt A: We ask Eve to predict the keywords.
Prompt B: We reveal the keywords and ask Eve what she thought were the keywords pre-reveal.
Comparing answers A and B evaluates the agent's ability to recognise when its belief about the world (but not the world itself) has changed due to additional information.
False Belief (FB):
Prompt A: We ask Eve to predict the keywords.
Prompt C: We reveal the keywords and ask Eve what a “second interceptor” would think the keywords to be.
Comparing answers A and C evaluates whether agents can model the incorrect beliefs of another agent.
Perspective Taking (PT):
We prompt Alice to provide hints, as usual
We then prompt her to predict Eve's guess
We look at:
Prediction accuracy
The percentage of times Alice predicts that Eve will guess the real code
The RC and FB experiments are based on the Smarties Task [Gopnik & Astington, 1988], while the PT experiment is inspired by the Three Mountain Problem [Piaget et al., 1956].
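A sketch of the probing logic is below. The prompts are paraphrased and `ask` is a stub (the paper's exact wording, answer parsing, and weak/strong scoring variants differ), but it shows how answers A, B and C are compared:

```python
def ask(model, prompt):
    """Stub: query `model` and parse its answer (e.g. four keyword guesses)."""
    raise NotImplementedError

def rc_fb_probe(eve_model, keywords):
    # Prompt A: Eve's belief before the reveal.
    a = ask(eve_model, "Predict the four secret keywords.")
    # Prompt B (Representational Change): recall her own prior belief.
    b = ask(eve_model, f"The keywords were {keywords}. What did you think "
                       "the keywords were before this reveal?")
    # Prompt C (False Belief): model another agent's (incorrect) belief.
    c = ask(eve_model, f"The keywords were {keywords}. What would a second "
                       "interceptor, who has not seen the reveal, think they are?")
    # B and C should match A, not the true keywords.
    return {"RC_consistent": b == a, "FB_consistent": c == a}

def pt_probe(alice_model, hints, code):
    # Perspective Taking: after hinting, Alice predicts Eve's guess.
    predicted = ask(alice_model, f"You gave the hints {hints}. "
                                 "What code will Eve guess?")
    return {"predicts_intercept": predicted == code}
```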
ToM Results: Most LLMs evaluated perform reasonably well on the weak variant of the RC and FB tasks, but fail on the strong variant, demonstrating poor self-consistency in representing their own beliefs and those of others. Surprisingly, Llama 3.1-70B outperforms some recent reasoning models such as DeepSeek-R1-Distilled-32B.
In PT, models regularly fail to consider the other agent’s point of view, predicting Eve's guess as if she had privileged information only accessible to Alice.
The PT results reveal that all models except Llama predict that Eve will intercept on nearly every turn, when the real interception rate is ∼ 52%, indicated by the dotted line.
In fact, models predict that Eve will intercept even on the first turn, with only Llama correctly pointing out that Eve can do no better than a random guess. Surprisingly, these results hold even if we modify the PT prompt to emphasize that Eve “does *NOT* know the secret keywords”.
This is a double failure of ToM:
first, a failure to reason from Eve's perspective;
second, a failure of consistency: if the model truly believed that Eve would intercept the hints, it should have chosen different hints!
We believe Decrypto to be a timely new benchmark, addressing an important gap in the evaluation of multi-agent reasoning and theory of mind, and opening up exciting new research directions.
To learn more, we invite you to read our paper, where we provide additional experiments and insights, including human studies.
Or you can download our code, and play Decrypto yourself with your favourite LLM! 👾
@article{lupu2025decrypto,
  title={The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind},
  author={Andrei Lupu and Timon Willi and Jakob Foerster},
  year={2025},
  eprint={2506.20664},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2506.20664},
}