Full Paper: https://arxiv.org/pdf/2503.13657v2
Code and Dataset: https://github.com/multi-agent-systems-failure-taxonomy/MAST
"Successful systems all work alike; each failing system has its own problems." (Berkeley, 2025)
Despite the increasing adoption of Multi-Agent Systems (MAS), their performance gains often remain minimal compared to single-agent frameworks. Why do MAS fail?
We conducted a systematic evaluation of MAS execution traces using Grounded Theory and introduced the Multi-Agent System Failure Taxonomy (MAST).
We also developed a scalable LLM-as-a-judge evaluation pipeline, built on MAST, to diagnose MAS failure modes directly from their execution traces.
We demonstrated through case studies that the failures identified by MAST often stem from system design and inter-agent interaction issues rather than LLM limitations or poor prompting alone, and that they resist superficial fixes, highlighting the need for structural MAS redesigns.
While the formal definition of agents remains debated, this study defines an LLM-based agent as an artificial entity with three components: (1) prompt specifications (initial state), (2) conversation trace (state), and (3) the ability to interact with environments, such as tool usage (action). A multi-agent system (MAS) is defined as a collection of agents designed to interact through orchestration, enabling collective intelligence. Despite the increasing adoption of MAS, their performance gains often remain minimal compared to single-agent frameworks or simple baselines like best-of-N sampling. Our empirical analysis reveals high failure rates even for state-of-the-art (SOTA) open-source MAS; for instance, ChatDev achieves only 33.33% correctness on our ProgramDev benchmark (Figure on the right).
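To make this definition concrete, here is a minimal Python sketch of the three components; the class and field names are illustrative assumptions, not code from the paper or from any of the surveyed frameworks.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical sketch of the agent definition above; names are illustrative only.
@dataclass
class Agent:
    system_prompt: str                                          # (1) prompt specification: initial state
    trace: List[Dict[str, str]] = field(default_factory=list)   # (2) conversation trace: state
    tools: Dict[str, Callable] = field(default_factory=dict)    # (3) environment interaction: actions

    def act(self, tool_name: str, *args, **kwargs):
        """Invoke a tool (action) and record the result in the trace (state update)."""
        result = self.tools[tool_name](*args, **kwargs)
        self.trace.append({"role": "tool", "name": tool_name, "content": str(result)})
        return result

# A MAS is then a collection of agents plus an orchestration rule over them.
@dataclass
class MultiAgentSystem:
    agents: List[Agent]
    orchestrator: Callable[[List[Agent], str], str]  # decides which agent speaks or acts next
```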
To understand MAS failures, we conducted the first systematic evaluation of MAS execution traces using Grounded Theory and iterative refinement. We analyzed 7 popular open-source MAS frameworks (MetaGPT, ChatDev, HyperAgent, OpenManus, AppWorld, Magentic, AG2) across 200 conversation traces (each averaging over 15,000 lines of text) from diverse tasks, employing expert human annotators. Through this process, we uncovered 14 distinct failure modes, which we organized into MAST, the Multi-Agent System Failure Taxonomy.
MAST classifies these failures into three key categories (a compact sketch of this structure follows the list):
Specification Issues: These occur when the initial instructions or system architecture are flawed, causing agents to misinterpret tasks or violate constraints.
Inter-Agent Misalignment: Arising from poor communication or collaboration among agents, these failures can derail the entire process, wasting valuable computational resources.
Task Verification: These happen when a MAS ends tasks prematurely or verifies results inadequately, leading to incomplete or incorrect outcomes.
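For reference, the three categories can be represented as a simple lookup table. The sketch below is illustrative and lists only the two failure modes named later in this post; see the paper for the full set of 14 modes.

```python
# Illustrative sketch of MAST's top-level structure (not the complete taxonomy).
MAST = {
    "FC1: Specification Issues": {
        "FM-1.3": "Step repetition",
    },
    "FC2: Inter-Agent Misalignment": {
        # failures arising from poor communication or collaboration between agents
    },
    "FC3: Task Verification": {
        "FM-3.1": "Premature termination",
    },
}
```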
After developing the taxonomy, MAST, and completing the inter-annotator agreement studies, we built an automated way to detect and diagnose failures from MAS execution traces using an LLM-as-a-judge pipeline.
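A minimal sketch of how such a judge can be wired up is shown below. The prompt wording, output schema, and model choice are placeholder assumptions for illustration, not the exact pipeline released with MAST.

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat-capable LLM client would do

client = OpenAI()

JUDGE_PROMPT = """You are given the execution trace of a multi-agent system.
For each MAST failure mode (FM-1.1 ... FM-3.3), decide whether it occurs.
Return JSON mapping each code to {"occurs": "yes" or "no", "evidence": "<quote>"}.
Failure mode definitions: <taxonomy definitions go here>"""

def judge_trace(trace_text: str, model: str = "gpt-4o") -> dict:
    """Score one MAS execution trace against MAST with an LLM-as-a-judge."""
    response = client.chat.completions.create(
        model=model,                               # model choice is illustrative
        response_format={"type": "json_object"},   # request parseable JSON output
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": trace_text},
        ],
    )
    return json.loads(response.choices[0].message.content)
```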
The distribution of failures across MAST’s categories is relatively balanced (FC1: 41.77%, FC2: 36.94%, FC3: 21.30%; Figure 2). The absence of a single dominant category suggests that MAST provides balanced coverage and captures diverse failure types, rather than reflecting biases from specific system designs. Furthermore, the distinct failure profiles observed across different MAS (Figure 4) highlight MAST’s ability to capture system-specific characteristics, such as AppWorld’s susceptibility to premature termination (FM-3.1) and OpenManus’s tendency toward step repetition (FM-1.3).
We performed two case studies to show that easy-to-implement fixes are not sufficient to deliver robust MAS.
The first case study uses the MathChat scenario implemented in AG2, where a Student agent collaborates with an Assistant agent capable of Python code execution to solve math problems. To improve performance, we tried two strategies. The first improves the original prompt with a clearer structure and a new section dedicated to verification. The second refines the agent configuration into a more specialized system with three distinct roles: a Problem Solver, who solves the problem with chain-of-thought reasoning and no tools; a Coder, who writes and executes Python code to derive the final answer; and a Verifier, who reviews the discussion and critically evaluates the proposed solutions, either confirming the answer or prompting further debate. For benchmarking, we randomly selected 200 exercises from the GSM-Plus dataset.
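The three-role configuration can be expressed with AG2's (AutoGen) GroupChat API roughly as follows. The system messages are paraphrased and the llm_config is a placeholder, so treat this as a sketch of the setup rather than the exact configuration used in the study.

```python
import autogen  # AG2 (formerly AutoGen)

llm_config = {"config_list": [{"model": "gpt-4o"}]}  # placeholder model configuration

problem_solver = autogen.AssistantAgent(
    name="ProblemSolver",
    system_message="Solve the math problem step by step using chain-of-thought reasoning. Do not use tools.",
    llm_config=llm_config,
)
coder = autogen.AssistantAgent(
    name="Coder",
    system_message="Write and run Python code that computes the final numeric answer.",
    llm_config=llm_config,
)
verifier = autogen.AssistantAgent(
    name="Verifier",
    system_message=("Review the discussion, critically evaluate both solutions, and either "
                    "confirm the final answer or request another round of debate."),
    llm_config=llm_config,
)
executor = autogen.UserProxyAgent(  # executes the Coder's Python blocks
    name="Executor",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "math", "use_docker": False},
)

group_chat = autogen.GroupChat(
    agents=[problem_solver, coder, executor, verifier], messages=[], max_round=12
)
manager = autogen.GroupChatManager(groupchat=group_chat, llm_config=llm_config)
executor.initiate_chat(manager, message="<GSM-Plus problem text>")
```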
The second case study uses ChatDev, which simulates a multi-agent software company where agents take on different roles, such as CEO, CTO, software engineer, and reviewer, and collaborate to solve a software generation task. As in the first case study, the first intervention refines role-specific prompts to enforce hierarchy and role adherence, while the second intervention makes a fundamental change to the framework’s topology, from a directed acyclic graph (DAG) to a cyclic graph.
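Conceptually, the topology change amounts to adding back edges to the phase graph so that later phases can reopen earlier ones. The adjacency lists below are a schematic illustration with made-up phase names, not ChatDev's actual configuration format.

```python
# Schematic phase graphs; node names are illustrative, not ChatDev's real phase identifiers.
dag_topology = {
    "Design":  ["Coding"],
    "Coding":  ["Review"],
    "Review":  ["Testing"],
    "Testing": [],            # terminal: no way to revisit earlier phases
}

cyclic_topology = {
    "Design":  ["Coding"],
    "Coding":  ["Review"],
    "Review":  ["Coding", "Testing"],   # back edge: failed reviews reopen coding
    "Testing": ["Review"],              # failed tests trigger another review round
}
```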
While these interventions yield some improvements (e.g., +15.6% for ChatDev), the results show that simple fixes are still insufficient for achieving reliable MAS performance. Mitigating identified failures will require more fundamental changes in system design.
In both cases, we can use the LLM-as-a-judge pipeline to obtain detailed failure breakdowns before and after these interventions, showing how MAST provides actionable insights for debugging and development.
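As a hypothetical usage example, the judge sketched earlier can be run over traces collected before and after an intervention and the per-mode counts compared; the aggregation below assumes the placeholder output schema defined in that sketch.

```python
from collections import Counter

def failure_breakdown(traces: list[str]) -> Counter:
    """Count how often each MAST failure mode fires across a set of traces."""
    counts = Counter()
    for trace in traces:
        verdicts = judge_trace(trace)  # judge_trace from the earlier sketch
        counts.update(fm for fm, v in verdicts.items() if v.get("occurs") == "yes")
    return counts

# before = failure_breakdown(baseline_traces)
# after  = failure_breakdown(improved_traces)
# print({fm: (before[fm], after[fm]) for fm in before | after})
```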
Effect of prompt and topology interventions on AG2, as captured by MAST using the automated LLM-as-a-judge.
Effect of prompt and topology interventions on ChatDev, as captured by MAST using the automated LLM-as-a-judge.