Abstract. Sequential multi-agent systems built on large language models can automate complex software tasks, but trust is hard to establish when errors propagate between stages. We study an accountable Planner → Executor → Critic pipeline with clearly defined roles, structured handoffs, and traceable records. Across 8 configurations spanning 3 LLMs and 3 benchmarks, we analyze where errors originate, how they propagate, and how they are repaired. Findings: (1) structured, accountable handoffs improve accuracy and reduce failure cascades; (2) roles exhibit distinct strengths and risks, which we quantify with repair and harm rates; (3) accuracy, cost, and latency trade-offs are task-dependent, with heterogeneous pipelines often the most efficient. We offer a practical, data-driven method for designing and debugging reliable, predictable, and accountable multi-agent systems.
Keywords: Multi-agent LLMs, sequential pipelines, role-based reasoning, agent collaboration, traceable pipeline, accountability, error propagation, repair and harm rates, cost and latency trade-offs.