Monte Carlo Tree Diffusion for System 2 Planning

Jaesik Yoon      Hyeonseo Cho      Doojin Baek      Yoshua Bengio      Sungjin Ahn


Diffusion models have recently emerged as a powerful tool for planning. However, unlike Monte Carlo Tree Search (MCTS)—whose performance naturally improves with additional test-time computation (TTC)—standard diffusion-based planners offer only limited avenues for TTC scalability. In this paper, we introduce Monte Carlo Tree Diffusion (MCTD), a novel framework that integrates the generative strength of diffusion models with the adaptive search capabilities of MCTS. Our method reconceptualizes denoising as a tree-structured process, allowing partially denoised plans to be iteratively evaluated, pruned, and refined. By selectively expanding promising trajectories while retaining the flexibility to revisit and improve suboptimal branches, MCTD achieves the benefits of MCTS such as controlling exploration-exploitation trade-offs within the diffusion framework. Empirical results on challenging long-horizon tasks show that MCTD outperforms diffusion baselines, yielding higher-quality solutions as TTC increases.

By denoising through a tree search, MCTD broadens the diversity of generated plans and improves performance on complex behavior-planning tasks at test time.

Monte Carlo Tree Diffusion


Monte Carlo Tree Diffusion consists of three key concepts that enable tree-structured planning: (1) Denoising as Tree-Rollout, (2) Guidance Levels as Meta-Actions, and (3) Jumpy Denoising as Simulation. These innovations form the foundation of MCTD, bridging the gap between traditional tree search methods and diffusion-based planning.

Algorithm

Denoising as Tree-Rollout

Unlike a standard Diffuser (where every subplan follows the same denoising schedule), our approach assigns a separate denoising pace to each subplan—faster for the earlier segments and slower for the later ones. This makes the process causal and semi-autoregressive, so that the future is denoised based on the already refined past. At the same time, we still retain the globally consistent and holistic generation benefits of a Diffuser. In other words, our denoising strategy achieves a semi-autoregressive effect without losing the advantages of a fully global approach.

In MCTD, we represent each subplan (a temporally extended state) as a single node in the search tree, rather than using individual time steps. By operating at this higher level of abstraction, the tree search becomes more efficient and scalable.
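To make the schedule concrete, below is a minimal sketch (Python/NumPy) of a semi-autoregressive schedule in which each subplan keeps its own noise level and earlier subplans are denoised ahead of later ones. The toy denoise_step, the subplan sizes, and the pace offsets are illustrative assumptions standing in for a trained diffusion model and the paper's actual hyperparameters.

import numpy as np

rng = np.random.default_rng(0)

H, S, D = 16, 4, 2                   # horizon per subplan, number of subplans, state dim
T = 100                              # total diffusion steps
plan = rng.normal(size=(S, H, D))    # fully noised plan, split into S subplans

def denoise_step(x, t):
    # Toy stand-in for one reverse-diffusion step of a trained model at noise level t.
    return 0.95 * x

noise_level = np.full(S, T)            # every subplan starts fully noised
pace_offset = np.arange(S) * (T // S)  # later subplans begin denoising later

for step in range(2 * T):
    for s in range(S):
        # Subplan s only starts denoising after its offset has elapsed, so the
        # future is refined conditioned on an already-refined past (causal order).
        if step >= pace_offset[s] and noise_level[s] > 0:
            plan[s] = denoise_step(plan[s], noise_level[s])
            noise_level[s] -= 1

print(noise_level)   # all zeros: every subplan fully denoised, earlier ones first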

Guidance Levels as Meta-Actions

In standard MCTS, building and searching the tree becomes computationally expensive when the action space is large and is essentially infeasible for continuous actions. To tackle these challenges, MCTD introduces the idea of meta-actions, which redefine the exploration–exploitation balance at a higher level. Meta-actions can be any discrete decisions that steer how the algorithm explores or exploits. In this work, we implement them as different guidance levels during the denoising process.
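As a concrete illustration, the sketch below treats each child of a tree node as one discrete guidance level (e.g., no guidance vs. goal-directed guidance) and selects among them with a standard UCB rule. The two-level meta-action set and the UCB constant are illustrative assumptions; the paper's actual guidance set and selection details may differ.

import math

# Hypothetical discrete meta-action set: how strongly to guide denoising.
META_ACTIONS = {"no_guidance": 0.0, "guidance": 1.0}

class Node:
    def __init__(self):
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}   # meta-action name -> child Node

def ucb_select(node, c=1.4):
    # Pick the meta-action (guidance level) with the highest UCB score.
    best, best_score = None, -float("inf")
    for name in META_ACTIONS:
        child = node.children.setdefault(name, Node())
        if child.visits == 0:
            return name                      # expand unvisited guidance levels first
        exploit = child.value_sum / child.visits
        explore = c * math.sqrt(math.log(node.visits + 1) / child.visits)
        score = exploit + explore
        if score > best_score:
            best, best_score = name, score
    return best

root = Node()
root.visits = 10
choice = ucb_select(root)
print(choice, META_ACTIONS[choice])   # guidance scale to apply when denoising this subplan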

Jumpy Denoising as Fast Simulation

In MCTS, it’s crucial to evaluate a node before reaching a leaf where the plan can be fully tested. Typically, you either roll out the full trajectory with a forward dynamics model (which can be very costly) or approximate the node’s value via bootstrapping (faster but less accurate).

In MCTD, we address this with a fast jumpy denoising strategy built on DDIM. Once the denoising process reaches a node's subplan, the remaining portion of the plan is denoised quickly in jumps of C steps rather than one step at a time. This combines speed with effective value estimation during search.
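A minimal sketch of the idea, assuming a DDIM-style sampler: the still-noisy future portion of the plan is rolled out with a coarse step size C, trading some fidelity for a much cheaper value estimate. The step size and the toy ddim_step are illustrative; a real implementation would call the trained diffusion model's deterministic DDIM update.

import numpy as np

def ddim_step(x, t_from, t_to):
    # Toy stand-in for a deterministic DDIM update from noise level t_from down to t_to.
    shrink = t_to / max(t_from, 1)
    return x * shrink

def jumpy_denoise(x, t_start, C=10):
    # Denoise from t_start to 0 in jumps of C steps instead of one step at a time.
    t = t_start
    while t > 0:
        t_next = max(t - C, 0)
        x = ddim_step(x, t, t_next)
        t = t_next
    return x

rng = np.random.default_rng(0)
future = rng.normal(size=(64, 2))                        # noised future portion of the plan
rough_plan = jumpy_denoise(future, t_start=100, C=10)    # ~10 model calls instead of 100
# rough_plan can now be scored (e.g., distance of its final state to the goal)
# to estimate the node's value without a full-fidelity rollout.
print(rough_plan.shape)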

Experimental Results


We evaluate the proposed approach, MCTD, on a suite of tasks from the Offline Goal-conditioned RL Benchmark (OGBench).

Maze Navigation with Point-Mass and Ant Robots

We evaluated MCTD on the pointmaze and antmaze tasks from OGBench, featuring mazes of varying sizes (medium, large, and giant). The agent is rewarded for reaching a designated goal region, and the relevant dataset (navigate) comprises long-horizon trajectories that challenge each method's capacity for exploration.

Table 1 presents success rates across medium, large, and giant mazes for both point-mass and ant robots. MCTD consistently surpasses the other methods by a large margin; notably, it achieves nearly perfect performance on the largest map, giant.

Robot Arm Cube Manipulation

To further validate MCTD’s search capabilities, we tested it on multi-cube manipulation tasks from OGBench, where a robot arm must move one to four cubes to specified table locations. As the number of cubes increases, both the planning horizon and the combinatorial complexity grow substantially.

MCTD-Replanning outperforms the other models by a significant margin in the multi-cube scenarios. MCTD alone exhibits moderate gains (e.g., 22% on two cubes), though it suffers from entanglement of its holistic plan when multiple objects must be manipulated jointly.

Visual Pointmaze

To evaluate our method in image-based planning, particularly under partial observability, we introduce a visual pointmaze environment where the agent receives top-down pixel images of the initial and goal states.

(Qualitative comparison on the visual pointmaze: Diffusion Forcing vs. MCTD.)

MCTD and MCTD-Replanning also outperform the baselines on this image-based, partially observable task.

Test-Time Compute Scalability & Time Complexity

 We examine how each planner leverages additional test-time computation by varying the maximum number of denoising steps and measuring both success rate and runtime on the large and giant pointmaze tasks.

 As the TTC budget grows, MCTD consistently achieves high success rates—ultimately nearing perfection in giant mazes.

 MCTD incurs additional overhead from maintaining and expanding a tree—leading to a larger runtime as the budget grows. For instance, Diffuser-Random Search shows moderate runtime growth because multiple samples can be denoised in parallel batches, which is not directly feasible in MCTD’s tree-structured approach. Nonetheless, MCTD can terminate early whenever a successful plan is found, thereby limiting unnecessary expansions. As a result, on the large maze, MCTD’s runtime quickly saturates once the task can be solved with a smaller search depth, rather than fully consuming the available budget. A similar pattern emerges in the giant maze: as success rates increase under higher budgets, the runtime growth for MCTD remains sublinear, indicating that many expansions terminate early once suitable trajectories are located.
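A rough sketch of this early-termination behavior, assuming hypothetical expand_and_simulate and is_successful helpers: the search loop stops consuming the compute budget as soon as any rollout reaches the goal, which is why runtime saturates on maps that can be solved with a shallow search.

def mctd_search(root, budget, expand_and_simulate, is_successful):
    # expand_and_simulate(root) -> candidate plan from one tree expansion + jumpy rollout
    # is_successful(plan)       -> True if the plan reaches the goal
    best_plan = None
    for used in range(budget):
        plan = expand_and_simulate(root)
        if is_successful(plan):
            return plan, used + 1        # stop early: remaining budget is left unused
        best_plan = plan
    return best_plan, budget

# Toy usage with dummy helpers: every third expansion "succeeds".
calls = {"n": 0}
def expand_and_simulate(_root):
    calls["n"] += 1
    return {"ok": calls["n"] % 3 == 0}

plan, used = mctd_search(root=None, budget=50,
                         expand_and_simulate=expand_and_simulate,
                         is_successful=lambda p: p["ok"])
print(used)   # 3: far below the budget of 50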

Conclusion

 We introduced Monte Carlo Tree Diffusion (MCTD), a framework designed to combine the best of both worlds: the structured search of Monte Carlo Tree Search and the generative flexibility of diffusion planning to enhance the test-time compute scalability of System 2 planning. MCTD leverages meta-actions for adaptive exploration-exploitation, tree-structured denoising for efficient diffusion-based expansion, and fast jumpy denoising for rapid simulation. Experimental results demonstrate that MCTD outperforms existing approaches in various planning tasks, achieving superior scalability and solution quality. Future work will explore adaptive compute allocation, learning-based meta-action selection, and reward shaping to further enhance performance, paving the way for more scalable and flexible System 2 planning.

Citation

@misc{yoon2025montecarlotreediffusion,
      title={Monte Carlo Tree Diffusion for System 2 Planning},
      author={Jaesik Yoon and Hyeonseo Cho and Doojin Baek and Yoshua Bengio and Sungjin Ahn},
      year={2025},
      eprint={2502.07202},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2502.07202},
}