Subgoal Search for

Complex Reasoning Domains

Konrad Czechowski*, Tomasz Odrzygóźdź*, Marek Zbysiński*, Michał Zawalski, Krzysztof Olejnik, Yuhuai Wu, Łukasz Kuciński, Piotr Miłoś

Humans excel in solving complex reasoning tasks through a mental process of moving from one idea to a related one. Inspired by this, we created Subgoal Search method, which allows solving complex reasoning tasks. Its key component is a learned subgoal generator, an analog of human intuition, that produces a diversity of subgoals that are both achievable and closer to the solution. Using subgoals reduces the search space and induces a high-level search graph suitable for efficient planning. arxiv, github

Main contributions

  1. We propose Subgoal Search method with two implementations: MCTS-kSubS, BF-kSubS. We demonstrate that our approach requires a relatively little search or, equivalently, is able to handle bigger problems. We also observe evidence of out-of-distribution generalization.

  2. We provide evidence that a transformer-based autoregressive model learned with a simple supervised objective to predict states k-th step ahead is an effective tool to generate valid and diverse subgoals.

  3. We study the negative influence of value function errors on planning and show that using subgoal planning might mitigate this problem.

The Figure to the right presents the performance of Subgoal Search, showing the success rate of the method as a function of the search budget.

(top, left): comparison on INT (with the proof length 15) to AlphaZero.

(top, right) BF-kSubS consistently achieves high performance even for small computational budgets

(bottom, left) similarly, on Sokoban (board size 12x12 with 4 boxes), the advantage of BF-kSubS is clearly visible for a small budget.

(bottom, right) BestFS fails to solve Rubik’s Cube, while BF-kSubS can achieve near-perfect performance

More results can be found here.

Complex Reasoning Domains

Sokoban is a single-player game in which a player controls an agent whose goal is to push the boxes on target locations. It is a difficult puzzle, due to its combinatorial complexity and the existence of irreversible states. Theoretically, deciding, if a given Sokoban board is solvable, is an NP-hard problem. Sokoban has recently been used to test the boundaries in RL.

Rubik’s Cube is a classical 3-dimensional puzzle. It is considered challenging due to the fact that the search space has more than 4.3 × 1018 configurations. Rubik’s Cube has been used as a testing ground for RL methods.

INT provides a generator of inequalities, which produces mathematical formulas along with their proofs. Proofs are represented as sequences of consecutively applied mathematical axioms (there are K = 18 axioms in total). An action in INT is a tuple containing an axiom and its input entities. The action space in this problem can be large, reaching up to 106 elements, which significantly complicates planning.

The Subgoal Search method

Subgoal Search method is designed for problems, which can be formulated as a search over a graph with a known transition model. Formally, let G = (S, E) be a directed graph and Ss ⊂ S be the set of success states.

We assume that, during the solving process, the algorithm can, for a given node g, determine the edges starting at g and check if g ∈ Ss. Subgoal Search method consists of four components: planner, subgoal generator, low-level policy, and a value function. The planner, coupled with a value function, is used to search over the graph induced by the subgoal generator. Namely, for each selected subgoal, the generator allows for sampling the candidates for the next subgoals. Only these reachable by the low-level policy are used. The procedure continues until the solution is found or the computational budget is exhausted. Our method searches over a graph Gs = (S, Es), with the edges Es given by the subgoal generator. Provided a reasonable generator, paths to Ss are shorter in Gs than in G and thus easier to find.

Algorithm 1 presents one of the implementations of our method: BF-kSubS, while the listing for low-level policy and the subgoal generator is given in Algorithm 2.

Subgoal generator. The subgoal generator takes as input an element of the s ∈ S and returns subgoals, a set of new candidate states expected to be closer to the solution.

Low-level (conditional) policy. It is used to generate a sequence of actions on how to reach a subgoal starting from a given initial state.

Value function. Assigns to each state a value related to its distance to the solution, and it is used to guide the search.

In the video below, we present the animation explaining our main algorithm.

The Subgoal Generator

The subgoal generator lies at the very heart of Subgoal Search, being an implementation of reasoning with high-level ideas. To be useful in a broad spectrum of contexts, the generator should be implemented as a learnable (generative) model. As a result, it is expected to be imperfect and (sometimes) generate incorrect predictions, which may turn the search procedure invalid.

Formally, the subgoal generation is a mapping ρ: S → P(S), where P(S) is a family of probability distributions over the environment’s state space S. More precisely, let us define a trajectory as a sequence of state-action pairs (s0, a0),(s1, a1), . . . ,(sn, an), with (si , ai) ∈ S × A, where A stands for the action space and n is the trajectory’s length. We will say that the generator predicts k-step ahead subgoals, if at any state s` it aims to predict smin(l+k,n) . We show, perhaps surprisingly, that this simple objective is an efficient way to improve planning, even for small values of k, i.e. k ∈ {2, 3, 4, 5}. Operationally, the subgoal generator takes as input an element of the s ∈ S and returns subgoals, a set of new candidate states expected to be closer to the solution.

In the video we present an example solution of a Sokoban board with Subgoal Search method, Consecutive frames represent consecutive subgoals on the path to a solution.