Learning and planning with ensembles

We propose a reinforcement learning framework for discrete environments in which an agent optimizes its behavior on two timescales. For the short one, it uses tree search methods to perform tactical decisions. The long strategic level is handled with an ensemble of value functions learned using TD-like backups. Combining these two techniques brings synergies. The planning module performs what-if analysis allowing to avoid short-term pitfalls and boost backups of the value function. Notably, our method performs well in environments with sparse rewards where standard TD(1) backups fail. On the other hand, the value functions compensate for inherent short-sightedness of planning. Importantly, we use ensembles to measure the epistemic uncertainty of value functions. This serves two purposes: a) it stabilizes planning, b) it guides exploration.

We evaluate our methods on discrete environments with sparse rewards: the Deep sea chain environment, toy Montezuma's Revenge, and Sokoban. In all the cases, we obtain speed-up of learning and boost to the final performance



Below we provided experimental evidence to show that using risks measures are useful. We chose three environments, Deep-sea, toy Montezuma's Revenge and Sokoban. In each case we use an MCTS planner with the number of passes equal to 10. We consider this number to be a rather small for MCTS-like planning methods, still we observed that is is big enough to improve ensemble value functions.


Deep-sea environment was introduced in Osband et al., 2018, Section 4 and later included in Osband et al. (2019) as a benchmark for exploration under the name. The agent is located on a N x N grid, starting at (0, 0). In each timestep its y-coordinate is increased, while x is controlled. The agent issues actions [-1,1], which according to a prescribed mask are translated to one step "left" or one step "right" (decreasing, if possible, or increasing x respectively). For each step "right" the agent is punished with 0.01/N. After N steps the game ends and the agent receives reward +1 if and only if it reaches position (N, N). The aforementioned mask is randomized at each field at the beginning of training (and kept fixed).

(Left) Animation of the heatmaps on 50x50 board during the training process. Blue denotes ensembles disagreement. The board is explored from left to right. As the process progresses the ensembles agree more and more (white color). When the true reward is encountered strong it is first learned by a subset of ensembles, creating strong disagreement on the diagonal.

(Right) Analogous hitmaps for 20x20 boards. Note that the upper-right part of the board is unreachable.

Comparison of number of episodes needed to solve the deep-sea environment with given grid size N. Orange dots marks trials which were unable to solve problem in 30000 episodes

Montezuma Revenge

Toy Montezuma's Revenge is a navigation maze-like environment. It was introduced in Roderick et al. (2018) to evaluate ideas of using higher-level abstractions in long-horizon problems. While its visual layer is greatly reduced it retains much of the exploration problems of the original Motezuma's Revenge. This makes it a useful test environment for exploration algorithms, for example it was used in a recent self-imitation approach of Guo et al. (2019). From the presets of available rooms, in most of our experiments we work with the biggest map with 24 rooms. In order to concentrate on the evaluation of exploration we chose to work with sparse rewards. The agent gets reward 1 only if it reaches the treasure room , otherwise the episode is terminated after 300 steps.

The biggest Montezuma's Revenge map, consisting of 24 room. The goal is to reach the room marked with G. The agent needs to avoid traps (marked in red) and pass through doors (marked in blue). Doors are open using keys


Sokoban is an environment known for its combinatorial complexity. The agent's goal is to push all boxes to the designed spots. Apart for the navigational factor, the difficulty of this game is greatly increased by the fact that some actions are irreversible.

An archetypal example is pushing a box into a corner, though there are multiple less obvious cases. This is further confirmed by the fact that it is NP-hard, see e.g. Dor & Zwick (1999). Due to these challenges the game has been considered as a testbed for reinforcement learning and planning methods. The boards can be randomly procedurally generated and parameterized by their size and number of boxes. Exploration problem in Sokoban can be understood at two levels: a) it appears on one board, b) at meta-level exploration is needed to find a solution common to all boards.

In many RL approaches to Sokoban, it is used with intermediate rewards e.g. for pushing a box into a designed spot. In this work we use sparse setting. The agent is rewarded only when all boxes are in place. We conducted three lines of experiments using ensembles: a) learning solution to randomly generated boards, b) learning solution to single boards, c) transfer and learnable ensembles.

Example (10, 10) Sokoban board with 4 boxes. Boxes (yellow) are to be pushed by agent (green) to designed spots (red). The optimal solution has 37 steps.

Sokoban: Learning solution to randomly generated boards

In this experiment we measure the ability of our approach to solve randomly generated Sokoban boards. We measure the performance of the agent by computing the win rate on last 1000 games. In this experiment we use an ensemble value function using relative majority voting. Relative majority voting takes into account the disagreement of ensembles when it comes to the final outcome, not only the disagreement in assessment of particular action. After 80000 games it reaches 85% win rate compared to 76% of an agent not using ensembles. As a proxy measure of ensemble disagreement, we also present the mean standard deviation of value function across episodes (red curve in Figure). It suggests that the better the ensembles get, the more the agree on the values assigned to Sokoban states. It also indicates that ensembles reduce the amount of exploration the better they get (measured via win rate).

Learning curve (left axis) and standard deviation of ensemble values (right axis) in Sokoban. The plots are functions of the number of games

Sokoban: Learning solution to single boards

In this experiment we measure the extent in which our methods can plan and learn on single board of Sokoban. We note that this setting differs substantially from the one in the previous paragraph. While learning a generalized value function is typically a harder task, in our case it has also positive impact on the training process. There are significantly many boards which can be easily solved even at the starts (we tested that in our setting MCTS with randomly initialized network solves ~0.7% of boards), which can be used as positive examples. This leads to emergent implicit curriculum. In our experiments with single boards, we used setups similar to the one in Montezuma's Revenge. We observes that the setup with ensembles solves ~65% compare to ~50% the standard one network scenario. To ensure statistical significance in both of the cases we performed >250 experiments thus obtaining standard deviations <2.5%. The result of the standard setup might seem surprisingly good, taking into account the sparse reward. This follow by the loop avoidance described in the paper. In this experiment if during planning the agent finds itself in a situation from which it cannot find a novel state it is punished (with -2). This value is propagated to the learning process creating another form of exploration.

An example of a solution found by our system.

Sokoban: Transfer and learnable ensembles

Generating any new board can be seen as a cost dimension along with sample complexity. This quite naturally happens in meta-learning problems. We tested how value functions learned on small number of boards perform on new previously unseen ones. We used the following protocol, we trained value function on fixed number of 10 games. To ease the training we used relabeling akin to Andrychowicz et al. (2017). We evaluate these models on other boards. It turns out that they are typically quite weak, arguably it is not very surprising as solving 10 boards does not give much chance to infer 'the general' solutions. In the second phase we use ensembles of the models. More precisely, we calculate the values of n=2,3 models and aggregate it using a small neural networks with one fully connected hidden layer. This network is learnable and trained using the standard setup. We observe that the quality of such ensemble increases with the number of components. We observed high variability of the results over seeds, this is to be expected as board in Sokoban significantly vary in difficulty. We also observe that maximal results for transfer increase with the number of value function, being approximately 10%, 11% and 12%. This further supports the claim that ensembling may lead to improved performance.