Uncertainty-sensitive Learning and Planning with Ensembles
We propose a reinforcement learning framework for discrete environments in which an agent makes both strategic and tactical decisions. The former manifests itself through the use of a value function, while the latter is powered by a tree-search planner. These tools complement each other. The planning module performs a local what-if analysis, which allows avoiding tactical pitfalls and boosts backups of the value function. The value function, being global in nature, compensates for the inherent locality of the planner. In order to further solidify this synergy, we introduce an exploration mechanism with two components: uncertainty modelling and risk measurement. To model the uncertainty, we use value function ensembles, and to reflect risk, we use several functionals that summarize the uncertainty implied by the ensemble. We show that our method performs well on hard exploration environments: Deep-sea, toy Montezuma's Revenge, and Sokoban. In all cases, we obtain a speed-up in learning and a boost in performance.
Method
Experiments
Below we provide experimental evidence that using ensembles and risk measures is useful. We chose three environments: Deep-sea, toy Montezuma's Revenge, and Sokoban. In all cases we work with sparse-reward versions of the environments, i.e., the agent is rewarded only upon successful completion of the task.
We use an MCTS planner with the number of passes equal to 10 (see line 4 of Algorithm 2), which is rather modest for MCTS-like planning methods (e.g. the recent work Schrittwieser et al. (2019) uses 800 passes to plan in board games). Interestingly, we observed that such a relatively weak planner is sufficient to obtain a well-performing algorithm.
In the Sokoban multi-board experiments, we use a learned model; otherwise we assume access to the perfect model. A good model of the Sokoban environment can be acquired from random trajectories, which is unlikely to be possible in other domains. Extending our algorithm to these cases is an exciting research direction, which will require the development of planning methods robust to model errors, possibly again using ensembles (see e.g. Kurutach et al. (2018)), and training the model and the policy in an interleaved fashion (e.g. similar to Kaiser et al. (2019)).
We utilize various neural network architectures. We measure uncertainty using the standard deviation, except for Sokoban with randomly generated boards, where voting was used; see equation (*).
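To make the risk measures concrete, below is a minimal sketch (in Python, with hypothetical names) of the standard-deviation-based functional: the ensemble's value predictions for a state are summarized by their mean plus κ times their standard deviation, with κ=0 recovering the plain mean. This is an illustration of the idea rather than our exact implementation.

```python
import numpy as np

def risk_adjusted_value(ensemble_values, kappa=50.0):
    """Summarize an ensemble of value estimates into a single optimistic score.

    Sketch only: `ensemble_values` is a 1-D array of value predictions for the
    same state, one per (sub-sampled) ensemble member, and `kappa` scales the
    uncertainty bonus (kappa=0 recovers the plain ensemble mean).
    """
    values = np.asarray(ensemble_values, dtype=np.float64)
    return values.mean() + kappa * values.std()

# Example: a disagreeing ensemble gets a larger bonus than an agreeing one.
print(risk_adjusted_value([0.1, 0.9, 0.5], kappa=50.0))  # high std -> large bonus
print(risk_adjusted_value([0.5, 0.5, 0.5], kappa=50.0))  # zero std -> just the mean
```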
Deep-sea
The Deep-sea environment was introduced in Osband et al. (2018, Section 4) and later included in Osband et al. (2019) as a benchmark for exploration. The agent is located in the upper-left corner (position (0,0)) of an N x N grid. At each timestep its y-coordinate is increased, while x is controlled by the agent. The agent issues actions in {-1, 1}, which are translated into a step left or a step right (changing x by -1 or +1, respectively, as long as x stays non-negative; otherwise x remains unchanged) according to a prescribed action mask (not to be confused with the transition masks in Algorithm 1).
For each step right the agent is penalized with 0.01/N. After N steps the game ends, and the agent receives reward +1 if and only if it reaches position (N, N). The action mask mentioned above is randomized for each field at the beginning of training (and kept fixed thereafter).
The game is purposely constructed so that naive random exploration schemes fail already for small N. Indeed, a random agent has a (1/2)^N chance of reaching the goal, even disregarding the misleading penalties for stepping right.
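For concreteness, the sketch below implements the Deep-sea dynamics as we describe them above; class and attribute names, the 0-based indexing of the goal cell, and the exact convention for the randomized action mask are our assumptions.

```python
import numpy as np

class DeepSea:
    """Minimal sketch of the Deep-sea dynamics described above: N x N grid,
    randomized per-cell action mask, 0.01/N penalty for stepping right,
    +1 for reaching the far corner after N steps."""

    def __init__(self, n, seed=0):
        self.n = n
        rng = np.random.default_rng(seed)
        # Per-cell mask deciding whether action -1 means "left" or "right".
        self.action_mask = rng.integers(0, 2, size=(n, n)) * 2 - 1  # entries in {-1, +1}

    def reset(self):
        self.x, self.y = 0, 0
        return (self.x, self.y)

    def step(self, action):                                # action in {-1, +1}
        assert action in (-1, 1)
        move = action * self.action_mask[self.x, self.y]   # mask flips the meaning per cell
        reward = -0.01 / self.n if move == 1 else 0.0      # penalty for stepping right
        self.x = min(max(self.x + move, 0), self.n - 1)    # stepping left at x=0 is a no-op
        self.y += 1
        done = self.y >= self.n
        if done and self.x == self.n - 1:                  # reached the rewarding corner
            reward += 1.0
        return (self.x, self.y), reward, done
```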
The exploration progress of our method is shown in the figure below. It compares non-ensemble models, ensemble models with Thompson sampling but without the uncertainty bonus (κ=0), and our final ensemble model with uncertainty bonus κ=50. We conclude that using both sub-sampling and ensembles is essential to achieving good exploration.
Heatmaps of the standard deviation of ensemble values in the Deep-sea environment. High values are marked in blue and low values in white. At the beginning of training the standard deviation is high for all states. Gradually it decreases in the states that have been explored, until finally the rewarding state is found. Note that the upper-right part of the board is unreachable.
(Left) Animation of the heatmaps on a 50x50 board during the training process.
(Right) Analogous heatmaps for a 20x20 board.
Comparison of the number of steps needed to solve the Deep-sea environment for a given grid size N. Orange dots mark trials that were unable to solve the problem within 1,400,000 steps. Large problem instances (N>20) are solved only when the exploration bonus is used (right-most plot, κ=50).
Montezuma's Revenge
Toy Montezuma's Revenge is a navigation, maze-like environment. It was introduced in Roderick et al. (2018) to evaluate ideas of using higher-level abstractions in long-horizon problems. While its visual layer is greatly simplified, it retains much of the exploration difficulty of the original Montezuma's Revenge. This makes it a useful test environment for exploration algorithms; for example, it was used in the recent self-imitation approach of Guo et al. (2019). From the available preset maps, in most of our experiments we work with the biggest one, with 24 rooms. In order to concentrate on the evaluation of exploration, we chose to work with sparse rewards. The agent gets reward 1 only if it reaches the treasure room; otherwise the episode is terminated after 300 steps.
It is expected that any simple exploration technique would fail in this case (we provide some baselines in the table on the right). Guo et al. (2019) benchmark PPO, PPO with self-imitation learning (PPO+SIL), PPO with a count-based exploration bonus (PPO+EXP), and their new technique (DTSIL). Only DTSIL is consistently able to solve the 24-room challenge, with PPO+EXP occasionally reaching this goal. Our method, based on ensembles and model-based planning, solves this exploration challenge even in the harder, sparse-reward case. The results are summarized in the table on the right. We have three setups: no-ensemble; ensemble, no std; ensemble, std.
In the first case, we train using Algorithm 1 with a single neural network. In the second case, for each episode we sub-sample 10 members of an ensemble of size 20, and MCTS is guided by their mean. In the final, third case, we follow the same protocol but add the standard deviation to the mean.
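The following sketch illustrates this per-episode protocol of the two ensemble setups: sub-sample 10 of the 20 ensemble members and guide MCTS by their mean, optionally adding the standard deviation. The function names and the assumption that each member is a callable mapping a state to a scalar value are ours.

```python
import random
import numpy as np

def episode_value_fn(value_networks, subset_size=10, use_std=True, kappa=1.0):
    """Return a state-value function to guide MCTS for a single episode.

    Sketch of the protocol described above: sub-sample `subset_size` members
    of the ensemble (here 10 out of 20) and use their mean, optionally adding
    the standard deviation scaled by `kappa` as an exploration bonus.
    `value_networks` is assumed to be a list of callables state -> float.
    """
    members = random.sample(value_networks, subset_size)

    def value(state):
        predictions = np.array([net(state) for net in members])
        bonus = kappa * predictions.std() if use_std else 0.0
        return predictions.mean() + bonus

    return value
```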
In our experiments we observe that no-ensemble fails to leave the first room in 30 out of 43 cases. The setup without an explicit exploration bonus, ensemble, no std, performs only slightly better. Finally, we observe that ensemble, std behaves very well.
The biggest Toy Montezuma's Revenge map, consisting of 24 rooms. The goal is to reach the room marked with G. The agent needs to avoid traps (marked in red) and pass through doors (marked in blue). Doors are opened using keys (marked in yellow).
Sokoban
Sokoban is an environment known for its combinatorial complexity. The agent's goal is to push all boxes to designated spots. Apart from the navigational challenge, the difficulty of the game is greatly increased by the fact that some actions are irreversible.
A canonical example of such an action is pushing a box into a corner, though there are multiple less obvious cases. Formally, this difficulty manifests itself in the fact that deciding whether a Sokoban level is solvable is NP-hard, see e.g. Dor & Zwick (1999). Due to these challenges, the game has been used as a testbed for reinforcement learning and planning methods.
Operationally, to generate Sokoban levels we use the automated procedure proposed by Racanière et al. (2017). Some RL approaches to solving Sokoban (e.g. Racanière et al. (2017); Guez et al. (2019)) assume an additional reward signal in the game, e.g. for pushing a box onto a designated spot. In this work we use the sparse setting, that is, the agent is rewarded only when all the boxes are in place. It is interesting to note that Sokoban offers two exploration problems: single-level-centric, where level-specific exploration is needed, and multi-level-centric, where a 'meta-exploration' strategy is required, one that works in a level-agnostic manner or can quickly adapt. Accordingly, we conducted three lines of experiments using ensembles: a) learning to solve randomly generated boards (dubbed multiple-board Sokoban), b) learning to solve a single board (dubbed single-board Sokoban), c) transfer and learnable ensembles.
Example (10, 10) Sokoban board with 4 boxes. Boxes (yellow) are to be pushed by the agent (green) to designated spots (red). The optimal solution has 37 steps.
Multiple-board Sokoban: learning to solve randomly generated boards
In this experiment we measure the ability of our approach to solve randomly generated Sokoban boards. We also show that the approach is flexible enough to accommodate the use of a learned model of dynamics. More precisely, the planning described in Algorithm 2 is performed using this model.
The model is trained on a set of trajectories obtained by a random (uniform) agent. The major difficulty in obtaining the model is learning the (sparse) rewards: the fraction of non-zero-reward transitions is less than 3e-6. To tackle this problem, we generated a dataset consisting of boards with 1, 2, 3, and 4 boxes and substantially upsampled rewarded transitions.
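A minimal sketch of the upsampling step, assuming transitions are stored as (observation, action, next observation, reward) tuples; the duplication factor is a placeholder, not the value used in our experiments.

```python
import random

def upsample_rewarded(transitions, factor=100):
    """Sketch of balancing a model-training dataset with extremely sparse rewards.

    `transitions` is assumed to be a list of (obs, action, next_obs, reward)
    tuples collected by a random agent; rewarded transitions are duplicated
    `factor` times so the model sees them often enough to learn the reward.
    """
    rewarded = [t for t in transitions if t[3] != 0]
    unrewarded = [t for t in transitions if t[3] == 0]
    dataset = unrewarded + rewarded * factor
    random.shuffle(dataset)
    return dataset
```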
To measure uncertainty we use an ensemble value function with relative majority voting, as formalized in equation (*). Relative majority voting takes into account the uncertainty of the ensemble with respect to the final outcome, not only the uncertainty in the assessment of particular actions.
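As an illustration, the sketch below shows one natural reading of relative majority voting: each ensemble member votes for its preferred action, and actions are scored by their vote shares. It is meant as intuition for equation (*), not a verbatim reproduction of it.

```python
import numpy as np

def relative_majority_vote(ensemble_action_values):
    """Sketch of aggregating an ensemble by relative-majority (plurality) voting.

    `ensemble_action_values` is assumed to have shape (n_members, n_actions):
    each member votes for its own best action, and the fraction of votes an
    action receives serves as its aggregated score; the argmax of the returned
    vector is the chosen action.
    """
    q = np.asarray(ensemble_action_values)
    votes = np.bincount(q.argmax(axis=1), minlength=q.shape[1])
    return votes / q.shape[0]   # per-action vote shares
```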
After 10 million steps in the real environment, our method reaches an 89.0% win rate, compared to 84.7% for an agent not using ensembles; see the figure on the right (the win rate is calculated over the last 1000 games). We also see that using the learned model yields results practically equivalent to those obtained with the perfect model.
As a measure of exploration, we also present the size of the game graph explored during episodes (red curve in the figure on the right). It shows an interesting effect, which can be interpreted as a transition from exploration to exploitation at approximately the 40% win-rate mark or, equivalently, after about 10,000 games.
Learning curve (left axis) and size of the explored graph of Sokoban states (right axis). The shape of the latter plot may indicate a gradual switch from exploration to exploitation. The results are averaged over 10 random seeds (5 for experiments with the learned model); shaded areas show 95% confidence intervals.
Single-board Sokoban: learning to solve a single board
In this experiment we measure the extent to which our method can plan and learn on single boards. We note that this setting differs substantially from the previous one. In multiple-board Sokoban there is a possibility of generalization from easier to harder scenarios (e.g. MCTS with a randomly initialized value network already solves ≈ 0.7% of boards). Such transfer is not possible in the single-board case studied here. On the other hand, the algorithm gathers multiple episodes from the same board, which enables the agent to explore the board's state space.
For single-board Sokoban we used experimental settings similar to those used for Toy Montezuma's Revenge. We observe that the setup with ensembles solves 73% of boards, compared to 50% for the standard training without ensembles (experiments with and without ensembles were performed on a common set of 250 randomly generated boards). The latter might seem surprisingly good, taking into account the sparse reward. This follows from loop avoidance: if during planning the agent finds itself in a situation from which it cannot reach a novel state, a negative value (set to -2 in our experiments) is backpropagated to the already-visited vertices. We speculate that this introduces a form of implicit exploration.
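A rough sketch of this loop-avoidance backup under our reading of the mechanism (all names are hypothetical): when a planning pass ends in an already-visited state, the -2 penalty is backed up along the visited vertices of the path.

```python
def backup_if_stuck(search_path, visited_states, value_sums, visit_counts, penalty=-2.0):
    """Sketch of the loop-avoidance mechanism described above.

    If a planning pass ends in a state with no novel successors (here: a leaf
    that was already visited), a fixed penalty is backpropagated along the
    vertices of the search path in MCTS style. `search_path` is assumed to be
    the list of states traversed in the current pass; `value_sums` and
    `visit_counts` are the running backup statistics.
    """
    leaf = search_path[-1]
    if leaf in visited_states:                 # the pass failed to reach a novel state
        for state in search_path:
            value_sums[state] = value_sums.get(state, 0.0) + penalty
            visit_counts[state] = visit_counts.get(state, 0) + 1
```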
In the single-board Sokoban case we also performed experiments on (8, 8) boards with 4 boxes, obtaining a success rate of 94% compared to 64% for the standard non-ensemble setting.
An example of a solution found by our system.
Sokoban: transfer and learnable ensembles
Generating each new board can be seen as a cost dimension alongside sample complexity; this situation arises quite naturally in meta-learning problems. We tested how value functions learned on a small number of boards perform on new, previously unseen ones. We used the following protocol: we trained value functions on a fixed set of 10 boards. To ease the training, we used relabelling akin to Andrychowicz et al. (2017). We then evaluated these value functions on other boards. It turns out that they are typically quite weak, which is not very surprising, as solving 10 boards does not give much chance to infer 'the general' solution. Next, we used ensembles of these value functions. More precisely, we calculated the values of n=2,3 models and aggregated them using a small neural network with one fully connected hidden layer. This network is learnable and trained using the standard setup.

We observe that the quality increases with the number of value functions in the ensemble, as summarized in the table on the right. We observed high variability of the results over seeds, which is to be expected, as Sokoban levels vary significantly in difficulty. We also observe that the maximal transfer results increase with the number of value functions, being approximately 10%, 11%, and 12%. This further supports the claim that ensembling may lead to improved performance. In the 5-layer experiment we use a network with 5 hidden convolutional layers; we compare this with an analogous 4-layer network. In the latter case, we obtain a weaker result. We speculate this might be due to the fact that larger networks generalize better, see e.g. Cobbe et al. (2019).
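The sketch below shows one way such a learnable aggregation could look: the n frozen value functions are evaluated on a state and their outputs are combined by a small network with a single fully connected hidden layer. The use of PyTorch and the hidden size are our assumptions.

```python
import torch
import torch.nn as nn

class LearnableEnsemble(nn.Module):
    """Sketch of the learnable aggregation described above: n frozen value
    functions are evaluated on a state and their scalar outputs are combined
    by a small network with one fully connected hidden layer."""

    def __init__(self, value_functions, hidden_size=16):
        super().__init__()
        self.value_functions = value_functions          # n frozen callables: state -> float
        self.combine = nn.Sequential(
            nn.Linear(len(value_functions), hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, state):
        values = torch.tensor([[vf(state) for vf in self.value_functions]],
                              dtype=torch.float32)      # shape (1, n)
        return self.combine(values).squeeze()           # aggregated value estimate
```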
Voting as Empirical Posterior Maximization
Planning with imperfect models
In the Sokoban section we presented results of planning with a learned model. In this particular case, we were able to acquire a very good model from random trajectories (and some tricks). Such a situation is rarely the case, though: being able to use imperfect models is a central topic of model-based RL.
In future work, we plan to extend our method to handle learned models, possibly akin to Dyna-style training. In preliminary experiments, we artificially perturbed the perfect model of Sokoban (in the case of 8x8 boards with one box) to check the planner's robustness to model errors; see the figure.
We observe that 1% or 2% of errors applied to observations slows learning but probably does not have much impact on asymptotic performance, while 10% of errors significantly impedes learning. We do not consider this a major problem, though: once the planner achieves a decent level of efficiency, it can be used to generate better data for model training. Alternatively, various ways of mitigating model errors can be used; possibly the most relevant to this work would be ensemble-based techniques.
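For illustration, below is a sketch of the kind of perturbation used in this experiment, assuming board-like binary observations; the exact corruption scheme in our experiments may differ.

```python
import numpy as np

def perturb_observation(observation, error_rate=0.01, rng=None):
    """Sketch of the artificial model-error experiment: each entry of a
    binary, board-like observation produced by the perfect model is flipped
    independently with probability `error_rate` (1%, 2% or 10% above)."""
    if rng is None:
        rng = np.random.default_rng()
    observation = np.asarray(observation)
    flip = rng.random(observation.shape) < error_rate
    return np.where(flip, 1 - observation, observation)
```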