Adriana Hugessen*
Roger Creus Castanyer*
Faisal Mohamed*
Glen Berseth
Abstract
Both entropy-minimizing and entropy-maximizing (curiosity) objectives for unsupervised reinforcement learning (RL) have been shown to be effective in different environments, depending on the environment's level of natural entropy. However, neither method alone results in an agent that will consistently learn intelligent behaviour across environments. In an effort to find a single entropy-based method that will encourage emergent behaviours in any environment, we propose an agent that can adapt its objective online, depending on the entropy conditions it encounters, by framing the choice as a multi-armed bandit problem. We devise a novel intrinsic feedback signal for the bandit, which captures the agent's ability to control the entropy in its environment. We demonstrate that such agents can learn to control entropy and exhibit emergent behaviours in both high- and low-entropy regimes, and can learn skillful behaviours across MinAtar tasks.
(left) The Butterflies and (right) Maze environments. S-Min trains the agent to actively catch the butterflies in order to prevent diverse state configurations, but at the same time it prevents the agent from navigating the Maze. S-Max trains the agent to avoid catching butterflies while navigating the Maze efficiently. These two didactic environments show that current intrinsic objectives fail to provide generally useful objectives for RL agents and cannot adapt.
Our proposed method (S-Adapt) learns emergent behaviours in multiple environments without access to task rewards.
[Figure: learned behaviours across Butterflies (large), Maze (small), Breakout, and Freeway.]
S-Min: learns to induce a low-entropy state distribution by not moving.
S-Max: learns to move up and down to induce a high-entropy state distribution.
S-Adapt: learns a behaviour similar to S-Max but, interestingly, also achieves high task rewards.
Entropy and Surprise
The notion of surprise derives from optimizing the entropy of the state marginal distribution induced by the policy. Given an estimate of this state marginal distribution, we can express an estimate of the sum of state entropies across a trajectory as the negative expected log-likelihood of the visited states under that estimate.
Minimizing this estimate of the summed state entropy over a trajectory then corresponds to an RL agent whose reward is the log-likelihood of each visited state under the current marginal estimate (S-Min), while maximizing it corresponds to a reward equal to the negative log-likelihood (S-Max); both are written out below.
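Written out, this follows the standard surprise-minimization formulation (SMiRL-style); the notation here, with $p_{\theta_t}$ the estimated state marginal after $t$ steps and $d_t^{\pi}$ the state distribution under the policy at step $t$, is an assumption rather than a transcription of the original poster:
\[
\sum_{t=0}^{T} \mathcal{H}\!\left(d_t^{\pi}\right) \;\le\; -\sum_{t=0}^{T} \mathbb{E}_{s_t \sim d_t^{\pi}}\!\left[\log p_{\theta_t}(s_t)\right],
\qquad
r_t^{\mathrm{min}}(s_{t+1}) = \log p_{\theta_t}(s_{t+1}),
\qquad
r_t^{\mathrm{max}}(s_{t+1}) = -\log p_{\theta_t}(s_{t+1}).
\]
The inequality is the usual cross-entropy bound: the cross-entropy between the true state distribution and the model estimate upper-bounds the entropy, so pushing the log-likelihood reward up or down pushes the entropy estimate down or up, respectively.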
Our Method: Surprise-Adaptive Bandit
We propose a multi-armed bandit approach for selecting between minimizing and maximizing surprise. Precisely, at the start of each episode, we select an arm from the bandit according to the UCB algorithm (Lai & Robbins, 1985), which determines whether the agent will receive rewards according to S-Min or S-Max during the upcoming episode. The bandit receives feedback on its selection at the end of each episode.
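As an illustration, here is a minimal sketch of the per-episode arm selection in Python; the class name, exploration coefficient, and two-arm layout are assumptions for exposition, not the authors' implementation.

import math

class SurpriseBandit:
    """Two-armed UCB bandit: arm 0 = S-Min rewards, arm 1 = S-Max rewards.
    An arm is selected at the start of every episode and updated with the
    end-of-episode feedback signal."""

    def __init__(self, c: float = 2.0):
        self.c = c                # exploration coefficient (assumed value)
        self.counts = [0, 0]      # number of episodes each arm has been used
        self.values = [0.0, 0.0]  # running mean feedback per arm
        self.total = 0            # total number of episodes so far

    def select_arm(self) -> int:
        # Try each arm once before applying the UCB rule.
        for arm in range(2):
            if self.counts[arm] == 0:
                return arm
        ucb = [self.values[arm]
               + self.c * math.sqrt(math.log(self.total) / self.counts[arm])
               for arm in range(2)]
        return max(range(2), key=lambda arm: ucb[arm])

    def update(self, arm: int, feedback: float) -> None:
        # Incremental mean update with the feedback received after the episode.
        self.counts[arm] += 1
        self.total += 1
        self.values[arm] += (feedback - self.values[arm]) / self.counts[arm]

At the start of episode m the agent would call select_arm() to fix the sign of its intrinsic reward for that episode, and call update(arm, feedback) once the episode ends, using the feedback signal described next.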
The key question is how to provide feedback to the bandit, given access only to intrinsic rewards. We propose a feedback mechanism grounded in the observation that the general goal in both surprise minimization and surprise maximization is for the agent to be able to effect a change in the level of surprise it experiences. In a low-entropy environment, the agent can best effect change by increasing entropy, and vice versa. Hence, the bandit should receive feedback that reflects this agency. We propose using the absolute percent difference between the entropy of the state marginal distribution at the end of the m-th episode and that of a random agent in the same environment.
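A minimal sketch of this feedback signal, assuming tabular state-visitation counts as a stand-in for the learned state-marginal estimate; the helper names are illustrative only.

import numpy as np

def empirical_entropy(state_counts: np.ndarray) -> float:
    """Shannon entropy of an empirical state-visitation distribution."""
    p = state_counts / state_counts.sum()
    p = p[p > 0]                     # drop unvisited states (0 log 0 = 0)
    return float(-(p * np.log(p)).sum())

def bandit_feedback(policy_counts: np.ndarray, random_counts: np.ndarray,
                    eps: float = 1e-8) -> float:
    """Absolute percent difference between the end-of-episode state-marginal
    entropy under the current policy and that of a random agent."""
    h_policy = empirical_entropy(policy_counts)
    h_random = empirical_entropy(random_counts)
    return abs(h_policy - h_random) / (abs(h_random) + eps)

The larger this quantity, the more the chosen objective has moved the agent's state-marginal entropy away from the random-agent baseline, which is exactly the notion of agency the bandit is meant to reward.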
Experiments and Analysis
First, we consider how well our agents are able to control entropy across the didactic environments: Maze, Butterflies, and Tetris. As expected, the S-Min agent achieves the lowest or near-lowest entropy in all environments, while the S-Max agent achieves the highest or near-highest entropy in all environments.
In Tetris, S-Min, S-Adapt, and the extrinsic-reward agent solve the game (i.e. consistently survive for more than 500 steps). Interestingly, the surprise-minimizing objective, to which S-Adapt converges, turns out to be a better learning signal than the row-clearing extrinsic reward, as the learned policies are more stable and the average episodic surprise is the lowest.
S-Max and S-Adapt are the only objectives that allow the RL agents to consistently find the goal in the maze.
S-Min: by minimizing the entropy of the state-marginal distribution, surprise-minimizing RL agents can solve Tetris.
S-Max: novelty-seeking objectives (e.g. curiosity-based) aim to induce high-entropy state-marginal distributions and are not aligned with the game of Tetris.
S-Adapt (ours): our method identifies that learning surprise-minimizing behaviours in Tetris allows for a larger difference in the entropy of the state-marginal distribution compared to that of a random agent.
Also notable is that the extrinsic rewards generally correlate closely with one of these two behaviours in every environment, which suggests that these environments are good candidates for entropy-based control to elicit emergent behaviours. Importantly, however, no single objective correlates well with the extrinsic reward across all environments: in Maze, S-Max achieves high rewards, while in Butterflies and Tetris, S-Min does.
Conclusion
Our experiments demonstrate encouraging results for a surprise-adaptive agent. The S-Adapt agent can select the objective with the more controllable entropy landscape across both the didactic environments and the MinAtar environments. Moreover, the S-Adapt agent inherits the emergent behaviours of the single-objective agents and even shows some unique emergent behaviours in certain instances, due to the complex and adaptive combination of entropy objectives. Further work is needed to understand exactly under what conditions such emergent behaviours can manifest, and how to elicit them more reliably. Finally, an interesting extension to this work would be to apply an adaptive agent in the continual learning setting, where adaptation can occur at any time, not only at the end of an episode.