Surprise-Adaptive Intrinsic Motivation for Unsupervised Reinforcement Learning

Reinforcement Learning Conference 2024

Adriana Hugessen*


Roger Creus Castanyer*

Faisal Mohamed*


Glen Berseth


Abstract


Both entropy-minimizing and entropy-maximizing (curiosity) objectives for unsupervised reinforcement learning (RL) have been shown to be effective in different environments, depending on the environment’s level of natural entropy. However, neither method alone results in an agent that will consistently learn intelligent behavior across environments. In an effort to find a single entropy-based method that encourages emergent behaviors in any environment, we propose an agent that can adapt its objective online depending on the entropy conditions it encounters, by framing the choice as a multi-armed bandit problem. We devise a novel intrinsic feedback signal for the bandit, which captures the agent’s ability to control the entropy in its environment. We demonstrate that such agents can learn to control entropy, exhibit emergent behaviors in both high- and low-entropy regimes, and learn skillful behaviours across MinAtar tasks.


(left) The Butterflies and (right) Maze environments. S-Min trains the agent to actively catch the butterflies in order to prevent diverse state configurations, but at the same time it prevents the agent from navigating the Maze. S-Max trains the agent to avoid catching butterflies while navigating the Maze efficiently. These two didactic environments show that current intrinsic objectives fail to provide a generally useful objective for RL agents and cannot adapt across environments.

Our proposed method (S-Adapt) learns emergent behaviours in multiple environments without access to task rewards.

Environments shown: Butterflies (large), Maze (small), Breakout, Freeway.

S-Min: learns to induce a low-entropy state distribution by not moving.

S-Max: learns to move up and down to induce a high-entropy state distribution.

S-Adapt: learns a behaviour similar to S-Max but, interestingly, also achieves high task rewards.

Entropy and Surprise

The notion of surprise derives from optimizing the entropy of the state marginal distribution under the policy. Given a parametric estimate $p_{\theta}$ of this state marginal distribution, we can upper-bound the sum of the state entropies across a trajectory by the expected negative log-likelihood under the model:

$$\sum_{t=0}^{T} \mathcal{H}\left(d_t^{\pi}\right) \leq -\sum_{t=0}^{T} \mathbb{E}_{s_t \sim d_t^{\pi}}\!\left[\log p_{\theta_t}(s_t)\right],$$

where $d_t^{\pi}$ is the state distribution at time $t$ under policy $\pi$ and $\theta_t$ are the model parameters fit to the states visited so far.

We can see that minimizing this sum of state entropies over a trajectory corresponds to an RL agent with the reward function

$$r^{\text{min}}(s_t) = \log p_{\theta_{t-1}}(s_t),$$

and maximizing this objective corresponds to an RL agent with the reward function

$$r^{\text{max}}(s_t) = -\log p_{\theta_{t-1}}(s_t).$$
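As a concrete illustration, below is a minimal sketch of how such intrinsic rewards could be computed from an online density model fit to the states visited so far in the episode. The independent-Bernoulli model and all names here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np


class BernoulliStateModel:
    """Illustrative density model: independent Bernoulli per (binary) state
    dimension, updated online with the states visited so far."""

    def __init__(self, state_dim, eps=1e-3):
        self.counts = np.zeros(state_dim)  # per-dimension "on" counts
        self.n = 0                         # number of states seen
        self.eps = eps                     # smoothing to avoid log(0)

    def update(self, state):
        self.counts += state
        self.n += 1

    def log_prob(self, state):
        p = (self.counts + self.eps) / (self.n + 2 * self.eps)
        return float(np.sum(state * np.log(p) + (1 - state) * np.log(1 - p)))


def intrinsic_reward(model, state, mode):
    """r_min = log p_theta(s_t), r_max = -log p_theta(s_t)."""
    logp = model.log_prob(state)
    return logp if mode == "min" else -logp
```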

Our Method: Surprise-Adaptive Bandit


We propose a multi-armed bandit approach for selecting between minimizing or maximizing surprise. Precisely, at the start of each episode, we select an arm from the bandit according to the UCB algorithm (Lai et al., 1985), which determines whether the agent will receive rewards according to S-Min or S-Max during the upcoming episode. The bandit receives feedback on its selection at the end of each episode.
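A minimal sketch of this episode-level selection, assuming a standard UCB1-style rule over two arms (arm 0 = S-Min, arm 1 = S-Max); the class name and exploration coefficient `c` are hypothetical:

```python
import numpy as np


class UCBBandit:
    """Two-armed bandit over objectives: arm 0 = S-Min, arm 1 = S-Max."""

    def __init__(self, n_arms=2, c=1.0):
        self.counts = np.zeros(n_arms)  # times each arm was pulled
        self.values = np.zeros(n_arms)  # running mean feedback per arm
        self.c = c                      # exploration coefficient (assumed)

    def select_arm(self):
        # Pull each arm once before applying the UCB rule.
        if np.any(self.counts == 0):
            return int(np.argmin(self.counts))
        t = self.counts.sum()
        ucb = self.values + self.c * np.sqrt(np.log(t) / self.counts)
        return int(np.argmax(ucb))

    def update(self, arm, feedback):
        self.counts[arm] += 1
        self.values[arm] += (feedback - self.values[arm]) / self.counts[arm]
```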


The key question is how to provide feedback to the bandit given access only to intrinsic rewards. We propose a feedback mechanism grounded in the observation that the general goal of both surprise minimization and surprise maximization is for the agent to effect a change in the level of surprise it experiences. In a low-entropy environment, the agent can best effect change by increasing entropy, and vice versa. Hence, the bandit should receive feedback that reflects this agency. We propose using the absolute percent difference between the entropy of the state marginal distribution at the end of the m-th episode and that of a random agent in the same environment.
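Concretely, the end-of-episode feedback could look like the following sketch, where `episode_entropy` is the entropy of the state-marginal estimate at the end of episode m and `random_agent_entropy` is a baseline measured once from a random policy in the same environment; the names and the smoothing constant are assumptions for illustration.

```python
def bandit_feedback(episode_entropy: float,
                    random_agent_entropy: float,
                    eps: float = 1e-8) -> float:
    """Absolute percent difference between the agent's end-of-episode
    state-marginal entropy and that of a random agent."""
    return abs(episode_entropy - random_agent_entropy) / (abs(random_agent_entropy) + eps)


# Per-episode loop (sketch): choose an objective, run the episode with the
# corresponding intrinsic reward, then update the bandit with the feedback.
# arm = bandit.select_arm()                      # 0 -> S-Min, 1 -> S-Max
# ... collect an episode using r_min or r_max ...
# bandit.update(arm, bandit_feedback(H_end, H_random))
```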

Experiments and Analysis

First, we consider how well our agents are able to control entropy across the didactic environments, Maze, Butterflies and Tetris. As expected, the S-Min agent achieves the lowest or near-lowest entropy in all environments, while the S-Max agent achieves the highest or near-highest entropy in all environments.

S-Min, S-Adapt, and the Extrinsic agent solve the game of Tetris (i.e., they consistently survive for more than 500 steps). Interestingly, the surprise-minimizing objective, which S-Adapt converges to, turns out to be a better learning signal than the row-clearing extrinsic reward: the learned policies are more stable and the average episodic surprise is the lowest.

S-Max and S-Adapt are the only objectives that allow the RL agents to consistently find the goal in the maze.

S-Min: by minimizing the entropy of the state-marginal distribution, surprise-minimizing RL agents can solve Tetris.

S-Max: novelty-seeking objectives (e.g. curiosity-based) aim to induce high-entropy state-marginal distributions and are therefore not aligned with the game of Tetris.

(ours) S-Adapt: our method identifies that learning surprise-minimizing behaviours in Tetris yields a larger difference in the entropy of the state-marginal distribution relative to that of a random agent.

Also notable is that, in every environment, the extrinsic reward correlates closely with one of these two behaviors. This suggests that these environments are good candidates for entropy-based control to elicit emergent behaviors. Importantly, however, no single objective correlates well with the extrinsic reward across all environments: in Maze, S-Max achieves high rewards, while in Butterflies and Tetris, S-Min does.

Conclusion

Our experiments demonstrate encouraging results for a surprise-adaptive agent. The S-Adapt agent selects the objective under which it has more control over entropy, across both the didactic and MinAtar environments. Moreover, the S-Adapt agent inherits the emergent behaviors of the single-objective agents and, in certain instances, even exhibits unique emergent behaviors due to its adaptive combination of entropy objectives. Further work is needed to understand exactly under what conditions such emergent behaviors manifest and how to elicit them more reliably. An interesting extension of this work would be to apply an adaptive agent in the continual learning setting, where adaptation can occur at any time rather than only at episode boundaries.