Reinforcement learning (RL) provides a framework for learning goal-directed policies given user-specified rewards. However, since designing rewards often requires substantial engineering effort, we are interested in the problem of learning without rewards, where agents must discover useful behaviors in the absence of task-specific incentives. Intrinsic motivation is a family of unsupervised RL techniques that develop general objectives for an RL agent to optimize, leading to better exploration or the discovery of skills. In this paper, we propose a new unsupervised RL technique based on an adversarial game which pits two policies against each other to compete over the amount of surprise an RL agent experiences. The two policies take turns controlling the agent. The Explore policy maximizes entropy, putting the agent into surprising or unfamiliar situations. Then, the Control policy takes over and seeks to recover from those situations by minimizing entropy. The game harnesses the power of multi-agent competition to drive the agent to seek out increasingly surprising parts of the environment while learning to gain mastery over them. We show empirically that our method leads to the emergence of complex skills, which manifest as clear phase transitions. Furthermore, we show both theoretically---via a latent state space coverage argument---and empirically that our method has the potential to be applied to the exploration of stochastic, partially-observed environments. We show that Adversarial Surprise learns more complex behaviors and explores more effectively than competitive baselines, outperforming intrinsic motivation methods based on active inference (SMiRL), novelty-seeking (Random Network Distillation, RND), and multi-agent unsupervised RL (Asymmetric Self-Play, ASP) in MiniGrid, Atari, and VizDoom environments.
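To make the adversarial game concrete, the sketch below shows one possible way to structure a single Adversarial Surprise episode: the two players alternate fixed-length phases of control, surprise is measured as the negative log-likelihood of each observation under a density model maintained within the episode, and the intrinsic reward is zero-sum. This is a minimal illustration, not the paper's implementation; the gym-style env, the explore_policy/control_policy callables, the RunningGaussian model, and the phase_len/num_phases parameters are all assumptions made for this sketch.

```python
# Minimal sketch of an Adversarial Surprise episode (illustrative, not the
# authors' code). Surprise is taken to be the negative log-likelihood of the
# current observation under a running diagonal Gaussian fit within the episode;
# the paper's density model and phase schedule may differ.
import numpy as np


class RunningGaussian:
    """Online diagonal Gaussian over flattened observation vectors."""

    def __init__(self, dim, eps=1e-4):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = eps  # small prior count to avoid division by zero

    def update(self, obs):
        x = np.asarray(obs, dtype=float).ravel()
        self.count += 1.0
        delta = x - self.mean
        self.mean += delta / self.count
        self.var += (delta * (x - self.mean) - self.var) / self.count

    def surprise(self, obs):
        # Negative log-likelihood of obs (up to an additive constant).
        x = np.asarray(obs, dtype=float).ravel()
        return 0.5 * float(np.sum(np.log(self.var + 1e-8)
                                  + (x - self.mean) ** 2 / (self.var + 1e-8)))


def adversarial_surprise_episode(env, explore_policy, control_policy,
                                 phase_len=64, num_phases=4):
    """Alternate Explore/Control phases; return each player's intrinsic return."""
    obs = env.reset()
    model = RunningGaussian(dim=np.asarray(obs).size)
    returns = {"explore": 0.0, "control": 0.0}
    for phase in range(num_phases):
        exploring = (phase % 2 == 0)  # Explore acts first, then Control recovers
        policy = explore_policy if exploring else control_policy
        for _ in range(phase_len):
            action = policy(obs)
            obs, _, done, _ = env.step(action)  # the external reward is ignored
            s = model.surprise(obs)
            model.update(obs)
            # Zero-sum intrinsic reward: Explore is paid for surprise,
            # Control is paid for the absence of it.
            if exploring:
                returns["explore"] += s
            else:
                returns["control"] -= s
            if done:
                return returns
    return returns
```

Because the reward is zero-sum, any progress by the Explore player in finding surprising situations automatically raises the bar for the Control player, which is the mechanism described above for driving the agent toward increasingly surprising parts of the environment while it learns to master them.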
ProcGen MiniGrid
(Videos: Adversarial Surprise, RND, SMiRL)
Adversarial Surprise explores the underlying environment more fully, whereas RND becomes distracted by the flashing lights (noisy TVs) and SMiRL stays in the nearest niche (a dark room) it finds.
Space Invaders
(Videos: Adversarial Surprise, RND, SMiRL)
Optimizing intrinsic reward alone, the three algorithms learn different behaviors. SMiRL minimizes entropy by staying as close as possible to the wall where it spawned, and consequently shoots fewer aliens. By maximizing entropy, RND moves around more and shoots more aliens, but is more prone to dying. Adversarial Surprise learns to balance exploration with staying safe.
Due to rendering issues in the Space Invaders environment, shots are sometimes not displayed even though targets are clearly hit.
Freeway
(Videos: Adversarial Surprise, RND, SMiRL)
To escape the high-entropy traffic, SMiRL learns to cross all the way to the other side of the road. RND also learns to cross the highway, but is more often distracted in the middle by the surprising traffic. Without any external reward, Adversarial Surprise learns to cross the road more reliably than RND.
Assault
(Videos: Adversarial Surprise, RND, SMiRL)
Adversarial Surprise learns a policy that seeks out safe spots in the environment, allowing the agent to stay alive longer while still exploring and shooting, which yields a higher environment reward.
Berzerk
(Videos: Adversarial Surprise, RND, SMiRL)
RND is more prone to hitting the walls and dying, and SMiRL will not go out of its way to shoot any robots. Adversarial Surprise finds a balance between shooting enemies and staying safe.