Surprise Minimization in Reinforcement Learning
All living organisms struggle against the forces of nature to carve out niches where they can maintain relative stasis. We propose that such a search for order amidst chaos might offer a unifying principle for the emergence of useful behaviors in artificial agents.
We formalize this idea into an unsupervised reinforcement learning method called surprise minimizing RL (SMiRL). SMiRL trains an agent with the objective of maximizing the probability of observed states under a model trained on all previously seen states.
The resulting agents acquire several proactive behaviors, such as balancing and damage avoidance, that seek out and maintain stable states; these behaviors are closely tied to the affordances of the environment and its prevailing sources of entropy, such as winds, earthquakes, and other agents.
We demonstrate that our surprise minimizing agents can successfully play Tetris, Doom, and control a humanoid to avoid falls, without any task-specific reward supervision. We further show that SMiRL can be used as an unsupervised pre-training objective that substantially accelerates subsequent reward-driven learning.
The code for the project can be found here.
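To make the objective concrete, below is a minimal sketch of the SMiRL reward computation. It assumes, purely for illustration, a density model that is a product of independent Bernoullis over binary observation features (a natural choice for Tetris boards); richer models, such as the VAEs used in several of the experiments below, can be substituted.

```python
import numpy as np

class BernoulliSMiRL:
    """Minimal SMiRL sketch (an illustrative assumption, not the exact
    implementation): the density model is a product of independent
    Bernoullis over binary observation features, refit via sufficient
    statistics on the buffer of all states seen so far."""

    def __init__(self, obs_dim, eps=1e-4):
        self.counts = np.zeros(obs_dim)  # per-feature "on" counts
        self.n = 0                       # number of states in the buffer
        self.eps = eps                   # smoothing so log never sees 0 or 1

    def reward(self, obs):
        # r_t = log p_theta(s_t), with theta fit to all *previously* seen states
        theta = (self.counts + self.eps) / (self.n + 2.0 * self.eps)
        return float(np.sum(obs * np.log(theta) + (1.0 - obs) * np.log(1.0 - theta)))

    def update(self, obs):
        # append the new state to the buffer by updating sufficient statistics
        self.counts += obs
        self.n += 1
```

At each environment step the agent is rewarded with `model.reward(obs)` and the model is then updated with `model.update(obs)`; any standard RL algorithm can be used to maximize this reward.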
Emergent Behaviors: policies learned with SMiRL without external rewards.
Tetris
ViZDoom (HoldTheLine)
Cliff VAE
Treadmill VAE
SMiRL Analysis
We can plot the agent's belief about the board, i.e. the parameters of its learned state density model, as it plays Tetris.
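As a hypothetical illustration of how such a plot can be produced with the Bernoulli sketch above (the board dimensions and helper name are assumptions):

```python
import matplotlib.pyplot as plt

def plot_board_belief(model, rows=20, cols=10):
    # Render the density model's per-cell occupancy probabilities as an
    # image of the board; `model` is the BernoulliSMiRL sketch above.
    theta = (model.counts + model.eps) / (model.n + 2.0 * model.eps)
    plt.imshow(theta.reshape(rows, cols), cmap="gray_r", vmin=0.0, vmax=1.0)
    plt.colorbar(label="P(cell occupied)")
    plt.title("Agent's belief over board occupancy")
    plt.show()
```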
We can also visualize the SMiRL reward as the agent plays in the ViZDoom environment.
Applications of SMiRL
Imitation Learning
Learned policy for lower-left block imitation.
Left: goal state. Right: SMiRL
Learned policy for checkerboard imitation.
Left: goal state. Right: SMiRL
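One way to adapt SMiRL for imitation, sketched here under the same Bernoulli-model assumptions as above, is to seed the density model's buffer with copies of the goal observation before training; states that match the goal then have high likelihood, so surprise minimization drives the agent to reproduce the goal image. The helper and its `weight` parameter are hypothetical.

```python
def seed_buffer_with_goal(model, goal_obs, weight=100):
    # Pretend the goal has already been "seen" `weight` times, so the
    # model's belief starts centered on the goal state.
    model.counts += weight * goal_obs
    model.n += weight
```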
SMiRL as a stability reward
ViZDoom "Defend the Line" scenario trained with SMiRL and a living reward
Biped trained with just forward reward
Biped trained with forward reward + SMiRL reward on demonstrations
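A sketch of how such a combined objective can be formed, assuming the Bernoulli model above and a hypothetical weighting coefficient `alpha`:

```python
def combined_reward(task_reward, model, obs, alpha=1.0):
    # Task reward (e.g., forward progress or a living bonus) plus an
    # alpha-weighted surprise-minimization bonus that penalizes
    # high-entropy outcomes such as falling.
    r = task_reward + alpha * model.reward(obs)
    model.update(obs)
    return r
```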
Treadmill SMiRL + ICM
Pedestal VAE
SMiRL Exploration
MiniGrid
Entropy Comparison on Atari
The RND-trained policies on the right often have higher entropy than the SMiRL-trained policies on the left.
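A minimal sketch of one way such a comparison can be computed, assuming the metric is the mean Shannon entropy of each policy's action distribution over sampled states:

```python
import numpy as np

def mean_policy_entropy(action_probs):
    # action_probs: shape (num_states, num_actions), each row a policy's
    # action distribution at a sampled state; returns mean entropy in nats.
    p = np.clip(action_probs, 1e-8, 1.0)
    return float(np.mean(-np.sum(p * np.log(p), axis=-1)))
```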