Surprise Minimization in Reinforcement Learning

All living organisms struggle against the forces of nature to carve out niches where they can maintain relative stasis. We propose that such a search for order amidst chaos might offer a unifying principle for the emergence of useful behaviors in artificial agents.

We formalize this idea into an unsupervised reinforcement learning method called surprise minimizing RL (SMiRL). SMiRL trains an agent with the objective of maximizing the probability of observed states under a density model trained on all previously seen states.
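
Concretely, the per-step reward is the log-likelihood of the current state under a density model fit to the history of visited states. Below is a minimal sketch using an independent-Bernoulli model over binary state vectors (appropriate for Tetris boards); all names are illustrative, not the project's actual API.

```python
import numpy as np

class BernoulliSMiRL:
    """Minimal SMiRL sketch: reward each state by its log-likelihood under
    an independent-Bernoulli model fit to all previously seen states."""

    def __init__(self, state_dim, eps=1e-4):
        self.counts = np.zeros(state_dim)  # per-dimension "on" counts
        self.n = 0                         # number of states seen so far
        self.eps = eps                     # keeps probabilities away from 0/1

    def reward(self, state):
        # Laplace-smoothed per-dimension Bernoulli means from the buffer.
        theta = np.clip((self.counts + 1) / (self.n + 2), self.eps, 1 - self.eps)
        # SMiRL reward: r_t = log p_theta(s_t) under the fitted model.
        return float(np.sum(state * np.log(theta) + (1 - state) * np.log(1 - theta)))

    def update(self, state):
        # Fold the newly observed state into the model's dataset.
        self.counts += state
        self.n += 1
```

At each environment step the agent receives `reward(s_t)` and the model is then updated with `s_t`, so staying in familiar, predictable states is what maximizes return.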

The resulting agents acquire proactive behaviors that seek out and maintain stable states, such as balancing and damage avoidance. These behaviors are closely tied to the affordances of the environment and its prevailing sources of entropy: winds, earthquakes, and other agents.

We demonstrate that our surprise-minimizing agents can successfully play Tetris and Doom and control a humanoid to avoid falls, all without any task-specific reward supervision. We further show that SMiRL can be used as an unsupervised pre-training objective that substantially accelerates subsequent reward-driven learning.


The code for the project can be found here.


Emergent Behaviors: policies learned with SMiRL and no external rewards.

tetris-surprise.mp4

Tetris

ViZDoom (HoldTheLine)

cliff_surpise_VAE_6_v3_rewardViz.mp4

Cliff VAE

treadmill_surpise_VAE_6_v3_rewardViz.mp4

Treadmill VAE

SMiRL Analysis

We can plot the agent's belief about the board as it plays Tetris.

belief.mp4
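
Under the Bernoulli sketch above, the belief shown in the video is just the vector of per-cell occupancy probabilities. A hypothetical rendering with Matplotlib; the 20x10 board shape is an assumption:

```python
import matplotlib.pyplot as plt

def plot_belief(model, board_shape=(20, 10)):
    # The agent's belief is the per-cell Bernoulli mean; reshaping it to
    # the board layout gives a heatmap of expected occupancy.
    theta = (model.counts + 1) / (model.n + 2)
    plt.imshow(theta.reshape(board_shape), cmap="gray", vmin=0.0, vmax=1.0)
    plt.colorbar(label="P(cell occupied)")
    plt.title("SMiRL belief over the Tetris board")
    plt.show()
```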

We can also visualize the SMiRL reward as the agent plays in the ViZDoom environment.

vizdoom-rews.mp4
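
For image observations such as ViZDoom frames, a density model is hard to fit directly; the VAE-labeled results on this page use a learned latent representation instead. A sketch of that variant, assuming an `encode` function that maps frames to latent vectors (illustrative, not the project API):

```python
import numpy as np

class GaussianLatentSMiRL:
    """SMiRL over a learned representation: encode each observation,
    keep the latents seen so far, and score new latents under a
    diagonal Gaussian fit to that history."""

    def __init__(self, encode):
        self.encode = encode  # assumed: raw observation -> latent vector
        self.latents = []

    def reward(self, obs):
        z = self.encode(obs)
        zs = np.asarray(self.latents) if self.latents else z[None]
        mu, sigma = zs.mean(axis=0), zs.std(axis=0) + 1e-3
        # Log-density of z under the diagonal Gaussian fit to past latents.
        return float(-0.5 * np.sum(((z - mu) / sigma) ** 2
                                   + np.log(2 * np.pi * sigma ** 2)))

    def update(self, obs):
        self.latents.append(self.encode(obs))
```

The reward trace in the video above is this kind of per-step log-likelihood, logged over the course of an episode.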

Applications of SMiRL

Imitation Learning

Learned policy for lower left block imitation.

Left: goal state. Right: SMiRL

tetris-imitation-rectangle.mp4

Learned policy for checkerboard imitation.

Left: goal state. Right: SMiRL

tetris-imitation-checkerboard.mp4
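
One way to read the imitation results above: if the model's dataset is seeded with copies of a goal state before training, states resembling the goal have high likelihood from the start, so minimizing surprise drives the agent toward the goal. A hypothetical sketch reusing the `BernoulliSMiRL` class from earlier:

```python
def make_imitation_model(goal_state, copies=100):
    # Pre-fill the density model with the goal observation so that the
    # SMiRL reward is maximized by matching (and holding) the goal state.
    model = BernoulliSMiRL(state_dim=goal_state.size)
    for _ in range(copies):
        model.update(goal_state)
    return model
```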

SMiRL as a Stability Reward

ViZDoom "Defend the Line" scenario trained with SMiRL and a living reward.

vizdoom-dtl-surprise-joint.mp4

Biped trained with only the forward reward.

biped_just_forward_eval_0_fixed_cam.mp4

Biped trained with the forward reward + SMiRL reward on demonstrations.

SMiRL_bonus_eval_0_fixed_cam.mp4
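
The joint-training results above add the SMiRL log-likelihood as a bonus on top of the task reward (the living reward in "Defend the Line", the forward reward for the biped). A minimal sketch of that combination; the weight `lam` is illustrative:

```python
def combined_reward(task_reward, smirl_model, state, lam=0.1):
    # Total reward = environment task reward + weighted SMiRL stability bonus.
    r = task_reward + lam * smirl_model.reward(state)
    smirl_model.update(state)
    return r
```
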
treadmill_surpise_ICM_v3_rewardViz.mp4

Treadmill SMiRL + ICM

pedistal_surpise_v3_rewardViz.mp4

Pedestal VAE

SMiRL Exploration

MiniGrid

Entropy Comparison on Atari

The RND-trained policies on the right often have higher entropy than the SMiRL-trained policies on the left.
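
To make "higher entropy" concrete, a simple diagnostic is the Shannon entropy of the empirical distribution over visited states in a rollout. A sketch (discretizing states into hashable tuples is an assumption):

```python
from collections import Counter
import numpy as np

def state_entropy(states):
    # Shannon entropy of the empirical visited-state distribution; lower
    # values mean the policy keeps the agent in fewer, more predictable
    # configurations, which is what SMiRL encourages.
    counts = Counter(map(tuple, states))
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log(p)).sum())
```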

Assault SMiRL

Assault RND

Assault_SMIRL.mp4
Assault_RND.mp4

Berzerk SMiRL

Berzerk RND

Berzerk_SMiRL.mp4
Berzerk_RND.mp4

Carnival SMiRL

Carnival RND

Carnival_SMiRL.mp4
Carnival_RND.mp4

Montezuma's Revenge SMiRL

Montezuma's Revenge RND

Mont_SMiRL.mp4
Mont_RND.mp4

RiverRaid SMiRL

RiverRaid RND

RiverRade_SMiRL.mp4
RiverRaid_RND.mp4

SpaceInvaders SMiRL

SpaceInvaders RND

SpaceInvaders_SMIRL.mp4
SpaceInvaders_RND.mp4