Surprise Minimization in Reinforcement Learning

All living organisms struggle against the forces of nature to carve out niches where they can maintain relative stasis. We propose that such a search for order amidst chaos might offer a unifying principle for the emergence of useful behaviors in artificial agents.

We formalize this idea into an unsupervised reinforcement learning method called surprise minimizing RL (SMiRL). SMiRL trains an agent with the objective of maximizing the probability of observed states under a model trained on all previously seen states.
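The objective above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a diagonal Gaussian as the state density model, and the names `RunningGaussian` and `smirl_rewards` are hypothetical. Each state is scored under the model fit to all earlier states, then added to the model.

```python
import numpy as np


class RunningGaussian:
    """Diagonal Gaussian fit to every state seen so far (hypothetical helper)."""

    def __init__(self, dim, eps=1e-6):
        self.n = 0
        self.sum = np.zeros(dim)
        self.sum_sq = np.zeros(dim)
        self.eps = eps  # variance floor (an assumption) so log(var) stays finite

    def update(self, state):
        state = np.asarray(state, dtype=float)
        self.n += 1
        self.sum += state
        self.sum_sq += state ** 2

    def log_prob(self, state):
        state = np.asarray(state, dtype=float)
        n = max(self.n, 1)
        mean = self.sum / n
        var = np.maximum(self.sum_sq / n - mean ** 2, 0.0) + self.eps
        return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (state - mean) ** 2 / var))


def smirl_rewards(states):
    """Reward each state by its log-probability under the model of earlier states."""
    model = RunningGaussian(dim=len(states[0]))
    rewards = []
    for s in states:
        # The first state has no history to be surprised against; 0.0 is a placeholder.
        rewards.append(model.log_prob(s) if model.n > 0 else 0.0)
        model.update(s)
    return rewards
```

Under these assumptions, a state that resembles the agent's history earns a high reward, while an unfamiliar state earns a very low one, which is what pushes the policy toward stable, repeatable states.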

The resulting agents acquire proactive behaviors for seeking and maintaining stable states, such as balancing and damage avoidance. These behaviors are closely tied to the affordances of the environment and its prevailing sources of entropy, such as winds, earthquakes, and other agents.

We demonstrate that our surprise-minimizing agents can successfully play Tetris and Doom, and can control a humanoid to avoid falls, all without any task-specific reward supervision. We further show that SMiRL can be used as an unsupervised pre-training objective that substantially accelerates subsequent reward-driven learning.


Emergent Behaviors: policies learned with SMiRL without an external reward.

tetris-surprise.mp4

Tetris

ViZDoom

cliff_surpise_VAE_6_v3_rewardViz.mp4

Cliff VAE

treadmill_surpise_VAE_6_v3_rewardViz.mp4

Treadmill VAE

SMiRL Analysis

We can plot the agent's belief about the board as it plays Tetris:

belief.mp4
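One way such a belief could be computed is sketched below. The page does not state how the belief is parameterized, so this assumes an independent Bernoulli per board cell, estimated from the boards seen so far with a small smoothing prior. The function names and the `prior` parameter are hypothetical.

```python
import numpy as np


def bernoulli_belief(boards, prior=1.0):
    """Per-cell probability that a cell is filled, from all boards seen so far.

    `prior` is an assumed smoothing count that keeps every probability
    strictly between 0 and 1.
    """
    boards = np.asarray(boards, dtype=float)
    n = boards.shape[0]
    return (boards.sum(axis=0) + prior) / (n + 2.0 * prior)


def board_log_prob(board, belief):
    """Log-probability of one binary board under the per-cell belief."""
    board = np.asarray(board, dtype=float)
    return float(np.sum(board * np.log(belief) + (1.0 - board) * np.log(1.0 - belief)))
```

Plotting the array returned by `bernoulli_belief` as a heat map at each step would give a picture of which cells the agent expects to be filled.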

We can also visualize the SMiRL reward as the agent plays in the ViZDoom environment:

vizdoom-rews.mp4

Applications of SMiRL

Imitation Learning

Learned policy for lower left block imitation.

Left: goal state. Right: SMiRL

tetris-imitation-rectangle.mp4

Learned policy for checkerboard imitation.

Left: goal state. Right: SMiRL

tetris-imitation-checkerboard.mp4

SMiRL as a stability reward

ViZDoom "Defend the Line" scenario trained with SMiRL and a living reward

vizdoom-dtl-surprise-joint.mp4
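The combinations in this section (SMiRL with a living reward, a forward reward plus a SMiRL reward) can be read as adding the SMiRL term to a task reward. The sketch below assumes a simple weighted sum; the function name and the weight `alpha` are hypothetical, and the weighting actually used is not given on this page.

```python
def combined_reward(task_reward, smirl_reward, alpha=0.1):
    """Task reward plus a weighted SMiRL stability bonus (assumed form)."""
    return task_reward + alpha * smirl_reward


# Example: per-step rewards for a short, made-up trajectory.
task = [1.0, 1.0, 1.0]
smirl = [-0.5, -0.2, -4.0]  # a low value marks a surprising state
total = [combined_reward(t, s, alpha=0.5) for t, s in zip(task, smirl)]
```

With this form, a step that keeps the task reward but lands in an unfamiliar state is penalized, which is one way to interpret SMiRL acting as a stability reward.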

Biped trained with just forward reward

biped_just_forward_eval_0_fixed_cam.mp4

Biped trained with forward reward + SMiRL reward on demonstrations

SMiRL_bonus_eval_0_fixed_cam.mp4

treadmill_surpise_ICM_v3_rewardViz.mp4

Treadmill SMiRL + ICM

pedistal_surpise_v3_rewardViz.mp4

Pedestal VAE

SMiRL Exploration

miniGrid