Efficient Empowerment Estimation for Unsupervised Stabilization

Ruihan Zhao, Kevin Lu, Stas Tiomkin, Pieter Abbeel

UC Berkeley

[OpenReview] [Github] [Bibtex]

Abstract

Intrinsically motivated artificial agents learn advantageous behavior without externally-provided rewards. Previously, it was shown that maximizing mutual information between agent actuators and future states, known as the empowerment principle, enables unsupervised stabilization of dynamical systems at upright positions, which is a prototypical intrinsically motivated behavior for upright standing and walking. This follows from the coincidence between the objective of stabilization and the objective of empowerment. Unfortunately, sample-based estimation of this kind of mutual information is challenging. Recently, various variational lower bounds (VLBs) on empowerment have been proposed as solutions; however, they are often biased, unstable in training, and have high sample complexity. In this work, we propose an alternative solution based on a trainable representation of a dynamical system as a Gaussian channel, which allows us to efficiently calculate an unbiased estimator of empowerment by convex optimization. We demonstrate our solution for sample-based unsupervised stabilization on different dynamical control systems and show the advantages of our method by comparing it to the existing VLB approaches:

  • lower sample complexity

  • greater stability in training

  • preservation of the essential properties of the empowerment function

  • estimation of empowerment directly from images

Main Contribution

Latent Gaussian Channel Empowerment (Latent-GCE)


Intrinsically motivated RL with Latent-GCE consists of 4 main stages:

  • Embed states/observations and T-step action sequences into a latent space.

  • Fit a linear Gaussian forward dynamics model in that latent space.

  • Compute empowerment by water-filling (see the sketch after this list).

  • Update the control policy using empowerment as an intrinsic reward.
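
Below is a minimal NumPy sketch of stages 2 and 3, assuming the latent states and embedded action sequences are already available as arrays; the least-squares fit, the unit-variance noise, and the function names are illustrative assumptions rather than the exact implementation from the paper.

```python
import numpy as np

def fit_linear_gaussian_dynamics(z_t, a_seq, z_T):
    """Least-squares fit of a linear Gaussian channel z_T ~ A z_t + B a + noise.

    z_t:   (N, d_z) latent states
    a_seq: (N, d_a) embedded T-step action sequences
    z_T:   (N, d_z) latent states after T steps
    Returns the action-to-future-state matrix B, which defines the channel.
    """
    X = np.hstack([z_t, a_seq, np.ones((len(z_t), 1))])  # [state, action, bias]
    W, *_ = np.linalg.lstsq(X, z_T, rcond=None)
    B = W[z_t.shape[1]:-1]                                # rows acting on the action embedding
    return B.T                                            # shape (d_z, d_a)

def water_filling_empowerment(B, power=1.0, noise_var=1.0):
    """Capacity of the Gaussian channel z_T = B a + noise under a total power constraint.

    Water-filling distributes the power budget over the channel's singular
    directions; the resulting capacity (in nats) is the empowerment estimate.
    """
    s = np.linalg.svd(B, compute_uv=False)                # singular values, descending
    gains = s**2 / noise_var
    gains = gains[gains > 1e-12]
    # Find the water level mu by deactivating weak sub-channels one at a time.
    for k in range(len(gains), 0, -1):
        active = gains[:k]
        mu = (power + np.sum(1.0 / active)) / k
        if mu >= 1.0 / active[-1]:                        # all k active channels get non-negative power
            p = mu - 1.0 / active
            return 0.5 * np.sum(np.log(1.0 + p * active))
    return 0.0
```

Given the fitted channel, capacity is computed exactly by this convex power allocation rather than bounded variationally, which is the convex optimization referred to in the abstract.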

Results

Unsupervised Stabilization of Classic Dynamical Systems

Left: Ball-in-box environment. Empowerment is highest at the center, where the ball is free to move in all directions.

Right: Pendulum. Empowerment is highest at the upright position, where gravity assists in reaching other states.

All three methods are successful in the simple ball-in-box environment. Only our method consistently balances the pendulum at an unstable equilibrium.

Stability of Empowerment Estimators

We compute the Relative Standard Deviation (RSD) of average empowerment values over 10 seeds as a measure of each algorithm's stability. Our method outperforms the existing sample-based methods, with lower variance during training and faster convergence. These properties make Latent-GCE a reliable unsupervised training signal.
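
For reference, RSD is simply the standard deviation of the per-seed average empowerment divided by the absolute mean, reported in percent; a minimal sketch (using the sample standard deviation, which is an assumption):

```python
import numpy as np

def relative_std(values_per_seed):
    """Relative Standard Deviation (RSD): std / |mean|, in percent."""
    v = np.asarray(values_per_seed, dtype=float)  # e.g. average empowerment from 10 seeds
    return 100.0 * v.std(ddof=1) / abs(v.mean())
```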

Empowerment Landscape at Convergence

This figure shows a qualitative comparison of empowerment value landscapes at convergence.

Top: Ball-in-box environment. All estimators converge to the correct landscape with high empowerment values in the center.
Bottom: Pendulum environment. Only Karl et al. (2019) and Latent-GCE produce landscapes similar to the analytical solution.

Empowerment from Pixels

Using two consecutive 64x64 images as observations, the latent Gaussian channel learns the underlying dynamics, which we verify by decoding the latent predictions and comparing them with the true observations. Latent-GCE learns a correct empowerment landscape and a successful swing-up policy.
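
A minimal PyTorch sketch of the pixel setup, assuming two consecutive grayscale frames are stacked along the channel axis before encoding; the architecture and latent dimension are illustrative assumptions, not the exact network used in the paper.

```python
import torch
import torch.nn as nn

class PixelEncoder(nn.Module):
    """Maps a stack of two consecutive 64x64 grayscale frames to a latent state."""
    def __init__(self, latent_dim=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 32, 4, stride=2), nn.ReLU(),   # 2 stacked frames -> 31x31
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),  # -> 14x14
            nn.Conv2d(64, 64, 4, stride=2), nn.ReLU(),  # -> 6x6
            nn.Flatten(),
        )
        self.fc = nn.Linear(64 * 6 * 6, latent_dim)

    def forward(self, frames):                          # frames: (B, 2, 64, 64)
        return self.fc(self.conv(frames))

# Stacking two frames lets the encoder recover velocity information
# that a single image does not contain.
obs = torch.stack([torch.rand(64, 64), torch.rand(64, 64)]).unsqueeze(0)  # (1, 2, 64, 64)
z = PixelEncoder()(obs)
```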