Continuous MDP Homomorphisms and Homomorphic Policy Gradient

Sahand Rezaei-Shoshtari1,2, Rosie Zhao1,2, Prakash Panangaden1,2, David Meger1,2, Doina Precup1,2,3

1 McGill University, 2 Mila - Québec AI Institute, 3 DeepMind


In the Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022).

TL;DR: We define continuous MDP homomorphisms and derive a homomorphic policy gradient theorem, which allows MDP homomorphisms to be used in continuous control problems.

Abstract

Abstraction has been widely studied as a way to improve the efficiency and generalization of reinforcement learning algorithms. In this paper, we study abstraction in the continuous-control setting. We extend the definition of MDP homomorphisms to encompass continuous actions in continuous state spaces. We derive a policy gradient theorem on the abstract MDP, which allows us to leverage approximate symmetries of the environment for policy optimization. Based on this theorem, we propose an actor-critic algorithm that is able to learn the policy and the MDP homomorphism map simultaneously, using the lax bisimulation metric. We demonstrate the effectiveness of our method on benchmark tasks in the DeepMind Control Suite. Our method's ability to utilize MDP homomorphisms for representation learning leads to improved performance when learning from pixel observations.

Introduction and Motivation

  • Abstractions are key to improving sample efficiency and generalization of RL agents.

  • Learning state abstractions in a scalable fashion for continuous control remains a key challenge.

  • MDP homomorphisms [1] are a joint state-action abstraction that can also represent symmetries of an MDP [3], as in the inverted pendulum example sketched below the figure:

Symmetries of an inverted pendulum.
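
For instance, assuming the usual reflection symmetry of the pendulum (state s = (angle, angular velocity), action a = torque), negating the state and the action leaves rewards and transition probabilities unchanged; a schematic sketch of how this symmetry becomes an MDP homomorphism:

```latex
% Reflection symmetry (sketch); -B denotes the reflected set {-s' : s' \in B}
R(s, a) = R(-s, -a),
\qquad
\tau(B \mid s, a) = \tau(-B \mid -s, -a) \quad \text{for all measurable } B \subseteq S,
```

so choosing f with f(s) = f(-s) and g_s with g_s(a) = g_{-s}(-a) folds the two symmetric halves of the pendulum MDP onto a single, smaller abstract MDP.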

Continuous MDP Homomorphisms

  • First, we formally define continuous MDP homomorphisms; a schematic form of the definition is given below the figure:

Visualization of an MDP homomorphism.
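
A sketch of the definition (the precise continuity, measurability, and surjectivity conditions are stated in the paper): a continuous MDP homomorphism from an actual MDP to an abstract MDP consists of a state map f and, for each state s, an action map g_s that preserve rewards and push transition probabilities forward consistently (barred symbols denote the abstract MDP):

```latex
\bar{R}\big(f(s), g_s(a)\big) = R(s, a),
\qquad
\bar{\tau}\big(\bar{B} \mid f(s), g_s(a)\big) = \tau\big(f^{-1}(\bar{B}) \mid s, a\big)
\quad \text{for all measurable } \bar{B} \subseteq \bar{S}.
```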

Optimal Value and Value Equivalence

  • We prove that continuous MDP homomorphisms preserve values and optimal values:
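
Schematically (a sketch of the statement; exact conditions are in the paper), optimal values of the abstract MDP transfer back to the actual MDP through the homomorphism map (f, g_s), with barred symbols denoting abstract quantities:

```latex
V^{*}(s) = \bar{V}^{*}\big(f(s)\big),
\qquad
Q^{*}(s, a) = \bar{Q}^{*}\big(f(s), g_s(a)\big).
```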

Homomorphic Policy Gradient (HPG)

  • Finally, we derive the Homomorphic Policy Gradient (HPG) theorem, sketched after the next bullet:

  • This means that we can perform HPG on transitions of the abstract MDP to obtain an additional gradient estimator for the performance measure of the actual MDP!
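
A sketch of the statement for a deterministic policy (it mirrors the deterministic policy gradient form; the exact statement and its conditions are in the paper): the gradient of the actual performance measure J can be computed through the abstract critic,

```latex
\nabla_{\theta} J(\pi_{\theta})
= \mathbb{E}_{s \sim \rho^{\pi_{\theta}}}\!\Big[
    \nabla_{\theta}\, g_s\big(\pi_{\theta}(s)\big)\,
    \nabla_{\bar{a}}\, \bar{Q}^{\pi_{\theta}}\big(f(s), \bar{a}\big)\big|_{\bar{a} = g_s(\pi_{\theta}(s))}
  \Big].
```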

Deep Homomorphic Policy Gradient (DHPG) Algorithm

  • We propose the Deep Homomorphic Policy Gradient (DHPG) algorithm for simultaneously learning the optimal policy and the MDP homomorphism map.

  • DHPG uses both the deterministic policy gradient (DPG) and the homomorphic policy gradient (HPG) theorems to update the same policy parameters, as in the sketch after this list.

  • DHPG uses the lax bisimulation metric [2] to learn the MDP homomorphism map.
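
A minimal sketch of how the two estimators can update the same policy parameters (illustrative only: policy, critic, abstract_critic, f_enc, and g_enc are hypothetical PyTorch-style modules standing in for the paper's networks, and the critic, model, and lax bisimulation losses are omitted):

```python
def dhpg_actor_loss(policy, critic, abstract_critic, f_enc, g_enc, states):
    """Schematic combined DPG + HPG actor objective."""
    actions = policy(states)                        # a = pi_theta(s)
    dpg_value = critic(states, actions)             # Q(s, a): DPG path on the actual MDP
    abstract_states = f_enc(states)                 # s_bar = f(s)
    abstract_actions = g_enc(states, actions)       # a_bar = g_s(a)
    hpg_value = abstract_critic(abstract_states, abstract_actions)  # Q_bar(f(s), g_s(a)): HPG path
    # Minimizing this loss back-propagates both gradient estimators into the
    # same policy parameters; the other networks are trained with their own losses.
    return -(dpg_value + hpg_value).mean()
```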

Experimental Results

  • Results are obtained on the DeepMind Control Suite, from both state and pixel observations.

  • We report the interquartile mean (IQM) and performance profiles [4], aggregated over all tasks and 10 seeds; a computational sketch follows this list.
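
For reference, such aggregate metrics can be computed with the open-source rliable library accompanying [4]; a sketch assuming a dictionary mapping each algorithm to a (num_seeds, num_tasks) array of normalized scores:

```python
import numpy as np
from rliable import library as rly, metrics, plot_utils

# Placeholder scores: replace with real (num_seeds, num_tasks) normalized returns per algorithm.
scores = {"DHPG": np.random.rand(10, 14), "Baseline": np.random.rand(10, 14)}

# Interquartile mean (IQM) with stratified-bootstrap confidence intervals.
iqm = lambda x: np.array([metrics.aggregate_iqm(x)])
iqm_scores, iqm_cis = rly.get_interval_estimates(scores, iqm, reps=5000)

# Performance profiles: fraction of runs exceeding each normalized-score threshold tau.
taus = np.linspace(0.0, 1.0, 101)
profiles, profile_cis = rly.create_performance_profile(scores, taus)
plot_utils.plot_performance_profiles(profiles, taus, performance_profile_cis=profile_cis)
```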

HPG improves policy optimization and representation learning

Sample efficiency.

Performance profiles at 500k steps.

Qualitative properties of the learned representations and abstract MDP

Actual optimal policy. Abstract optimal policy.

DHPG can recover the minimal MDP image from raw pixel observations

Experiments with a limited size latent space.

References

[1] Ravindran, B. and Barto, A.G., 2001. Symmetries and model minimization in Markov decision processes.

[2] Taylor, J., Precup, D. and Panangaden, P., 2008. Bounding performance loss in approximate MDP homomorphisms. Advances in Neural Information Processing Systems, 21.

[3] van der Pol, E., Worrall, D., van Hoof, H., Oliehoek, F. and Welling, M., 2020. MDP homomorphic networks: Group symmetries in reinforcement learning. Advances in Neural Information Processing Systems, 33.

[4] Agarwal, R., Schwarzer, M., Castro, P.S., Courville, A.C. and Bellemare, M., 2021. Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Information Processing Systems, 34.