Deep Laplacian-based Options for Temporally-Extended Exploration

Martin Klissarov

 McGill University

Mila

Marlos C. Machado

Alberta Machine Intelligence Institute (Amii)

Department of Computing Science, University of Alberta

Canada CIFAR AI Chair

Work partially done while at DeepMind.

Abstract

Selecting exploratory actions that generate a rich stream of experience for better learning is a fundamental challenge in reinforcement learning (RL). One approach to this problem is to select actions according to specific policies for an extended period of time, also known as options. A recent line of work for deriving such exploratory options builds upon the eigenfunctions of the graph Laplacian.

Importantly, until now these methods have been mostly limited to tabular domains where (1) the graph Laplacian matrix was either given or could be fully estimated, (2) performing eigendecomposition on this matrix was computationally tractable, and (3) value functions could be learned exactly. Additionally, these methods required a separate option discovery phase. These requirements fundamentally limit scalability.

In this paper we address these limitations and show how recent results for directly approximating the eigenfunctions of the Laplacian can be leveraged to truly scale up options-based exploration. To do so, we introduce a fully online deep RL algorithm for discovering Laplacian-based options and evaluate our approach on a variety of pixel-based tasks. We compare to several state-of-the-art exploration methods and show that our approach is effective, general, and especially promising in non-stationary settings.


What do the eigenfunctions of the graph Laplacian represent in RL?

The eigenfunctions of the graph Laplacian have a rich history in the RL literature. The seminal paper of Mahadevan (2005) introduces them as proto-value functions and argues that they constitute a useful representation in RL, as they encode properties of the environment at different timescales. Later on, Machado et al. (2017) propose to leverage them as a scaffold for learning temporal abstractions, defining eigenoptions as the skills that maximize each of the eigenfunctions of the graph Laplacian.
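Concretely, in the tabular setting of Machado et al. (2017), the eigenoption associated with the $i$-th eigenfunction $e_i$ is the policy that maximizes the intrinsic reward

$r_i(s, s') = e_i(s') - e_i(s),$

so the option climbs that eigenfunction and terminates in states where no further improvement along $e_i$ is achievable (a second option can analogously be defined for $-e_i$).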


Visualizations of two of the eigenfunctions in the game of Montezuma's Revenge and a 3D navigation environment, as approximated in our algorithm.

Approximating the eigenfunctions in high-dimensional spaces

The visualizations presented above were obtained in environments where it would be infeasible to perform the costly eigendecomposition of the graph Laplacian matrix (assuming the matrix could be obtained in the first place). Instead, we leverage recent advances (Wu et al., 2019; Wang et al., 2021) that directly approximate these eigenfunctions through the following objective:
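As a sketch, in the form introduced by Wu et al. (2019), which Wang et al. (2021) later generalize with per-dimension weights so that the individual eigenfunctions, rather than just their span, are recovered, the objective reads

$\min_{f}\;\tfrac{1}{2}\,\mathbb{E}_{(s,s')}\!\left[\lVert f(s)-f(s')\rVert^{2}\right]\;+\;\beta\,\mathbb{E}_{s,u\sim\rho}\!\left[\big(f(s)^{\top}f(u)\big)^{2}-\lVert f(s)\rVert^{2}-\lVert f(u)\rVert^{2}\right],$

where $f = (f_1, \dots, f_d)$ is a neural network approximating the $d$ smallest eigenfunctions, $(s, s')$ are consecutive states from the agent's experience, $\rho$ is the state-visitation distribution, and the second term is a soft orthonormality penalty weighted by $\beta$.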

The algorithm starts with a representation of the eigenfunctions that is essentially random (as shown below). It is only through the experience collected by the agent that we obtain useful approximations to the eigenfunctions. We then leverage these eigenfunctions to define a diverse set of options which, when executed by the agent, empower it to reach the boundaries of the visited regions. This helps the agent cover and explore the environment more efficiently, and it is an effective instantiation of the representation-driven option discovery cycle.
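To illustrate how such options can enter the agent's behaviour, here is a minimal Python sketch (not the exact DCEO implementation; task_policy, option_policies, option_terminates, and the probability mu of starting an option are illustrative placeholders):

import random

def select_action(state, task_policy, option_policies, option_terminates,
                  current_option=None, mu=0.1):
    # If an option is active but its termination condition fires, release control.
    if current_option is not None and option_terminates[current_option](state):
        current_option = None
    # Occasionally commit to a new, randomly chosen eigenoption.
    if current_option is None and random.random() < mu:
        current_option = random.randrange(len(option_policies))
    # Follow the active option if there is one; otherwise act with the task policy.
    if current_option is not None:
        return option_policies[current_option](state), current_option
    return task_policy(state), current_option

The key point is temporal commitment: once an option is started, the agent follows its policy for an extended stretch of time rather than re-deciding at every step.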

State coverage and the dimensions of the Laplacian representation over the course of training.

At each stage of training, we visualize the eigenfunctions approximated by our algorithm, with each of the 10 eigenfunctions shown in its own plot. Importantly, the agent has to discover the grid from a specific starting point. It is interesting to point out that, by definition, such a representation encodes orthogonality in the space of eigenfunctions, which here leads to eigenoptions pointing in a diversity of directions.

A different approach to exploration

We compare our method, Deep Covering Eigenoptions (DCEO), with several state-of-the-art exploration baselines. To highlight an important advantage of exploring through temporal abstractions, we also conduct a transfer learning experiment: having a reusable set of skills means the agent can react to changes in the environment much more effectively.

A generally applicable solution

So far, eigenoptions have mostly been shown to be competitive on simpler navigation tasks. In the following experiments, we show that they are generally applicable to complex interactions, such as picking up keys to open doors, avoiding obstacles, and even modifying the environment's topology by breaking walls to access keys.

A scalable method

We verify that the qualitative results on high-dimensional environments, 3D navigation and Montezuma's Revenge, translate into quantitative gains.

What's next?

There are still many things we need to improve on, such as:

Additionally, there are different ways to estimate the eigenfunctions, such as the concurrent work of Farebrother et al. (2023). The eigenfunctions of the graph Laplacian have been around for a long time in RL, but a new wave of interest, together with a new set of methods, has recently emerged.

Citation

@inproceedings{klissarov2023deep,
  title     = {Deep Laplacian-based Options for Temporally-Extended Exploration},
  author    = {Martin Klissarov and Marlos C. Machado},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2023}
}

Acknowledgements

The authors would like to thank Adam White, Andre Barreto, Tom Zahavy, Michael Bowling, Diana Borsa, and Doina Precup for constructive feedback throughout this project, and Andy Patterson, Josh Davidson, and the whole DeepMind Alberta team for helpful and inspiring discussions. A special thanks to Kaixin Wang for sharing the code of the generalized Laplacian, and to Kris de Asis for making the 2x2 Rubik's Cube implementation available. The authors would also like to thank NSERC for partially funding this research.