Improving Intrinsic Exploration by Creating Stationary Objectives

Roger Creus Castanyer

Joshua Romoff

Glen Berseth


Abstract

Exploration bonuses in reinforcement learning guide long-horizon exploration by defining custom intrinsic objectives. Several exploration objectives like count-based bonuses, pseudo-counts, and state-entropy maximization are non-stationary and hence are difficult to optimize for the agent. While this issue is generally known, it is usually overlooked and solutions remain under-explored. The key contribution of our work lies in transforming the original non-stationary rewards into stationary rewards through an augmented state representation. For this purpose, we introduce the Stationary Objectives For Exploration (SOFE) framework. SOFE requires identifying sufficient statistics for different exploration bonuses and finding an efficient encoding of these statistics to use as input to a deep network. SOFE is based on proposing state augmentations that expand the state space but hold the promise of simplifying the optimization of the agent's objective. We show that SOFE improves the performance of several exploration objectives, including count-based bonuses, pseudo-counts, and state-entropy maximization. Moreover, SOFE outperforms prior methods that attempt to stabilize the optimization of intrinsic objectives. We demonstrate the efficacy of SOFE in hard-exploration problems, including sparse-reward tasks, pixel-based observations, 3D navigation, and procedurally generated environments.

Problem

The non-stationarity of intrinsic rewards induces a partially observable MDP (POMDP), as the dynamics of the reward distribution are unobserved by the agent. In a POMDP, an optimal Markovian (i.e., time-homogeneous) policy is not guaranteed to exist.
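For intuition, consider the standard count-based bonus (a common instantiation; the notation below is illustrative rather than the paper's exact formulation):

r_t(s) = 1 / sqrt(N_t(s)),

where N_t(s) is the number of visits to state s up to time t. The same state s yields a different reward depending on when it is reached, because N_t(s) keeps growing as the agent collects experience. An agent that observes only s cannot predict these changes, so from its perspective the reward distribution drifts over time.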

Solution

We introduce a framework to define stationary objectives for exploration (SOFE). SOFE provides an intuitive algorithmic modification to eliminate the non-stationarity of the count-based intrinsic rewards, making the learning objective stable and stationary. Our proposed solution consists of augmenting the original states of the POMDP by including sufficient statistics of the intrinsic reward distributions. In this way, we effectively formulate the intrinsic reward as a deterministic function of the state.
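A minimal sketch of this idea for a tabular environment, assuming the wrapped environment emits integer state indices; the wrapper name, encoding, and bonus form are illustrative and not the paper's implementation:

import numpy as np
import gymnasium as gym
from gymnasium import spaces

class CountAugmentedWrapper(gym.Wrapper):
    """Illustrative sketch (not the paper's code): augment observations with the
    visit counts and add a count-based bonus, so the intrinsic reward becomes a
    deterministic function of the augmented state."""

    def __init__(self, env, num_states):
        super().__init__(env)
        self.num_states = num_states
        self.counts = np.zeros(num_states, dtype=np.float32)
        # Augmented observation = one-hot state concatenated with normalized visit counts.
        self.observation_space = spaces.Box(
            low=0.0, high=np.inf, shape=(2 * num_states,), dtype=np.float32
        )

    def _augment(self, state_idx):
        one_hot = np.zeros(self.num_states, dtype=np.float32)
        one_hot[state_idx] = 1.0
        # Normalizing keeps the count features on a bounded scale.
        norm_counts = self.counts / (1.0 + self.counts.sum())
        return np.concatenate([one_hot, norm_counts])

    def reset(self, **kwargs):
        state_idx, info = self.env.reset(**kwargs)
        self.counts[:] = 0.0  # episodic counts; remove this line for global counts
        self.counts[state_idx] += 1.0
        return self._augment(state_idx), info

    def step(self, action):
        state_idx, ext_reward, terminated, truncated, info = self.env.step(action)
        self.counts[state_idx] += 1.0
        # Count-based bonus: now a deterministic function of (state, counts).
        intrinsic = 1.0 / np.sqrt(self.counts[state_idx])
        return self._augment(state_idx), ext_reward + intrinsic, terminated, truncated, info

The policy then conditions on both the current state and the counts, which are exactly the sufficient statistics of the count-based reward.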

Does SOFE facilitate the optimization of non-stationary exploration bonuses?

Even though the intrinsic reward distribution is unchanged, SOFE enables better optimization of the objective.

SOFE & Count-based bonuses

The first row represents SOFE, which uses both the count-based rewards and state augmentation (+ C. + Aug.), and the second row represents training with the count-based rewards only (+ C.). 

SOFE improves count-based bonuses and is robust to environment specifications (e.g., continuous action spaces, procedurally generated environments, and open-world 3D navigation).

SOFE & State-Entropy Maximization

SOFE provides orthogonal gains across multiple exploration objectives and RL algorithms.
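The paper's exact entropy estimator is not reproduced here; as one hedged illustration, consider a k-nearest-neighbor entropy bonus over the states visited in the episode, whose sufficient statistic (the buffer of visited states, or a fixed-size summary of it) is what SOFE would append to the observation:

import numpy as np

def knn_entropy_bonus(state, visited, k=5):
    """Illustrative k-NN state-entropy bonus (not necessarily the estimator used
    in the paper): the reward is large when the new state is far from its k-th
    nearest neighbor among previously visited states."""
    if len(visited) < k:
        return 1.0  # arbitrary optimistic bonus while the buffer is small
    dists = np.linalg.norm(np.asarray(visited) - state, axis=1)
    kth = np.partition(dists, k - 1)[k - 1]
    return float(np.log(1.0 + kth))

# Under SOFE, a fixed-size summary of `visited` (e.g., a grid of visit
# frequencies or a running mean and covariance) would be concatenated to the
# observation, turning the bonus into a function of the augmented state rather
# than of hidden, time-varying statistics.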

How much does SOFE improve exploration for downstream tasks?

Our proposed augmentations achieve higher downstream task performance in sparse-reward settings.

SOFE generally provides statistically significant improvements in sparse-reward tasks across RL algorithms and exploration modalities. Without exploration bonuses, agents fail to find the extrinsic rewards during training.

We compare SOFE to DeRL in the DeepSea environment. DeRL entirely decouples the training process of an exploratory policy from the exploitation policy to stabilize the optimization of the exploration objective. SOFE is less complex than DeRL as it only requires training an additional feature extractor. Still, SOFE achieves better results in the harder variations of the DeepSea environment.

How does SOFE perform when only approximations to state visitation frequencies are available?

We apply our proposed augmentation to E3B, the state-of-the-art (SOTA) method for episodic exploration.

SOFE further improves the performance of the SOTA method for episodic exploration in procedurally generated environments.
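E3B maintains an episodic ellipsoid of embedded states and rewards visits along directions that the ellipsoid does not yet cover. A minimal sketch of the elliptical bonus is below; the running inverse ellipsoid matrix is the kind of statistic SOFE exposes to the policy, but the embedding network, dimensions, and the exact encoding of the matrix are assumptions here, not the paper's implementation:

import numpy as np

class EllipticalBonus:
    """Sketch of E3B's episodic elliptical bonus. The inverse ellipsoid matrix
    C_inv is the sufficient statistic that SOFE would feed to the policy
    (e.g., flattened and concatenated to the state embedding)."""

    def __init__(self, dim, ridge=0.1):
        self.C_inv = np.eye(dim) / ridge  # inverse of the initial ridge term

    def update_and_bonus(self, phi):
        # Elliptical bonus: b = phi^T C^{-1} phi, using the pre-update matrix.
        bonus = float(phi @ self.C_inv @ phi)
        # Rank-one (Sherman-Morrison) update of the inverse ellipsoid.
        u = self.C_inv @ phi
        self.C_inv -= np.outer(u, u) / (1.0 + phi @ u)
        return bonus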

SOFE is agnostic to the RL algorithm and environment specifications.


minihack.mp4

[Extra] Analysis of the behaviours learned by SOFE

SOFE agents pre-trained on episodic exploration can adapt to never-before-seen state visitation frequencies.

To understand how the policy uses the augmented information in the states, we craft artificial state visitation frequencies indicating that every state has been visited except one (denoted with a red box in the Figure below). We find that augmented agents effectively use the additional information to direct their exploration toward the unvisited states.
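As a minimal illustration of such hand-crafted counts (grid size, count values, and goal location are hypothetical, not the paper's maps):

import numpy as np

# Mark every cell of a hypothetical grid as already visited except the goal
# cell, then feed the normalized count map to the pre-trained augmented agent.
grid_h, grid_w = 10, 10
goal = (0, 9)  # e.g., the upper-right corner
fake_counts = np.full((grid_h, grid_w), 10.0, dtype=np.float32)
fake_counts[goal] = 0.0  # the only "unvisited" state
augmentation = fake_counts / fake_counts.sum()  # normalized visitation frequencies
# Concatenating `augmentation` to the observation steers the count-seeking
# policy toward the goal cell without any extrinsic reward.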

Evaluation of out-of-the-box goal-conditioned behaviours. Yellow boxes show the agent's starting positions, red boxes show the goal positions, and green traces show the agent's trajectory when conditioned on hand-crafted counts.

episodic_goal_eval_m1.mp4

Goal-conditioned to reach the upper-right corner

episodic_goal_eval_m2.mp4

Goal-conditioned to reach the upper-left corner

episodic_goal_eval_m3.mp4

Goal-conditioned to reach the bottom-right corner