Improving Intrinsic Exploration by Creating Stationary Objectives

Anonymous Submission

[Full Paper] [Code]

Abstract

Exploration bonuses in reinforcement learning guide long-horizon exploration by defining custom intrinsic objectives. Count-based methods use the frequency of state visits to derive an exploration bonus. In this paper, we identify that any intrinsic reward function derived from count-based methods is non-stationary and hence induces an objective that is difficult for the agent to optimize. The key contribution of our work lies in transforming the original non-stationary rewards into stationary rewards through an augmented state representation. For this purpose, we introduce the Stationary Objectives For Exploration (SOFE) framework. SOFE requires identifying sufficient statistics for different exploration bonuses and finding an efficient encoding of these statistics to use as input to a deep network. SOFE is based on proposing state augmentations that expand the state space but hold the promise of simplifying the optimization of the agent's objective. Our experiments show that SOFE improves the agents' performance in challenging exploration problems, including sparse-reward tasks, pixel-based observations, 3D navigation, and procedurally generated environments.

Problem

The non-stationarity of intrinsic rewards induces a partially observable MDP (POMDP), as the dynamics of the reward distribution are unobserved by the agent. In a POMDP, an optimal Markovian (i.e., time-homogeneous) policy is not guaranteed to exist.
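
To make the issue concrete, here is a minimal sketch (not the paper's code), assuming the common count-based bonus r_int(s) = beta / sqrt(N(s)): the same state yields a different reward on every visit, because the reward depends on the visitation history, which the agent does not observe.

```python
# Minimal sketch of why a count-based bonus is non-stationary: revisiting the
# same state yields a different reward each time, because the reward depends
# on the unobserved visitation history rather than on the state alone.
from collections import defaultdict

counts = defaultdict(int)  # N(s): state -> visitation count


def count_based_bonus(state, beta=1.0):
    counts[state] += 1
    return beta / counts[state] ** 0.5  # r_int(s) = beta / sqrt(N(s))


s = (0, 0)  # the same state, visited three times
print([round(count_based_bonus(s), 3) for _ in range(3)])  # [1.0, 0.707, 0.577]
```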

Solution

We introduce a framework to define stationary objectives for exploration (SOFE). SOFE provides an intuitive algorithmic modification to eliminate the non-stationarity of the count-based intrinsic rewards, making the learning objective stable and stationary. Our proposed solution consists of augmenting the original states of the POMDP by including the state visitation frequencies. In this way, we effectively formulate the intrinsic reward as a deterministic function of the state.
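
As an illustration, a hypothetical gymnasium-style wrapper for a small discrete environment could append the normalized episodic visitation counts to the observation and add the count-based bonus to the reward; the sketch below follows that assumption and is not the paper's implementation.

```python
# Hypothetical wrapper sketching the SOFE augmentation for a discrete
# environment: the visitation counts are appended to the observation, so the
# count-based bonus becomes a deterministic function of the augmented state.
import numpy as np
import gymnasium as gym


class CountAugmentationWrapper(gym.Wrapper):
    def __init__(self, env, n_states, beta=1.0):
        super().__init__(env)
        self.n_states = n_states
        self.beta = beta
        self.counts = np.zeros(n_states, dtype=np.float32)
        # A full implementation would also update self.observation_space.

    def _augment(self, obs):
        # Normalized counts are a sufficient statistic for the bonus.
        norm_counts = self.counts / max(1.0, float(self.counts.sum()))
        return np.concatenate([np.atleast_1d(obs).astype(np.float32), norm_counts])

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.counts[:] = 0.0  # episodic counts; keep them across episodes for global counts
        return self._augment(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        s = int(obs)  # assumes the observation is a discrete state index
        self.counts[s] += 1.0
        bonus = self.beta / np.sqrt(self.counts[s])
        return self._augment(obs), reward + bonus, terminated, truncated, info
```

With this augmentation, the bonus beta / sqrt(N(s)) can be computed from the augmented state alone, so the intrinsic reward is stationary and Markovian in the augmented MDP.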

Does our method facilitate the optimization of non-stationary exploration bonuses?

Although both agents optimize for the same intrinsic reward distribution, our method allows for better optimization of the objective.

The first row shows our method, trained with both the count-based rewards and the state augmentation (+ C. + Aug.); the second row shows training with the count-based rewards only (+ C.).

How much does our method improve exploration for downstream tasks?

Our proposed augmentation achieves higher downstream task performance in sparse-reward settings

Augmented state representations generally provide statistically significant improvements in the goal-reaching task across RL algorithms and exploration modalities. Without exploration bonuses, agents fail to reach the goal during training.

How does our method perform when only approximations to state visitation frequencies are available?

We apply our proposed augmentation to E3B, the SOTA method for episodic exploration

Our proposed augmentation further improves the performance of the SOTA method for episodic exploration in procedurally generated environments

Our proposed augmentation is agnostic to the RL algorithm and environment specifications


minihack.mp4
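
For reference, E3B replaces counts with an elliptical episodic bonus b(s_t) = phi(s_t)^T C_{t-1}^{-1} phi(s_t) computed from learned embeddings phi, where C_t is a regularized sum of outer products of the embeddings seen in the episode. The sketch below shows one way to track this bonus with a rank-one (Sherman-Morrison) update and expose its running statistic, the inverse covariance, as the augmentation; the class name and the choice to flatten the matrix are illustrative assumptions, not the paper's code.

```python
# Hedged sketch of an E3B-style elliptical episodic bonus with a running
# Sherman-Morrison update of the inverse covariance. The flattened inverse
# matrix plays the role of the augmentation fed to the policy.
import numpy as np


class EllipticalBonus:
    def __init__(self, feature_dim, ridge=0.1):
        # C_0 = ridge * I, so C_0^{-1} = I / ridge
        self.inv_cov = np.eye(feature_dim, dtype=np.float32) / ridge

    def update_and_bonus(self, phi):
        # b(s_t) = phi^T C_{t-1}^{-1} phi
        u = self.inv_cov @ phi
        bonus = float(phi @ u)
        # Rank-one update of C^{-1} after adding phi phi^T to C
        self.inv_cov -= np.outer(u, u) / (1.0 + bonus)
        return bonus

    def augmentation(self):
        # Approximate sufficient statistic of the episodic bonus,
        # appended to the agent's observation
        return self.inv_cov.flatten()
```

An augmented observation could then be formed as np.concatenate([embedding, tracker.augmentation()]), analogous to appending counts in the tabular case.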

[Extra] Behaviour Specification with Goal-Conditioned RL

Augmented agents pre-trained on episodic exploration can adapt to state visitation frequencies never seen during training

To understand how the policy uses the augmented information in the states, we craft artificial state visitation frequencies indicating that every state has been visited except one (denoted with a red box in the figure below). We find that augmented agents effectively use this additional information to direct their exploration toward the unvisited state.
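
A sketch of how such hand-crafted counts could be built for a grid environment is shown below; the grid size, goal location, and policy call are hypothetical.

```python
# Illustrative sketch of hand-crafted counts used to goal-condition an
# augmented agent: mark every state as visited except the target, so the
# count-based bonus points the agent toward that single "unvisited" state.
import numpy as np

height, width = 9, 9                       # hypothetical grid size
goal = (0, width - 1)                      # e.g., the upper-right corner

fake_counts = np.ones((height, width), dtype=np.float32)
fake_counts[goal] = 0.0                    # the only "unvisited" state

augmentation = fake_counts.flatten() / fake_counts.sum()
# augmented_obs = np.concatenate([obs, augmentation])
# action = policy(augmented_obs)           # hypothetical pre-trained augmented policy
```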

Evaluation of out-of-the-box goal-conditioned behaviours. Yellow boxes show the agent's starting positions, red boxes show the goals' positions, and green traces show the agent's trajectory when conditioned on the hand-crafted counts.

episodic_goal_eval_m1.mp4

Goal-conditioned to reach the upper-right corner

episodic_goal_eval_m2.mp4

Goal-conditioned to reach the upper-left corner

episodic_goal_eval_m3.mp4

Goal-conditioned to reach the bottom-right corner