Accelerating Reinforcement Learning
with Value-Conditional State Entropy Exploration

Dongyoung Kim, Jinwoo Shin, Pieter Abbeel, Younggyo Seo

[Paper] [Code]

Abstract

A promising technique for exploration is to maximize the entropy of the visited state distribution, i.e., the state entropy, by encouraging uniform coverage of the visited state space. While it has been effective in an unsupervised setup, it tends to struggle in a supervised setup with a task reward, where an agent prefers to visit high-value states to exploit the task reward. Such a preference can cause an imbalance between the distributions of high-value and low-value states, which biases exploration towards low-value state regions, since the state entropy increases when the distribution becomes more uniform. This issue is exacerbated when high-value states are narrowly distributed within the state space, making it difficult for the agent to complete the tasks. In this paper, we present a novel exploration technique that maximizes the value-conditional state entropy, which separately estimates the state entropies conditioned on the value estimates of each state and then maximizes their average. By considering only the visited states with similar value estimates when computing the intrinsic bonus, our method prevents the distribution of low-value states from affecting exploration around high-value states, and vice versa. We demonstrate that the proposed alternative to the state entropy baseline significantly accelerates various reinforcement learning algorithms across a variety of tasks within the MiniGrid, DeepMind Control Suite, and Meta-World benchmarks.
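In notation, the shift described above is from maximizing the state entropy to maximizing its value-conditional counterpart. The following is a sketch of the objective, with S denoting the visited-state random variable and V its value estimate (notation introduced here for illustration):

```latex
% Plain state entropy exploration maximizes H(S). VCSE instead maximizes the
% value-conditional state entropy: the average, over value estimates, of the
% state entropy among states sharing that value estimate.
\mathcal{H}(S \mid V)
  = \mathbb{E}_{v \sim p(v)}\!\left[\mathcal{H}(S \mid V = v)\right]
  = -\,\mathbb{E}_{(s, v) \sim p(s, v)}\!\left[\log p(s \mid v)\right]
```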

Motivation

Research Question: How can we prevent the imbalance within the visited state distribution from affecting exploration?

Method: Value-Conditional State Entropy Exploration

Intuitive Illustration

Overview of Our Method (VCSE)

We compute the state norms and value norms between pairs of samples within a minibatch and sort the samples by the maximum of the two norms. We then find the k-th nearest neighbor of each sample under this metric and use the distance to it as an intrinsic reward. In effect, our method excludes samples whose value estimates differ significantly when computing the intrinsic reward. Finally, we train our RL agent to maximize the sum of the intrinsic reward and the extrinsic reward.
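Below is a minimal PyTorch sketch of this intrinsic reward computation. The function name, the mean-based normalization of the two distance scales, and the simplification to a plain log k-NN distance bonus are assumptions made for illustration; the estimator used in the paper includes additional terms, so please refer to the released code for the exact form.

```python
import torch


def vcse_intrinsic_reward(states: torch.Tensor, values: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Value-conditional k-NN distance bonus for a minibatch (illustrative sketch).

    states: (N, D) tensor of (encoded) states.
    values: (N,) tensor of value estimates for those states.
    """
    # Pairwise distances in state space and in value space.
    state_dist = torch.cdist(states, states, p=2)                     # (N, N)
    value_dist = (values.unsqueeze(0) - values.unsqueeze(1)).abs()    # (N, N)

    # Normalize the two distance scales so neither dominates the max (assumed scheme).
    state_dist = state_dist / (state_dist.mean() + 1e-8)
    value_dist = value_dist / (value_dist.mean() + 1e-8)

    # Rank neighbors by the maximum of the two normalized distances, so samples
    # with very different value estimates are pushed far away and effectively
    # excluded from the nearest-neighbor search.
    joint_dist = torch.maximum(state_dist, value_dist)
    joint_dist.fill_diagonal_(float("inf"))  # ignore self-distances

    # Distance to the k-th nearest neighbor under the joint metric.
    knn_dist, _ = joint_dist.topk(k, dim=1, largest=False)
    eps = knn_dist[:, -1]

    # Log k-NN distance as the intrinsic reward: a larger distance means the
    # region is less visited among similarly-valued states, so the bonus is larger.
    return torch.log(eps + 1.0)
```

During training, this bonus would be scaled and added to the task reward, e.g., r = r_ext + beta * r_int, before the standard RL update.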

Experimental Results

We evaluate our method on the MiniGrid, DeepMind Control Suite, and Meta-World benchmarks. Please see our paper for more experiments, including additional analyses, ablation studies, and experiments in different setups!

MiniGrid Experiments

DeepMind Control Suite Experiments

Meta-World Experiments

Bibtex

To be added