Efficient Exploration via State Marginal Matching

Lisa Lee, Benjamin Eysenbach, Emilio Parisotto*, Eric Xing, Sergey Levine, and Ruslan Salakhutdinov

Paper: https://arxiv.org/abs/1906.05274

Code: https://github.com/RLAgent/state-marginal-matching

Contributed Talks @ ICLR 2019 workshops on TARL and SPiRL [Slides]

We recast exploration as a problem of State Marginal Matching (SMM): we aim to learn a policy whose state marginal distribution matches a given target state distribution, which can incorporate prior knowledge about the task. Our theoretical analysis suggests that prior exploration methods do not learn a policy that performs distribution matching. Instead, they acquire a replay buffer that performs distribution matching, an observation that potentially explains their success in single-task settings.
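As a sketch of what "matching" means here, the objective can be written as a KL divergence between the policy's state marginal and the target. The symbols below (\rho_\pi for the state marginal of policy \pi, p^* for the target distribution, \mathcal{H} for entropy) are notation assumed for this illustration:

```latex
\[
D_{\mathrm{KL}}\big(\rho_\pi(s) \,\|\, p^*(s)\big)
  = \mathbb{E}_{s \sim \rho_\pi(s)}\big[\log \rho_\pi(s) - \log p^*(s)\big]
  = -\,\mathbb{E}_{s \sim \rho_\pi(s)}\big[\log p^*(s)\big] - \mathcal{H}\big[\rho_\pi(s)\big]
\]
```

Minimizing this divergence over policies is therefore the same as maximizing the expected log-density of the target plus the entropy of the state marginal. With a uniform target, the first term is constant and the objective reduces to maximizing state entropy.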

Manipulation

The robot controls a gripper to pick and place a block on top of a table surface. As shown in the figure above, a simple approach results in very poor exploration, whereas our method teaches the robot to play with a block.

[Video: D'Claw trained on a real robot]

D'Claw Robotic Hand

We trained SMM and SAC on the valve turning task with the D’Claw robotic hand [Ahn et al., 2019], where the target distribution places uniform mass over all object angles [-180°, 180°]. SMM learns to move the knob more and visits a wider range of states than SAC.
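One rough way to quantify "visits a wider range of states" against a uniform target over angles is a histogram estimate of the KL divergence to the uniform distribution. The sketch below is illustrative only: the function name `kl_to_uniform`, the bin count, and the synthetic angle samples are assumptions, not the evaluation used in the paper.

```python
import numpy as np

def kl_to_uniform(angles_deg, num_bins=36):
    """Histogram estimate of KL(empirical || uniform) over [-180, 180] degrees."""
    counts, _ = np.histogram(angles_deg, bins=num_bins, range=(-180.0, 180.0))
    p = counts / counts.sum()   # empirical probability of each angle bin
    q = 1.0 / num_bins          # uniform target probability of each bin
    nz = p > 0                  # empty bins contribute zero to the sum
    return float(np.sum(p[nz] * np.log(p[nz] / q)))

# Synthetic data (hypothetical): one agent covers all angles, one stays near zero.
rng = np.random.default_rng(0)
wide = rng.uniform(-180.0, 180.0, size=10_000)
narrow = rng.uniform(-20.0, 20.0, size=10_000)

kl_wide = kl_to_uniform(wide)
kl_narrow = kl_to_uniform(narrow)
```

A lower value means the visited angles are closer to the uniform target, so under these assumptions the agent that covers the full range scores much lower than the one that stays near its starting angle.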

Exploration in State Space (SMM) vs. Action Space (SAC)

MaxEnt RL [Ziebart 2010] algorithms such as Soft Actor-Critic (SAC) [Haarnoja 2018] maximize entropy over actions, which is often motivated as leading to good exploration. In contrast, the State Marginal Matching objective maximizes entropy over states. We hypothesized that exploring in the space of states would be more effective than exploring in the space of actions.
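To make the contrast concrete, a state-matching objective can be read as a pseudo-reward of the form log p*(s) - log rho(s), which is high in states the agent has rarely visited relative to the target. The snippet below is a minimal sketch over a discretized state space. The function name `smm_pseudo_reward`, the count-based density estimate, and the numbers are assumptions made for illustration, not the released implementation.

```python
import numpy as np

def smm_pseudo_reward(state_bin, visit_counts, target_probs, eps=1e-8):
    """Pseudo-reward log p*(s) - log rho(s) for one discretized state."""
    rho = visit_counts / visit_counts.sum()  # empirical state marginal
    return float(np.log(target_probs[state_bin] + eps) - np.log(rho[state_bin] + eps))

# Hypothetical visitation counts for three state bins and a uniform target.
visit_counts = np.array([90.0, 9.0, 1.0])
target_probs = np.full(3, 1.0 / 3.0)

rewards = [smm_pseudo_reward(i, visit_counts, target_probs) for i in range(3)]
```

In this toy setting the over-visited bin receives a negative reward and the rarely visited bins receive increasingly positive ones, which pushes the agent toward under-explored states regardless of how random its actions are.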

On the Navigation task, we measured how each method performed as we increased the exploration difficulty by increasing the number of hallways. SMM consistently explores 60% of hallways, whereas MaxEnt RL rarely visits more than 20% of hallways. Further, using a mixture of policies (N = 3, 5, 10) improves exploration even more.

Comparison with Prior Exploration Methods

Test-time Exploration (Manipulation)

Prior work [Schmidhuber 1991, Bellemare 2016, Burda 2018, Pathak 2017] has proposed a number of mechanisms for exploration in RL. Empirically, we find that State Marginal Matching explores more quickly than baselines in the multi-task setting. (ICM = [Pathak 2017], PC = [Bellemare 2016])

Historical Averaging (Manipulation)

While it is often unclear what objective these exploration bonuses are maximizing, our paper discusses how prior work can be interpreted as approximately performing distribution matching while omitting a crucial historical averaging step (see Section 3.2 of the paper). Empirically, we show that this historical averaging improves exploration.
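A toy example can show why averaging over past policies matters. Suppose each successive policy concentrates on a different region of a four-state space. No single policy matches a uniform target, but the average of their state marginals does. The marginals below are hand-picked for illustration and the helper `kl` is a hypothetical name. This is a sketch of the intuition, not the procedure from the paper.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, skipping zero-probability entries of p."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / (q[nz] + eps))))

target = np.full(4, 0.25)  # uniform target over four states

# Hypothetical state marginals of four successive policies.
policy_marginals = [
    np.array([0.7, 0.1, 0.1, 0.1]),
    np.array([0.1, 0.7, 0.1, 0.1]),
    np.array([0.1, 0.1, 0.7, 0.1]),
    np.array([0.1, 0.1, 0.1, 0.7]),
]

last_iterate = policy_marginals[-1]
historical_average = np.mean(policy_marginals, axis=0)

kl_last = kl(last_iterate, target)
kl_average = kl(historical_average, target)
```

Here the final policy alone is far from the target, while the historical average matches it almost exactly. This mirrors the observation that a replay buffer, which pools states from all past policies, can match the target even when the latest policy does not.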