We recast exploration as a problem of State Marginal Matching (SMM): we aim to learn a policy whose state marginal distribution matches a given target state distribution, which can incorporate prior knowledge about the task. Our theoretical analysis suggests that prior exploration methods do not learn a policy that performs distribution matching. Instead, they acquire a replay buffer that performs distribution matching, an observation that potentially explains their success in single-task settings.
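To make the matching objective concrete, here is a minimal sketch that assumes a small discrete state space, so the state marginal can be estimated by counting. The helper names `state_marginal` and `kl_to_target` are hypothetical and chosen for illustration; they are not taken from the paper's code. The sketch measures the KL divergence from the empirical state marginal to the target, which is zero exactly when the two distributions match.

```python
import math
from collections import Counter


def state_marginal(visited_states, num_states):
    """Empirical state marginal: fraction of visits that land in each state."""
    counts = Counter(visited_states)
    total = len(visited_states)
    return [counts.get(s, 0) / total for s in range(num_states)]


def kl_to_target(rho, target):
    """KL(rho || target) for discrete distributions.

    Assumes target[s] > 0 wherever rho[s] > 0 (hypothetical toy setting).
    """
    kl = 0.0
    for r, p in zip(rho, target):
        if r > 0.0:
            kl += r * math.log(r / p)
    return kl


# Toy usage: a uniform target over four states.
target = [0.25, 0.25, 0.25, 0.25]
even_visits = state_marginal([0, 1, 2, 3], num_states=4)
stuck_visits = state_marginal([0, 0, 0, 0], num_states=4)

print(kl_to_target(even_visits, target))   # matches the target: 0.0
print(kl_to_target(stuck_visits, target))  # never leaves state 0: log(4)
```

In continuous state spaces the counting step would have to be replaced by a learned density model; this sketch does not attempt that.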
The robot controls a gripper to pick and place a block on top of a table surface. As shown in the figure above, a simple approach results in very poor exploration, whereas our method teaches the robot to play with a block.
D'Claw Robotic Hand
We trained SMM and SAC on the valve turning task with the D’Claw robotic hand [Ahn et al., 2019], where the target distribution places uniform mass over all object angles in [-180°, 180°]. SMM learns to move the knob more and visits a wider range of states than SAC.
Exploration in State Space (SMM) vs. Action Space (SAC)
MaxEnt RL [Ziebart, 2010] algorithms such as Soft Actor-Critic (SAC) [Haarnoja et al., 2018] maximize entropy over actions, which is often motivated as leading to good exploration. In contrast, the State Marginal Matching objective leads to maximizing entropy over states. We hypothesized that exploring in the space of states would be more effective than exploring in the space of actions.
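The difference between the two kinds of entropy can be seen in a toy example. The sketch below is an illustrative assumption, not one of the paper's experiments: a one-dimensional chain of states where the agent starts at the left end and moves left or right. A policy that picks actions uniformly at random has maximal action entropy but stays near the start, while a policy that always moves right has zero action entropy yet visits every state equally often.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def time_averaged_state_marginal(p_right, num_states, horizon):
    """Exact state marginal of a chain walk, averaged over steps 0..horizon.

    The agent starts in state 0, moves right with probability p_right and
    left otherwise; moves past either end of the chain leave it in place.
    """
    dist = [0.0] * num_states
    dist[0] = 1.0
    marginal = dist[:]
    for _ in range(horizon):
        nxt = [0.0] * num_states
        for s, mass in enumerate(dist):
            right = min(s + 1, num_states - 1)
            left = max(s - 1, 0)
            nxt[right] += p_right * mass
            nxt[left] += (1.0 - p_right) * mass
        dist = nxt
        marginal = [m + d for m, d in zip(marginal, dist)]
    return [m / (horizon + 1) for m in marginal]


for name, p_right in [("random actions", 0.5), ("always right", 1.0)]:
    action_entropy = entropy([p_right, 1.0 - p_right])
    state_entropy = entropy(
        time_averaged_state_marginal(p_right, num_states=10, horizon=9)
    )
    print(name, "action entropy:", action_entropy, "state entropy:", state_entropy)
```

The random-action policy is only a stand-in for "high action entropy"; a trained MaxEnt RL agent such as SAC is not a uniform random policy.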
On the Navigation task, we measured how each method performed as we increased the exploration difficulty by increasing the number of hallways. SMM consistently explores 60% of hallways, whereas MaxEnt RL rarely visits more than 20% of hallways. Further, using a mixture of policies (N = 3, 5, 10) improves exploration even more.
Comparison with Prior Exploration Methods
Test-time Exploration (Manipulation)
Historical Averaging (Manipulation)
While it is often unclear what objective the exploration bonuses in prior methods are maximizing, our paper discusses how prior work can be interpreted as approximately performing distribution matching while omitting a crucial historical averaging step (see Section 3.2). Empirically, we show that this historical averaging improves exploration.
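The role of historical averaging can be illustrated with a toy sketch. Everything here is an assumption made for illustration: a tiny discrete state space, a smoothed count-based density model, and a greedy "policy" that simply visits the single state with the highest reward log p*(s) - log q(s). It is not the algorithm from the paper. The sketch compares fitting the density to all past visits (historical averaging) with fitting it only to the latest iteration's visit.

```python
import math
from collections import Counter


def fit_density(states, num_states, smoothing=1.0):
    """Count-based density estimate with additive smoothing."""
    counts = Counter(states)
    total = len(states) + smoothing * num_states
    return [(counts.get(s, 0) + smoothing) / total for s in range(num_states)]


def best_response(target, density):
    """Greedy stand-in for the policy update: visit the highest-reward state.

    Ties are broken in favor of the lowest state index.
    """
    rewards = [math.log(p) - math.log(q) for p, q in zip(target, density)]
    return max(range(len(rewards)), key=rewards.__getitem__)


def run(num_states, iterations, historical_averaging):
    """Return how often each state was visited across all iterations."""
    target = [1.0 / num_states] * num_states  # uniform target distribution
    buffer = []  # every state visited so far
    latest = []  # only the most recent iteration's state
    for _ in range(iterations):
        data = buffer if historical_averaging else latest
        density = fit_density(data, num_states)
        state = best_response(target, density)
        latest = [state]
        buffer.append(state)
    return Counter(buffer)


print(run(num_states=4, iterations=8, historical_averaging=True))
print(run(num_states=4, iterations=8, historical_averaging=False))
```

In this toy setting, fitting the density to the full history makes the visits cycle through all four states, so the buffer matches the uniform target even though each individual iteration visits only one state. Fitting to the latest iteration alone makes the visits alternate between two states and never reach the others. The exact cycling pattern depends on the deterministic tie-breaking, which is an artifact of the sketch.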