Contributed Talks @ ICLR 2019 workshops on TARL and SPiRL [Slides]

To build a smart robot, we need to teach it to explore its environment: to find out what it is physically capable of doing and to figure out how to perform tasks for a human. In this project, we propose an algorithm that automatically teaches a robot how to explore in order to solve new tasks. As shown in the figure above, a naive approach results in very poor exploration, whereas our method teaches the robot to play with a block.

Environments

Navigation

The agent is spawned at the center of long hallways that extend radially outward. The agent’s task is to navigate to the end of a goal corridor. We can vary the length and number of hallways to finely control the difficulty of exploration. We consider two types of robots: a 2D point robot and a quadrupedal robot.
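
For concreteness, here is a minimal sketch of how such an environment might be parameterized, using the 2D point robot (illustrative Python; the class, dynamics, and sparse-reward threshold are placeholders, not our actual environment code):

```python
import numpy as np

class PointHallwayEnv:
    """Illustrative 2D point-robot version of the radial-hallway task:
    `num_arms` corridors of length `hall_length` extend radially from the
    start position, with a sparse reward at the end of one goal corridor."""

    def __init__(self, num_arms=3, hall_length=10.0, hall_width=1.0, goal_arm=0):
        angles = 2 * np.pi * np.arange(num_arms) / num_arms
        self.arm_dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
        self.hall_length, self.hall_width = hall_length, hall_width
        self.goal = self.arm_dirs[goal_arm] * hall_length
        self.pos = np.zeros(2)

    def reset(self):
        self.pos = np.zeros(2)
        return self.pos.copy()

    def step(self, action):
        proposed = self.pos + 0.1 * np.clip(action, -1.0, 1.0)
        # Accept the move only if the new position lies inside some corridor.
        along = proposed @ self.arm_dirs.T
        perp = np.linalg.norm(proposed[None] - along[:, None] * self.arm_dirs, axis=1)
        inside = (along >= 0) & (along <= self.hall_length) & (perp <= self.hall_width / 2)
        if inside.any():
            self.pos = proposed
        reward = float(np.linalg.norm(self.pos - self.goal) < 0.5)  # sparse goal reward
        return self.pos.copy(), reward, False, {}
```

Increasing `num_arms` or `hall_length` makes exhaustive exploration harder, which is exactly the knob we use to compare methods below.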

Manipulation

A robot controls a gripper to pick and place a block on a table. The robot’s task is to move the block to a goal location that it does not observe, so the robot must explore by moving the block to different locations on the table.

Exploration in Actions vs. States

MaxEnt RL [Ziebart 2010] algorithms, such as Soft Actor Critic (SAC) [Haarnoja 2018], maximize entropy over actions, which is often motivated as leading to good exploration. In contrast, the State Marginal Matching objective maximizes entropy over states. We hypothesized that exploring in the space of states would be more effective than exploring in the space of actions.
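
Concretely, the two objectives differ in which bonus the RL algorithm optimizes. The sketch below (illustrative Python with placeholder inputs, not our training code) contrasts the action-entropy bonus used by MaxEnt RL with the state-level pseudo-reward log p*(s) − log ρ_π(s) that falls out of the State Marginal Matching objective:

```python
def maxent_action_entropy_reward(env_reward, log_pi_action, alpha=0.1):
    """MaxEnt RL (e.g., SAC): augment the task reward with an *action*-entropy
    bonus, -alpha * log pi(a|s), which encourages taking random actions."""
    return env_reward - alpha * log_pi_action

def state_marginal_matching_reward(log_p_target, log_state_marginal):
    """State Marginal Matching: minimizing the KL between the policy's state
    marginal rho_pi(s) and a target p*(s) yields the pseudo-reward
    log p*(s) - log rho_pi(s), which rewards visiting rarely seen states that
    the target prefers; with a uniform p*(s) this is a *state*-entropy bonus."""
    return log_p_target - log_state_marginal
```

In practice the state marginal is not known exactly, so it is approximated with a learned density model fit to the states the policy visits (see the historical-averaging sketch below).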

On the navigation task, we measured how each method performed as we increased the exploration difficulty by increasing the number of hallways (# Arms). Our method, which maximizes entropy over states, consistently explores 60% of the hallways, whereas MaxEnt RL, which maximizes entropy over actions, rarely visits more than 20% of them. Further, using a mixture of policies (N = 3, 5, 10) improves exploration even more.
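
By a mixture of policies we mean, roughly, that one component is sampled at the start of each episode and rolled out, so different components can specialize to different hallways. A minimal sketch (illustrative Python; the `policies` list and gym-style `env` interface are assumptions):

```python
import numpy as np

def rollout_mixture(policies, env, episode_len=200, rng=np.random):
    """Roll out a uniform mixture over N policies: pick one component per
    episode and act with it for the whole episode. `policies` is a list of
    callables obs -> action; `env` follows a gym-style step/reset interface."""
    z = rng.randint(len(policies))  # which mixture component to use this episode
    obs = env.reset()
    visited = [obs]
    for _ in range(episode_len):
        obs, reward, done, _ = env.step(policies[z](obs))
        visited.append(obs)
        if done:
            break
    return z, visited
```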

Comparison with Prior Exploration Methods

Test-time Exploration (Manipulation)

Prior work [Schmidhuber 1991, Bellemare 2016, Burda 2018, Pathak 2017] has proposed a number of exploration mechanisms for RL, typically in the form of exploration bonuses. Empirically, we find that State Marginal Matching explores more quickly than these baselines in the multi-task setting. (ICM = [Pathak 2017], PC = [Bellemare 2016])

Historical Averaging (Manipulation)

While it is often unclear what objective these exploration bonuses are maximizing, our paper discusses how prior work can be interpreted as approximately performing distribution matching, but omitting a crucial historical averaging step (see Section 3.2). Empirically, we show that this historical averaging improves exploration.
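
A minimal sketch of the historical-averaging recipe (illustrative Python; the helpers `rollout` and `train_policy` are assumed, and a kernel density estimator stands in for whatever density model is used in practice):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def exploration_with_historical_averaging(init_policy, rollout, train_policy,
                                          num_iters=10):
    """Sketch of the fictitious-play-style recipe with historical averaging.
    Assumed helpers: `rollout(policy)` returns an array of visited states;
    `train_policy(policy, density)` runs RL on the pseudo-reward derived from
    the fitted density. Dropping the averaging below recovers the greedy
    updates that many prior exploration bonuses effectively perform."""
    policies, all_states = [init_policy], []
    for _ in range(num_iters):
        all_states.append(rollout(policies[-1]))
        # Historical averaging for the density player: fit the density to
        # states pooled over *all* past policies, not just the latest one.
        density = KernelDensity(bandwidth=0.2).fit(np.concatenate(all_states))
        policies.append(train_policy(policies[-1], density))
    # Historical averaging for the policy player: the exploration policy is
    # the uniform mixture over all past policies, not only the final iterate.
    return policies
```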