Meta-Reinforcement Learning of Structured Exploration Strategies

Abstract: Exploration is a fundamental challenge in reinforcement learning (RL). Many of the current exploration methods for deep RL use task-agnostic objectives, such as information gain or bonuses based on state visitation. However, many practical applications of RL involve learning more than a single task, and prior tasks can be used to inform how exploration should be performed in new tasks. In this work, we explore how prior tasks can inform an agent about how to explore effectively in new situations. We introduce a novel gradient-based fast adaptation algorithm – model agnostic exploration with structured noise (MAESN) – to learn exploration strategies from prior experience. The prior experience is used both to initialize a policy and to acquire a latent exploration space that can inject structured stochasticity into a policy, producing exploration strategies that are informed by prior knowledge and are more effective than random action-space noise. We show that MAESN is more effective at learning exploration strategies when compared to prior meta-RL methods, RL without learned exploration strategies, and task-agnostic exploration methods. We evaluate our method on a variety of simulated tasks: locomotion with a wheeled robot, locomotion with a quadrupedal walker, and object manipulation.
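To make the core idea concrete, below is a minimal sketch, assuming a plain NumPy setup, of a policy conditioned on a per-task exploration latent: z is sampled once per episode from a learned Gaussian, so the injected stochasticity is temporally coherent rather than per-step action noise. All names and dimensions here (OBS_DIM, sample_latent, act, etc.) are illustrative assumptions, not the repository's actual API; in MAESN the policy initialization and the latent distribution parameters are meta-trained, and fast adaptation updates the latent parameters with policy gradients on a new task.

```python
import numpy as np

# Hypothetical dimensions; the real values depend on the environment.
OBS_DIM, ACT_DIM, LATENT_DIM = 4, 2, 2

rng = np.random.default_rng(0)

# Policy weights shared across tasks (stand-in for the meta-learned initialization).
W = rng.normal(scale=0.1, size=(ACT_DIM, OBS_DIM + LATENT_DIM))

# Per-task latent distribution parameters; fast adaptation would update these
# with policy gradients on the new task.
mu = np.zeros(LATENT_DIM)
log_sigma = np.zeros(LATENT_DIM)


def sample_latent():
    """Draw one exploration latent z ~ N(mu, sigma^2) for the whole episode."""
    return mu + np.exp(log_sigma) * rng.normal(size=LATENT_DIM)


def act(obs, z):
    """Policy output conditioned on both the observation and the episode latent."""
    return W @ np.concatenate([obs, z])


# One episode of structured exploration: z is held fixed across time steps,
# so the stochasticity is temporally coherent, unlike per-step action noise.
z = sample_latent()
obs = rng.normal(size=OBS_DIM)
for _ in range(5):
    action = act(obs, z)
    obs = obs + 0.1 * rng.normal(size=OBS_DIM)  # stand-in for env.step(action)
```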

Code: https://github.com/RussellM2020/maesn_suite.git

Comparison of meta-training performance across dense and sparse rewards

(Shown for the wheeled locomotion task)

Meta-training with Sparse Reward

When meta-training with sparse rewards, none of the methods learn to do anything, as can be seen from the curves of post-update reward on the wheeled locomotion environment.

Meta-training with Dense Reward

When meta-training with dense rewards, both MAML and MAESN achieve good post-update reward, as seen in the following curves. However, as the learned exploration schemes in the following section show, MAESN learns to explore while MAML does not, which allows MAESN to transfer better to new sparse-reward tasks.


Effect of reward shaping on meta-learned exploration schemes

Here we consider the effect of reward shaping at meta-training time on the exploration schemes learned by MAESN (as shown in Fig 7 in the paper). We consider the block manipulation task, and the figures below show that a sparser reward leads to more varied exploration (left) than an extremely dense reward does (right).

Pusher exploration trajectories when reward includes only distance from block to goal

Pusher exploration trajectories when reward includes distance from block to goal and from pusher to block
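To make the two reward variants above concrete, here is a minimal sketch of what they might look like; the state names (block_pos, goal_pos, pusher_pos) and the equal weighting of the two distance terms are illustrative assumptions, not the environment's actual implementation.

```python
import numpy as np

def sparser_pusher_reward(block_pos, goal_pos):
    """Sparser shaping: only the block-to-goal distance is penalized."""
    return -np.linalg.norm(block_pos - goal_pos)

def denser_pusher_reward(block_pos, goal_pos, pusher_pos):
    """Denser shaping: additionally penalizes the pusher-to-block distance,
    guiding the arm directly and leaving less need for varied exploration."""
    return (-np.linalg.norm(block_pos - goal_pos)
            - np.linalg.norm(pusher_pos - block_pos))
```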

Exploration Schemes Learned by MAESN and MAML

Random Exploration

Exploration learned by MAML

Exploration with MAESN (OURS)