Guiding Policies with Language via Meta-Learning

John D. Co-Reyes Abhishek Gupta Suvansh Sanjeev Nick Altieri Jacob Andreas John DeNero Pieter Abbeel Sergey Levine

UC Berkeley

Paper | Code | Poster

Abstract: Behavioral skills or policies for autonomous agents are conventionally learned from reward functions, via reinforcement learning, or from demonstrations, via imitation learning. However, both modes of task specification have their disadvantages: reward functions require manual engineering, while demonstrations require a human expert to be able to actually perform the task in order to generate the demonstration. Instruction following with natural language provides an appealing alternative: in the same way that we can specify goals to other humans simply by speaking or writing, we would like to be able to specify tasks for our machines. However, a single instruction may be insufficient to fully communicate our intent or, even if it is, may be insufficient for an autonomous agent to actually understand how to perform the desired task. In this work, we propose an interactive formulation of the task specification problem, where iterative language corrections are provided to an autonomous agent, guiding it in acquiring the desired skill. Our proposed language-guided policy learning algorithm can integrate an instruction and a sequence of corrections to acquire new skills very quickly. In our experiments, we show that this method can enable a policy to follow instructions and corrections for simulated navigation and manipulation tasks, substantially outperforming direct, non-interactive instruction following.

Method Overview

A human provides an agent with an ambiguous instruction. The agent cannot fully deduce the user's intent from the instruction alone and attempts the task. The human then provides iterative language corrections on the agent's behavior, gradually guiding it toward the desired behavior. Our method meta-learns policies that can ground these language corrections in the environment and use them to improve through iterative feedback.
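To make the interaction loop concrete, here is a minimal Python sketch of the procedure described above. The `env`, `policy`, and `get_correction` interfaces are hypothetical stand-ins for illustration, not the released API.

```python
# Minimal sketch of the interactive correction loop (illustrative only).
# `env`, `policy`, and `get_correction` are hypothetical interfaces.

def guided_episode(env, policy, instruction, max_rounds=6):
    """Run successive trajectories, feeding each correction back to the policy."""
    history = []  # (trajectory, correction) pairs collected so far
    trajectory = None
    for _ in range(max_rounds):
        trajectory = rollout(env, policy, instruction, history)
        correction = get_correction(trajectory)  # human or scripted feedback
        if correction == "Solved.":
            break
        history.append((trajectory, correction))
    return trajectory


def rollout(env, policy, instruction, history):
    """Collect one trajectory conditioned on the instruction and prior corrections."""
    state, done, trajectory = env.reset(), False, []
    while not done:
        action = policy.act(state, instruction, history)
        next_state, done = env.step(action)
        trajectory.append((state, action))
        state = next_state
    return trajectory
```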


Model

The architecture of our model. The instruction module embeds the initial instruction L. The correction module embeds each previous trajectory together with the correction given for that trajectory. The features from these (trajectory, correction) pairs are pooled and passed to the policy module, which takes the current state, the embedded instruction, and the pooled correction features and outputs an action distribution.
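The sketch below shows one way the three modules could be wired up in PyTorch. The class name, recurrent encoders, hidden sizes, and mean-pooling are illustrative assumptions and may not match the released implementation.

```python
# Illustrative PyTorch sketch of the three-module architecture.
# Encoder choices, sizes, and the class name are assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class LanguageGuidedPolicy(nn.Module):
    def __init__(self, vocab_size, state_dim, action_dim, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        # Instruction module: encodes the initial instruction L.
        self.instruction_rnn = nn.GRU(hidden, hidden, batch_first=True)
        # Correction module: encodes each previous trajectory and its correction.
        self.traj_rnn = nn.GRU(state_dim, hidden, batch_first=True)
        self.corr_rnn = nn.GRU(hidden, hidden, batch_first=True)
        # Policy module: state + embedded instruction + pooled correction features -> actions.
        self.policy = nn.Sequential(
            nn.Linear(state_dim + 3 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, instruction, trajectories, corrections):
        # state: (1, state_dim); instruction: (1, T_i) token ids;
        # trajectories: list of (1, T, state_dim); corrections: list of (1, T_c) token ids.
        _, h_instr = self.instruction_rnn(self.embed(instruction))
        pair_feats = [
            torch.cat([self.traj_rnn(traj)[1][-1],
                       self.corr_rnn(self.embed(corr))[1][-1]], dim=-1)
            for traj, corr in zip(trajectories, corrections)
        ]
        if pair_feats:  # pool features over all (trajectory, correction) pairs
            pooled = torch.stack(pair_feats).mean(dim=0)
        else:           # before the first correction is given
            pooled = torch.zeros(state.size(0), 2 * self.corr_rnn.hidden_size)
        x = torch.cat([state, h_instr[-1], pooled], dim=-1)
        return torch.distributions.Categorical(logits=self.policy(x))
```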


Learning Examples (Multi-Room Object Manipulation Environment)

Right: the multi-room object manipulation environment, which is partially observed. The task is to pick up a particular object (the goal object) in one room and bring it to a goal location (the goal square) in another room. The agent sees only the 7x7 grid around itself and cannot see past walls or closed doors.

Below: example trajectories in the multi-room object manipulation environment. Each row of GIFs corresponds to successive trajectories in a given environment, and the caption of each GIF is the correction generated for the trajectory shown. The goal of our evaluation is to determine whether the corrections can iteratively improve the policy on new tasks.
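The captions in these rows follow simple templates ("Enter the … room.", "Pick up the ….", "Go to the … goal."). One possible scripted corrector, sketched below, returns the first unsatisfied subgoal; the `task` and `trajectory` fields are hypothetical and may differ from the generator actually used.

```python
# Illustrative sketch of a scripted corrector for the gridworld tasks.
# The `task` and `trajectory` interfaces are hypothetical stand-ins.

def gridworld_correction(task, trajectory):
    """Return the first unsatisfied subgoal as a language correction."""
    subgoals = [
        (f"Enter the {task.object_room} room.", trajectory.entered(task.object_room)),
        (f"Pick up the {task.goal_object}.",    trajectory.picked_up(task.goal_object)),
        (f"Enter the {task.goal_room} room.",   trajectory.entered(task.goal_room)),
        (f"Go to the {task.goal_square} goal.", trajectory.reached(task.goal_square)),
    ]
    for correction, satisfied in subgoals:
        if not satisfied:
            return correction
    return "Solved."
```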

Success Cases:

Instruction: Move the blue ball to the yellow goal.

Enter the green room.

Pick up the blue ball.

Pick up the blue ball.

Enter the blue room.

Pick up the blue ball.

Solved.

Instruction: Move the yellow square to the red goal.

Enter the red room.

Enter the yellow room.

Enter the yellow room.

Enter the yellow room.

Go to the purple goal.

Solved.

Instruction: Move the yellow ball to the green goal.

Enter the blue room.

Enter the red room.

Solved.

Instruction: Move the yellow ball to the green goal.

Enter the yellow room.

Enter the purple room.

Solved.

Failure Cases:

Instruction: Move the yellow square to the yellow goal.

Enter the grey room.

Pick up the grey triangle.

Enter the green room.

Pick up the grey triangle.

Go to the yellow goal.

Enter the grey room.

Instruction: Move the grey triangle to the blue goal.

Pick up the grey triangle.

Enter the yellow room.

Pick up the grey triangle.

Pick up the grey triangle.

Go to the blue goal.

Pick up the grey triangle.

Learning Examples (Robotic Object Relocation)

Right: the robotic object relocation environment, which is fully observed. The task is to push one of the three movable colored objects in front of the robot (the goal object) to the red square (the goal square). The remaining colored objects are immovable obstacles.

Below: example trajectories in the robotic object relocation environment. Each row of GIFs corresponds to successive trajectories in a given environment, and the caption of each GIF is the correction generated for the trajectory shown. The goal of our evaluation is to determine whether the corrections can iteratively improve the policy on new tasks.
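In this continuous environment, corrections such as "Move a little up" or "Move a lot up" can be read as functions of the offset between the goal object and the goal square. The sketch below is one such heuristic; the thresholds and phrasing are illustrative assumptions, not the generator used in the paper.

```python
# Illustrative distance-based corrector for the object relocation tasks.
# Thresholds (`near`, `far`) and phrasing are assumptions for this sketch.

def relocation_correction(obj_xy, goal_xy, near=0.05, far=0.3):
    dx, dy = goal_xy[0] - obj_xy[0], goal_xy[1] - obj_xy[1]
    if abs(dx) < near and abs(dy) < near:
        return "Solved."
    # Correct along the axis with the larger remaining error.
    if abs(dx) >= abs(dy):
        direction, delta = ("right" if dx > 0 else "left"), dx
    else:
        direction, delta = ("up" if dy > 0 else "down"), dy
    amount = "a lot" if abs(delta) > far else "a little"
    return f"Move {amount} {direction}."
```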

Success Cases:

Instruction: Move cyan block below magenta block.

Touch cyan block.

Move closer to magenta block.

Move a lot up.

Move a little up.

Solved.

Failure Cases:

Instruction: Move red block to right of green block.

Move right of green block.

Move a little left.

Move towards green block.

Move right of green block.

Move a little right.