From Language to Goals:

Inverse Reinforcement Learning for Vision-based Instruction Following

Justin Fu, Anoop Korattikara, Sergey Levine, Sergio Guadarrama

Abstract: Reinforcement learning is a promising framework for solving control problems, but its use in practical situations is hampered by the fact that reward functions are often difficult to engineer. Specifying goals and tasks for autonomous machines, such as robots, is a significant challenge: conventionally, reward functions and goal states have been used to communicate objectives. But people can communicate objectives to each other simply by describing or demonstrating them. How can we build learning algorithms that will allow us to tell machines what we want them to do? In this work, we investigate the problem of grounding language commands as reward functions using inverse reinforcement learning, and argue that language-conditioned rewards are more transferable than language-conditioned policies to new environments. We propose language-conditioned reward learning (LC-RL), which grounds language commands as a reward function represented by a deep neural network. We demonstrate that our model learns rewards that transfer to novel tasks and environments on realistic, high-dimensional visual environments with natural language commands, whereas directly learning a language-conditioned policy leads to poor performance.

Paper

Reward Functions for Instruction Following

Why are reward functions a good fit for instruction following, compared to learning a language-conditioned policy?

Intuition: Language-conditioned rewards allow for a natural separation between understanding the intent of a command and figuring out how to execute it.

Consider the two mazes above and the command "go to the green star". The reward function is identical in both mazes, but a policy transferred directly from one maze to the other would run into a wall and fail. Planning and problem solving are orthogonal to the language comprehension problem!

Learning Rewards with Deep Multitask Inverse Reinforcement Learning

We introduce language-conditioned deep inverse reinforcement learning, depicted in the diagram below. Our algorithm, based on MaxEnt IRL, alternates between updating the reward function and solving for the optimal policy under that reward. To incorporate language into IRL, we view the language command as part of the state/observation of the system, fixed for the entire episode. The reward function maps from this augmented state to a scalar reward value.
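As a rough illustration, the alternation can be sketched as a toy tabular example. The linear reward over one-hot state features, the goal-at-the-commanded-state demonstrations, and the softmax "planner" below are illustrative simplifications, not the paper's model or implementation.

```python
import numpy as np

# Toy sketch of the LC-RL alternation (illustrative simplifications only).
n_states, n_commands, lr = 16, 4, 0.5
theta = np.zeros((n_commands, n_states))           # reward parameters: r(s | command) = theta[command, s]
rng = np.random.default_rng(0)

def planner_state_visitation(r):
    # Placeholder for the inner RL/planning step (see the value iteration
    # sketch below); a real planner would return the soft-optimal policy's
    # state visitation frequencies under reward r.
    p = np.exp(r - r.max())
    return p / p.sum()

for _ in range(200):
    command = rng.integers(n_commands)
    demo_visitation = np.eye(n_states)[command]    # fake demos: the expert ends at the commanded state
    policy_visitation = planner_state_visitation(theta[command])
    # MaxEnt IRL update: demonstration feature expectations minus the
    # current policy's feature expectations (exact for a linear reward).
    theta[command] += lr * (demo_visitation - policy_visitation)
```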

IRL is often intractable in large environments because it requires solving for the optimal policy under the current reward at each iteration (this means RL is an inner loop!). The IRL gradient is given by the following equation.
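In language-conditioned notation, it takes the standard MaxEnt IRL form; the symbols here ($\mathcal{D}_\ell$ for the demonstrations of command $\ell$, and $\pi^*_{r_\theta}$ for the soft-optimal policy under the current reward) are our shorthand rather than a verbatim reproduction of the paper's display:

$$ \nabla_\theta \mathcal{L}(\theta) \;=\; \mathbb{E}_{\tau \sim \mathcal{D}_\ell}\Big[ \textstyle\sum_t \nabla_\theta r_\theta(s_t, \ell) \Big] \;-\; \mathbb{E}_{\tau \sim \pi^*_{r_\theta}}\Big[ \textstyle\sum_t \nabla_\theta r_\theta(s_t, \ell) \Big] $$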

To study the language problem in isolation, we evaluate in tabular environments, where we can use value iteration to compute optimal policies quickly. However, because our reward function maps observations (images) and language to rewards, it can still be evaluated anywhere, which lets us use it at test time in unseen environments.
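A minimal sketch of soft value iteration for the tabular inner loop is shown below, assuming a known transition model. The random MDP and the reward vector are placeholders; in LC-RL each reward entry would come from evaluating the reward network on that state's image observation together with the language command.

```python
import numpy as np

# Soft value iteration on a tabular MDP (placeholder MDP and reward).
n_states, n_actions, gamma = 16, 4, 0.95
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, s']
reward = rng.normal(size=n_states)                                 # stand-in for reward_net(obs_s, command)

V = np.zeros(n_states)
for _ in range(200):
    # Soft Bellman backup: Q(s, a) = r(s) + gamma * E_{s'}[V(s')]
    Q = reward[:, None] + gamma * np.einsum('ast,t->sa', P, V)
    V = np.log(np.exp(Q).sum(axis=1))                # log-sum-exp over actions

policy = np.exp(Q - V[:, None])                      # soft-optimal pi(a | s)
```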

Reward Learning in an Indoor Scene Simulator

We build our environment on top of the SUNCG simulator. Our agent moves on a grid and is allowed to turn, walk forward, and pick up or drop certain objects within the environment. An example of a random walk in one of the environments is shown on the right (the agent is marked as a green triangle in the bird's-eye view).
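For concreteness, the agent's interface can be sketched roughly as follows; the class name, action names, and method signatures are illustrative assumptions, not the actual simulator code.

```python
from enum import IntEnum

# Illustrative interface for the grid-based agent (names are assumptions).
class Action(IntEnum):
    TURN_LEFT = 0
    TURN_RIGHT = 1
    FORWARD = 2
    PICKUP = 3
    DROP = 4

class IndoorEnv:
    def reset(self, house_id: str, command: str):
        """Load a house, place the agent, and fix the language command for
        the episode; returns the first (image, command) observation."""
        raise NotImplementedError

    def step(self, action: Action):
        """Apply one discrete action and return (observation, done, info).
        No hand-engineered reward is returned -- the learned
        language-conditioned reward network scores the observations."""
        raise NotImplementedError
```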

We consider two types of tasks:

  • Navigation (NAV): The agent must navigate to a target location or object. Language commands are of the form "Go to X", where X is the name of a location such as "living room" or an object such as "fruit bowl".

  • Pick-and-place (PICK): The agent must pick up an object and drop it somewhere else. Language commands are of the form "Move X to Y", where X is the name of an object and Y is the name of a location (a small template sketch follows the list).
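As referenced above, the two command templates can be summarized with a small generator; the object and location names here are only the examples mentioned in the text, not the full vocabulary.

```python
# Illustrative command generator for the two task families.
objects = ["fruit bowl", "pan"]
locations = ["living room", "bathroom"]

nav_commands = [f"Go to {x}" for x in objects + locations]                # NAV
pick_commands = [f"Move {x} to {y}" for x in objects for y in locations]  # PICK
```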

An example learned reward is shown on the right. The agent is commanded to move the fruit bowl to the bathroom. We see that the reward places high mass in the bathroom area (red arrow) after the fruit bowl is placed down. A few noisy artifacts can be seen, such as higher reward around the fruit bowl in its initial position (green arrow).

Blue in the plot means high reward/value, red means low.

Sample Tasks

Here we include additional examples of learned rewards (using IRL, reward regression, or a state-only version of GAIL). As in the previous diagram, red denotes low reward and blue denotes high reward.

We notice that rewards learned via direct reward regression tend to have less extraneous noise, whereas those learned via language-conditioned IRL or GAIL have minor artifacts.

Each row of the figure shows one task ("Go to fruit bowl" and "Move pan to living room"), and each column shows the reward visualization learned by Inverse RL, GAIL, and ground-truth regression.