POLCO

Policy Optimization with Language Constraints

Abstract

In this paper, we tackle the problem of learning control policies for tasks when provided with constraints in natural language. In contrast to instruction following, language here is used not to specify goals, but rather to describe situations that an agent must avoid during its exploration of the environment. Specifying constraints in natural language also differs from the predominant paradigm in safe reinforcement learning, where safety criteria are enforced by hand-defined cost functions. While natural language allows for easy and flexible specification of safety constraints and budget limitations, its ambiguous nature presents a challenge when mapping these specifications into representations that can be used by techniques for safe reinforcement learning. To address this, we develop a model that contains two components: (1) a constraint interpreter to encode natural language constraints into vector representations capturing spatial and temporal information on forbidden states, and (2) a policy network that uses these representations to output a policy with minimal constraint violations. Our model is end-to-end differentiable and we train it using a recently proposed algorithm for constrained policy optimization. To empirically demonstrate the effectiveness of our approach, we create a new benchmark task for autonomous navigation with crowd-sourced free-form text specifying three different types of constraints. Our method outperforms several baselines by achieving 6.92 times higher returns and 76.86% fewer constraint violations on average.
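To make the data flow between the two components concrete, here is a minimal sketch in plain Python. It is not our implementation: the vocabulary `VOCAB`, the bag-of-words encoder `encode_constraint`, and the linear `policy_logits` function are hypothetical stand-ins chosen for illustration. The actual constraint interpreter is a learned model that captures spatial and temporal information, which a bag-of-words vector does not.

```python
import math

# Hypothetical toy vocabulary (an assumption for this sketch only).
VOCAB = ["lava", "water", "grass", "less", "than", "three", "times", "never", "before"]


def encode_constraint(text):
    """Toy stand-in for the constraint interpreter: bag-of-words counts over VOCAB."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    return [float(tokens.count(word)) for word in VOCAB]


def policy_logits(observation, constraint_vec, weights):
    """Toy linear policy: one logit per action from the concatenated
    observation and constraint vectors. `weights` has one row per action."""
    features = list(observation) + list(constraint_vec)
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def softmax(logits):
    """Convert logits into action probabilities (numerically stable)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Usage: encode a constraint, then condition the policy on it.
constraint_vec = encode_constraint("go through lava less than three times.")
observation = [0.5, -0.2]  # hypothetical 2-dimensional observation
weights = [[0.1] * (len(observation) + len(VOCAB)) for _ in range(3)]  # 3 actions
action_probs = softmax(policy_logits(observation, constraint_vec, weights))
```

The point of the sketch is only the interface: the text is mapped to a fixed-size vector once, and the policy consumes that vector alongside each observation.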

Reproducibility

We open-source our dataset and the running environment.

Model Demo

Here, the agent (red triangle) collects the reward entities (ball, box, and key) while obeying the safety constraint "go through lava less than three times."
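A constraint such as the one quoted above can be read as a budget on how often a hazardous cell is visited. The sketch below checks a finished trajectory against that budget. The names `count_violations` and `satisfies_budget`, and the representation of a trajectory as a list of cell labels, are assumptions made for illustration; they are not the interface of our environment.

```python
def count_violations(trajectory, hazard="lava"):
    """Count how many steps of the trajectory land on the hazardous cell type."""
    return sum(1 for cell in trajectory if cell == hazard)


def satisfies_budget(trajectory, hazard="lava", limit=3):
    """'Less than three times' means strictly fewer than `limit` visits."""
    return count_violations(trajectory, hazard) < limit


# Usage: a hypothetical trajectory given as the cell type visited at each step.
trajectory = ["floor", "lava", "floor", "ball", "lava", "key"]
within_budget = satisfies_budget(trajectory)
```

Note the strict inequality: a trajectory with exactly three lava visits violates the constraint.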

Here, we take the agent trained in the first environment (Hazard World) and test it in a second, unseen environment (LavaWall). The agent generalizes to the new environment while still obeying the safety constraint.