LILA: Language-Informed Latent Actions
Siddharth Karamcheti*, Megha Srivastava*, Percy Liang, Dorsa Sadigh.
Department of Computer Science, Stanford University


Fusing Natural Language and Shared Autonomy for Assistive Teleoperation.
Paper | Code

Summary

lila-thumbnail.mov

Abstract

We introduce Language-Informed Latent Actions (LILA), a framework for learning natural language interfaces in the context of human-robot collaboration. LILA falls under the shared autonomy paradigm: in addition to providing discrete language inputs, humans are given a low-dimensional controller – e.g., a 2 degree-of-freedom (DoF) joystick that moves left/right or up/down – for operating the robot.

LILA learns to use language to modulate this controller, providing users with a language-informed control space: given an instruction like "place the cereal bowl on the tray," LILA might learn a 2-DoF control space where one dimension controls the distance from the robot's end-effector to the bowl, and the other dimension controls the robot's end-effector pose relative to the grasp point on the bowl.

We evaluate LILA with real-world user studies pairing users with a 7-DoF Franka Emika Panda Arm, with the goal of using our language-informed controller to complete a series of complex manipulation tasks. LILA models are not only more sample efficient and performant than imitation learning and end-effector control baselines, but are also qualitatively preferred by users.

LILA at a Glance

LILA falls within the shared autonomy paradigm wherein a human provides continuous input while being "assisted" by a robot; rather than enforce a sharp divide in agency, the human user and robot complete tasks together, sharing the load. In this work, we look at shared autonomy through the lens of assistive teleoperation, where a human user with a low-DoF control interface (e.g., a 2-DoF joystick) is trying to operate a high-DoF (7-DoF) robot manipulator.

Below, you see an example of our setup and our key contribution: using language within an assistive teleoperation framework (Learned Latent Actions) to specify and disambiguate a wide variety of tasks.

The above figure presents a top-down overview of LILA; a user, equipped with a low-dimensional controller (the 2-DoF joystick in black), is attempting to operate the 7-DoF Franka Emika Panda Arm (left) to perform a series of tasks.

To enable "intuitive" control, we use the build atop of the Learned Latent Actions paradigm to learn low-dimensional "latent action" spaces via dimensionality reduction – in our case, training an auto-encoder to compress, and decode task-specific demonstrations. One of our key contributions is adding natural language to this loop, enabling linguistically-aware models that users can use to complete a wide variety of diverse tasks.

Specifically, users can use natural language to switch into an intuitive control mode for completing a task!

Learning with LILA

Driving LILA is a conditional autoencoder (CAE) with two components – an Encoder (left) and a Decoder (right). The encoder takes in a (state, language, action) tuple and predicts a latent "action" z. Similarly, the decoder takes in the (state, language, latent action) tuple and tries to reproduce the high-dimensional (high-DoF) action a'.
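
For intuition, here is a minimal PyTorch sketch of such a conditional autoencoder – not the released implementation. The state, action, latent, and language dimensions are illustrative assumptions, and language is folded in by simple concatenation; as described below, the actual model conditions on language with FiLM layers instead.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 7-DoF state/action, 2-DoF latent, 768-dim sentence embedding.
STATE_DIM, ACTION_DIM, LATENT_DIM, LANG_DIM = 7, 7, 2, 768

class Encoder(nn.Module):
    """Compresses a (state, language, action) tuple into a low-dimensional latent action z."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + LANG_DIM + ACTION_DIM, 128), nn.ReLU(),
            nn.Linear(128, LATENT_DIM),
        )

    def forward(self, state, lang, action):
        return self.net(torch.cat([state, lang, action], dim=-1))

class Decoder(nn.Module):
    """Maps a (state, language, latent action) tuple back to a high-DoF action a'."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + LANG_DIM + LATENT_DIM, 128), nn.ReLU(),
            nn.Linear(128, ACTION_DIM),
        )

    def forward(self, state, lang, z):
        return self.net(torch.cat([state, lang, z], dim=-1))

# One illustrative training step: reconstruct the demonstrated action through the latent bottleneck.
encoder, decoder = Encoder(), Decoder()
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

state = torch.randn(32, STATE_DIM)    # batch of robot states from demonstrations
lang = torch.randn(32, LANG_DIM)      # sentence embeddings of the paired instructions
action = torch.randn(32, ACTION_DIM)  # demonstrated high-DoF actions

z = encoder(state, lang, action)
reconstruction = decoder(state, lang, z)
loss = nn.functional.mse_loss(reconstruction, action)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```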

Crucially, we use Feature-Wise Linear Modulation (or FiLM) layers to incorporate language information – check out our paper and appendix for details. When embedding language, we use a version of DistilRoBERTa to obtain sentence representations that are passed to these FiLM layers.
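
As a rough sketch (layer sizes and placement are assumptions, not the paper's exact architecture), a FiLM layer maps the sentence embedding to a per-feature scale and shift that modulate the network's hidden features:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: language produces a per-feature scale (gamma) and shift (beta)."""
    def __init__(self, lang_dim: int, feature_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(lang_dim, feature_dim)  # scale
        self.to_beta = nn.Linear(lang_dim, feature_dim)   # shift

    def forward(self, features, lang):
        return self.to_gamma(lang) * features + self.to_beta(lang)

# Hypothetical usage inside a hidden layer of the encoder or decoder:
film = FiLM(lang_dim=768, feature_dim=128)
hidden = torch.randn(32, 128)   # hidden features computed from state and latent action
lang = torch.randn(32, 768)     # stand-in for DistilRoBERTa sentence embeddings
modulated = film(hidden, lang)  # language-conditioned features
```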

At test time, to prevent drift from the training distribution, we use nearest-neighbors retrieval to "project" new user utterances onto similar examples from the training set; not only is this easier to reason about, but it also leads to better sample efficiency!
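
Here is a minimal sketch of that retrieval step, assuming cosine similarity over a cache of embeddings for the training utterances (the metric, cache, and example strings are illustrative):

```python
import torch
import torch.nn.functional as F

# Hypothetical cache: one sentence embedding per instruction seen during training.
train_utterances = [
    "pick up the banana and put it in the purple basket",
    "place the cereal bowl on the tray",
]
train_embeddings = torch.randn(len(train_utterances), 768)  # stand-in for DistilRoBERTa embeddings

def retrieve(user_embedding: torch.Tensor) -> str:
    """Project a new user utterance onto its nearest training instruction."""
    sims = F.cosine_similarity(user_embedding.unsqueeze(0), train_embeddings, dim=-1)
    return train_utterances[int(sims.argmax())]

# At test time, the retrieved instruction's embedding conditions the decoder, and the user's
# 2-DoF joystick input z is decoded into a 7-DoF robot action at every timestep.
new_embedding = torch.randn(768)  # embedding of, e.g., "grab the banana for me"
print(retrieve(new_embedding))
```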

LILA in Action – User Studies

Here we include videos from our user studies (anonymized), comparing LILA with Imitation Learning and End-Effector Control baselines.

lila_banana.mp4
il_bowl.mp4
endeff_cup.mp4

LILA

Instruction: "Lift up the banana and place it in the purple basket."

Imitation Learning

Instruction: "Pick up the green bowl and put it on the tray."

End-Effector Control

No Instruction!
Goal: Pour(BlueCup, BlackMug)

The following shows three examples of users completing a single task with each strategy (each row corresponds to the same user)!

User-1-LILA.mp4
User-1-Imitation.mp4
User-1-EndEff.mp4

User 1 - LILA

Instruction: "Pick up the banana and put it in the purple basket."

User 1 - Imitation Learning

Instruction:
"Move left and grab the banana.
Move left one inch.
Move right one inch.
Move forward one inch.
Move upper diagonally.
Move to the side.
Move one inch."

User 1 - End-Effector Control

No Instruction!
Goal: Put(Banana, FruitBasket)

User-2-LILA.mp4
User-2-Imitation.mp4
User-2-EndEff.mp4

User 2 - LILA

Instruction: "Pick up the clear cup with the marbles in it and pour it in the black mug with the coffee beans in it."

User 2 - Imitation Learning

Instruction: "Pick up the clear container with the marbles in it and pour it into the mug with the coffee."

User 2 - End-Effector Control

No Instruction!
Goal: Pour(BlueCup, BlackMug)

User-3-LILA.mp4
User-3-Imitation.mp4
User-3-EndEff.mp4

User 3 - LILA

Instruction: "Grab the bowl and place it on the tray."

User 3 - Imitation Learning

Instruction: "Grab the cereal bowl and place it on the tray."

User 3 - End-Effector Control

No Instruction!
Goal: Put(CerealBowl, Tray)