LILA: Language-Informed Latent Actions
Siddharth Karamcheti*, Megha Srivastava*, Percy Liang, Dorsa Sadigh.
Department of Computer Science, Stanford University

Fusing Natural Language and Shared Autonomy for Assistive Teleoperation.
Paper | Code



We introduce Language-Informed Latent Actions (LILA), a framework for learning natural language interfaces in the context of human-robot collaboration. LILA falls under the shared autonomy paradigm: in addition to providing discrete language inputs, humans are given a low-dimensional controller – e.g., a 2 degree-of-freedom (DoF) joystick that moves left/right or up/down – for operating the robot.

LILA learns to use language to modulate this controller, providing users with a language-informed control space: given an instruction like "place the cereal bowl on the tray," LILA might learn a 2-DoF control space where one dimension controls the distance from the robot's end-effector to the bowl, and the other dimension controls the robot's end-effector pose relative to the grasp point on the bowl.
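To make this concrete, here is a minimal sketch of what "decoding" a low-DoF latent action into a high-DoF robot command looks like. A random linear map stands in for the learned, language-conditioned decoder; the function name, shapes, and weights are all illustrative assumptions, not the actual LILA implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def decode_latent_action(z, state, weights):
    """Map a 2-DoF latent action + robot state to a 7-DoF action.

    In LILA the decoder is a learned, language-conditioned network;
    here a fixed linear map stands in for it (illustrative only).
    """
    features = np.concatenate([z, state])  # (2 + 7,) input features
    return weights @ features              # (7,) high-DoF robot action

W = rng.normal(size=(7, 9))        # stand-in for learned decoder weights
z = np.array([0.5, -0.2])          # 2-DoF joystick deflection
state = np.zeros(7)                # current 7-DoF joint configuration
action = decode_latent_action(z, state, W)
assert action.shape == (7,)
```

The key point is the shape change: the user only ever supplies a 2-dimensional input, and the decoder expands it into a full 7-DoF command conditioned on the current state.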

We evaluate LILA with real-world user studies pairing users with a 7-DoF Franka Emika Panda Arm, with the goal of using our language-informed controller to complete a series of complex manipulation tasks. LILA models are not only more sample efficient and performant than imitation learning and end-effector control baselines, but are also qualitatively preferred by users.

LILA at a Glance

LILA falls within the shared autonomy paradigm wherein a human provides continuous input while being "assisted" by a robot; rather than enforce a sharp divide in agency, the human user and robot complete tasks together, sharing the load. In this work, we look at shared autonomy through the lens of assistive teleoperation, where a human user with a low-DoF control interface (e.g., a 2-DoF joystick) is trying to operate a high-DoF (7-DoF) robot manipulator.

Below is an example of our setup and our key contribution:
using language within an assistive teleoperation framework (Learned Latent Actions) to specify and disambiguate a wide variety of tasks.

The above figure presents a top-down overview of LILA; a user, equipped with a low-dimensional controller (the 2-DoF joystick in black) is attempting to operate the 7-DoF Franka Emika Panda Arm (left) to perform a series of tasks.

To enable "intuitive" control, we build atop the Learned Latent Actions paradigm, learning low-dimensional "latent action" spaces via dimensionality reduction – in our case, by training an autoencoder to compress and decode task-specific demonstrations. One of our key contributions is adding natural language to this loop, enabling linguistically aware models that users can rely on to complete a wide variety of tasks.

Specifically, users can use natural language to switch into an intuitive control mode for completing a task!

Learning with LILA

Driving LILA is a conditional autoencoder (CAE) with two components – an Encoder (left) and a Decoder (right). The encoder takes in a (state, language, action) tuple and predicts a latent "action" z. Similarly, the decoder takes in the (state, language, latent action) tuple and tries to reproduce the high-dimensional (high-DoF) action a'.
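The forward pass of this CAE can be sketched as follows. For brevity, linear maps stand in for the actual FiLM-modulated networks, and all dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative dimensions (not the paper's actual values).
STATE_DIM, LANG_DIM, ACTION_DIM, LATENT_DIM = 7, 16, 7, 2

# Stand-in linear "networks" for the encoder and decoder.
W_enc = rng.normal(size=(LATENT_DIM, STATE_DIM + LANG_DIM + ACTION_DIM))
W_dec = rng.normal(size=(ACTION_DIM, STATE_DIM + LANG_DIM + LATENT_DIM))

def encode(s, l, a):
    """(state, language, action) -> latent action z (low-DoF)."""
    return W_enc @ np.concatenate([s, l, a])

def decode(s, l, z):
    """(state, language, latent action) -> reconstructed action a'."""
    return W_dec @ np.concatenate([s, l, z])

s = rng.normal(size=STATE_DIM)     # robot state
l = rng.normal(size=LANG_DIM)      # language embedding
a = rng.normal(size=ACTION_DIM)    # demonstrated high-DoF action

z = encode(s, l, a)
a_prime = decode(s, l, z)
loss = np.mean((a - a_prime) ** 2)  # reconstruction objective
assert z.shape == (LATENT_DIM,) and a_prime.shape == (ACTION_DIM,)
```

At training time the reconstruction loss pushes the bottleneck z to capture exactly the task-relevant degrees of freedom; at test time only the decoder is used, with z supplied by the joystick.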

Crucially, we use Feature-Wise Linear Modulation (FiLM) layers to incorporate language information – see our paper and appendix for more detail. To embed language, we use a version of Distil-RoBERTa to obtain sentence representations that are passed to these FiLM layers.
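A FiLM layer is simple at its core: the language embedding predicts a per-feature scale (gamma) and shift (beta) that modulate intermediate features of the network. The sketch below uses fixed linear projections in place of the learned ones; names and dimensions are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

LANG_DIM, HIDDEN_DIM = 16, 32  # illustrative sizes

# Stand-ins for the learned projections that map language -> (gamma, beta).
W_gamma = rng.normal(size=(HIDDEN_DIM, LANG_DIM))
W_beta = rng.normal(size=(HIDDEN_DIM, LANG_DIM))

def film(features, lang_embedding):
    """Feature-wise affine modulation: gamma * h + beta."""
    gamma = W_gamma @ lang_embedding
    beta = W_beta @ lang_embedding
    return gamma * features + beta

h = rng.normal(size=HIDDEN_DIM)   # intermediate state features
lang = rng.normal(size=LANG_DIM)  # sentence embedding (e.g., from Distil-RoBERTa)
out = film(h, lang)
assert out.shape == (HIDDEN_DIM,)
```

Because the modulation is elementwise, the language can amplify, suppress, or re-sign individual features of the state representation, which is what lets a single network realize different control modes for different instructions.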

At test time, to prevent drift from the training distribution, we use nearest-neighbors retrieval to "project" new user utterances onto similar examples from the training set; not only is this easier to reason about, but it also leads to better sample efficiency.
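This retrieval step can be sketched as a cosine-similarity nearest-neighbor lookup over embedded training utterances. Random vectors stand in for real sentence embeddings here, so the example is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in embeddings for the utterances seen during training.
train_embeddings = rng.normal(size=(5, 16))

def nearest_neighbor(query, bank):
    """Index of the bank embedding most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return int(np.argmax(b @ q))

# A new user utterance that closely paraphrases training utterance #2.
query = train_embeddings[2] + 0.01 * rng.normal(size=16)
idx = nearest_neighbor(query, train_embeddings)
assert idx == 2  # the paraphrase is projected back onto its neighbor
```

The model then conditions on the retrieved training utterance's embedding rather than the raw user input, keeping the language conditioning within the distribution the CAE was trained on.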

LILA in Action – User Studies

Here we include videos from our user studies (anonymized), comparing LILA with Imitation Learning and End-Effector Control baselines.



LILA

Instruction: "Lift up the banana and place it in the purple basket."

Imitation Learning

Instruction: "Pick up the green bowl and put it on the tray."

End-Effector Control

No Instruction!
Goal: Pour(BlueCup, BlackMug)

The following shows three examples of users completing a single task with each strategy (each row corresponds to the same user):


User 1 - LILA

Instruction: "Pick up the banana and put it in the purple basket."

User 1 - Imitation Learning

Instructions: "Move left and grab the banana."
"Move left one inch."
"Move right one inch."
"Move forward one inch."
"Move upper diagonally."
"Move to the side."
"Move one inch."

User 1 - End-Effector Control

No Instruction!
Goal: Put(Banana, FruitBasket)


User 2 - LILA

Instruction: "Pick up the clear cup with the marbles in it and pour it in the black mug with the coffee beans in it."

User 2 - Imitation Learning

Instruction: "Pick up the clear container with the marbles in it and pour it into the mug with the coffee."

User 2 - End-Effector Control

No Instruction!
Goal: Pour(BlueCup, BlackMug)


User 3 - LILA

Instruction: "Grab the bowl and place it on the tray."

User 3 - Imitation Learning

Instruction: "Grab the cereal bowl and place it on the tray."

User 3 - End-Effector Control

No Instruction!
Goal: Put(CerealBowl, Tray)