Commanding multi-robot systems with natural language
We want to command robots with natural language:
"navigate to the top edge"
"the south-east corner is your target"
and learn policies that execute these commands for multi-agent teams!
1. Project natural language tasks into a well-behaved latent space using an LLM.
2. Train a task-conditioned policy to interpret this latent space using offline RL.
Keeping the LLM out of the action loop reduces latency at deployment; a rough sketch of the pipeline follows below.
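Below is a minimal sketch of this pipeline, not the authors' implementation: the sentence encoder standing in for the LLM, the network sizes, and names like `TaskConditionedPolicy` are all assumptions.

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer  # any frozen text encoder works

# The language model runs once per command, offline, never in the control loop.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

class TaskConditionedPolicy(nn.Module):
    """Maps (observation, task embedding) -> action logits."""
    def __init__(self, obs_dim: int, task_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + task_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, obs: torch.Tensor, task_z: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, task_z], dim=-1))

# Embed the command once; cache task_z for the whole episode.
task_z = torch.tensor(encoder.encode("navigate to the top edge"))
policy = TaskConditionedPolicy(obs_dim=8, task_dim=task_z.shape[-1], n_actions=5)

# Inside the control loop only the small policy network runs,
# so per-step latency is independent of the language model.
obs = torch.zeros(8)  # placeholder observation
action = policy(obs, task_z).argmax(-1)
```

Because the embedding is cached, issuing a new command costs one LLM call, while every control step costs only a small MLP forward pass.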
1. Collect real-world data from a single robot behaving randomly.
2. Simulate multiple robots by combining independent samples from the single-robot dataset (sketched below).
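Here is a minimal sketch of that stitching step, valid under the assumption that robots do not physically interact (as in the navigation tasks below); the data layout and function names are hypothetical, not the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose the real-world dataset is a list of single-robot transitions
# (state, action, next_state), with each state a 2-D position.
single_robot_data = [
    (rng.uniform(0, 1, 2), rng.integers(0, 5), rng.uniform(0, 1, 2))
    for _ in range(10_000)
]

def sample_multi_robot_transition(n_agents: int):
    """Stack n_agents independently sampled single-robot transitions
    into one joint multi-agent transition. Rewards are omitted; in
    offline RL they would typically be relabeled afterwards from the
    joint states for the commanded task."""
    idx = rng.choice(len(single_robot_data), size=n_agents, replace=False)
    states, actions, next_states = zip(*(single_robot_data[i] for i in idx))
    return np.stack(states), np.array(actions), np.stack(next_states)

joint_s, joint_a, joint_s2 = sample_multi_robot_transition(n_agents=5)
```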
Many multi-agent tasks can be phrased as navigation tasks: formation keeping, flocking, target following, and so on are all forms of multi-agent navigation.
Five agents execute natural language tasks in real time.
Agents, goals, and natural language tasks are color-coded; goals are overlaid as matching colored circles. All videos play at 1x speed.
Next, we demonstrate our approach on out-of-distribution natural language tasks: phrasings never seen during training.
The policies still accomplish their tasks in most cases. For instance, the phrases "go to the" and "north east" were never seen during training, yet the agents still know what to do.
We propose a small modification to Q-Learning using Expected SARSA. We find that our approach outperforms Q-Learning and Conservative Q-Learning (CQL).
Videos: Conservative Q-Learning, Q-Learning, Expected SARSA Mean Q, and Expected SARSA Soft Q, each shown on in-distribution and out-of-distribution tasks.
We improve performance with a single-line code change, sketched below.
Q-Learning (Expected SARSA with Greedy Policy)
Mean Q-Learning (Expected SARSA with Uniform Collection Policy)
Soft Q-Learning (Expected SARSA with Boltzmann Policy)
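The sketch below illustrates that single-line difference: the three variants differ only in how the next-state value is taken in expectation over actions. The helper name `td_target` and the tensor layout are assumptions, not the authors' code; `q_next` holds Q(s', ·) from the target network with shape [batch, n_actions].

```python
import torch

def td_target(reward, q_next, gamma=0.99, variant="soft", tau=1.0):
    """One-step TD target; the variants differ in a single line."""
    if variant == "greedy":   # Q-Learning: Expected SARSA under the greedy policy
        v_next = q_next.max(dim=-1).values
    elif variant == "mean":   # Mean Q-Learning: Expected SARSA under a uniform policy
        v_next = q_next.mean(dim=-1)
    elif variant == "soft":   # Soft Q-Learning: Expected SARSA under a Boltzmann policy
        pi = torch.softmax(q_next / tau, dim=-1)
        v_next = (pi * q_next).sum(dim=-1)
    return reward + gamma * v_next
```

In a standard Q-Learning implementation, swapping the `max` for the uniform mean or the Boltzmann expectation is exactly this kind of single-line change.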
The Mean-Q and Soft-Q policies are theoretically suboptimal, but we find that they perform better in practice. This mirrors the comparison of SARSA and Q-Learning in Sutton and Barto's textbook.
Does the method work for more complex and compound sentences?
Short answer: yes! More details in the Addendum.
See the experimental setup here.
Read more about Expected SARSA here.