Commanding multi-robot systems with natural language
We want to command robots with natural language:
"navigate to the top edge"
"the south-east corner is your target"
and learn policies that execute these commands for multi-agent teams!
1. Project natural language tasks into a well-behaved latent space using an LLM.
2. Train a task-conditioned policy to interpret this latent space using offline RL.
Keeping the LLM out of the action loop reduces latency at deployment; a rough sketch of the pipeline follows below.
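Below is a minimal sketch of this pipeline, not the authors' implementation: the sentence encoder standing in for the LLM, the network sizes, and names like `TaskConditionedPolicy` are all assumptions.

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer  # any frozen text encoder works

# The language model runs once per command, offline, never in the control loop.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

class TaskConditionedPolicy(nn.Module):
    """Maps (observation, task embedding) -> action logits."""
    def __init__(self, obs_dim: int, task_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + task_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, obs: torch.Tensor, task_z: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, task_z], dim=-1))

# Embed the command once; cache task_z for the whole episode.
task_z = torch.tensor(encoder.encode("navigate to the top edge"))
policy = TaskConditionedPolicy(obs_dim=8, task_dim=task_z.shape[-1], n_actions=5)

# Inside the control loop only the small policy network runs,
# so per-step latency is independent of the language model.
obs = torch.zeros(8)  # placeholder observation
action = policy(obs, task_z).argmax(-1)
```

Because the embedding is cached, issuing a new command costs one LLM call, while every control step costs only a small MLP forward pass.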
1. Collect real-world data from a single robot behaving randomly.
2. Simulate multiple robots by combining independent samples from the single-robot dataset (sketched below).
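Here is a minimal sketch of that stitching step, valid under the assumption that robots do not physically interact (as in the navigation tasks below); the data layout and function names are hypothetical, not the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose the real-world dataset is a list of single-robot transitions
# (state, action, next_state), with each state a 2-D position.
single_robot_data = [
    (rng.uniform(0, 1, 2), rng.integers(0, 5), rng.uniform(0, 1, 2))
    for _ in range(10_000)
]

def sample_multi_robot_transition(n_agents: int):
    """Stack n_agents independently sampled single-robot transitions
    into one joint multi-agent transition. Rewards are omitted; in
    offline RL they would typically be relabeled afterwards from the
    joint states for the commanded task."""
    idx = rng.choice(len(single_robot_data), size=n_agents, replace=False)
    states, actions, next_states = zip(*(single_robot_data[i] for i in idx))
    return np.stack(states), np.array(actions), np.stack(next_states)

joint_s, joint_a, joint_s2 = sample_multi_robot_transition(n_agents=5)
```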
Many multi-agent tasks can be phrased as navigation tasks: formation keeping, flocking, target following, and so on are all forms of multi-agent navigation.
Five agents execute natural language tasks in real time.
Agents, goals, and natural language tasks are color-coded; goals are overlaid as matching colored circles. All videos play at 1x speed.
Next, we demonstrate our approach on out-of-distribution natural language tasks: phrasings never seen during training.
The policies still accomplish their tasks in most cases. For instance, the phrases "go to the" and "north east" were never seen during training, yet the agents still know what to do.
We propose a small modification to Q-Learning using Expected SARSA. We find that our approach outperforms Q-Learning and Conservative Q-Learning (CQL).
Videos: Conservative Q-Learning, Q-Learning, Expected SARSA Mean Q, and Expected SARSA Soft Q, each shown on in-distribution and out-of-distribution tasks.
We improve performance with a single-line code change, sketched below.
Q-Learning (Expected SARSA with Greedy Policy)
Mean Q-Learning (Expected SARSA with Uniform Collection Policy)
Soft Q-Learning (Expected SARSA with Boltzmann Policy)
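The sketch below illustrates that single-line difference: the three variants differ only in how the next-state value is taken in expectation over actions. The helper name `td_target` and the tensor layout are assumptions, not the authors' code; `q_next` holds Q(s', ·) from the target network with shape [batch, n_actions].

```python
import torch

def td_target(reward, q_next, gamma=0.99, variant="soft", tau=1.0):
    """One-step TD target; the variants differ in a single line."""
    if variant == "greedy":   # Q-Learning: Expected SARSA under the greedy policy
        v_next = q_next.max(dim=-1).values
    elif variant == "mean":   # Mean Q-Learning: Expected SARSA under a uniform policy
        v_next = q_next.mean(dim=-1)
    elif variant == "soft":   # Soft Q-Learning: Expected SARSA under a Boltzmann policy
        pi = torch.softmax(q_next / tau, dim=-1)
        v_next = (pi * q_next).sum(dim=-1)
    return reward + gamma * v_next
```

In a standard Q-Learning implementation, swapping the `max` for the uniform mean or the Boltzmann expectation is exactly this kind of single-line change.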
The Mean-Q and Soft-Q policies are theoretically suboptimal, but we find that they perform better in practice. This mirrors the comparison of SARSA and Q-Learning in Sutton and Barto's textbook.
Does the method work for more complex and compound sentences?
Short answer: yes! More details in the Addendum.
See the experimental setup here.
Read more about Expected SARSA here.