Text2Motion: From Natural Language Instructions to Feasible Plans
Kevin Lin*, Christopher Agia*, Toki Migimatsu, Marco Pavone, Jeannette Bohg
Stanford University
Abstract
We propose Text2Motion, a language-based planning framework enabling robots to solve sequential manipulation tasks that require long-horizon reasoning. Given a natural language instruction, our framework constructs both a task- and motion-level plan that is verified to reach inferred symbolic goals. Text2Motion uses feasibility heuristics encoded in Q-functions of a library of skills to guide task planning with Large Language Models. Whereas previous language-based planners only consider the feasibility of individual skills, Text2Motion actively resolves geometric dependencies spanning skill sequences by performing geometric feasibility planning during its search. We evaluate our method on a suite of problems that require long-horizon reasoning, interpretation of abstract goals, and handling of partial affordance perception. Our experiments show that Text2Motion can solve these challenging problems with a success rate of 82%, while prior state-of-the-art language-based planning methods only achieve 13%. Text2Motion thus provides promising generalization characteristics to semantically diverse sequential manipulation tasks with geometric dependencies between skills.
Quick Links
Journal: Autonomous Robots, Special Issue on Large Language Models in Robotics, 2023
arXiv: https://arxiv.org/abs/2303.12153
Code: Coming soon.
Text2Motion: At a glance
Problem statement. Large language models (LLMs) can readily convert instructions into high-level plans, but should we trust robots to execute these plans without verifying that they a) satisfy the instructions and b) are feasible in the real world?
Text2Motion is a language-based planner capable of solving long-horizon manipulation problems that require symbolic and geometric reasoning, all from natural language instructions. Planning with Text2Motion is a three-step process:
An LLM is used to infer symbolic goals that a plan must achieve in order to satisfy the human's instruction
A hybrid shooting-based and search-based planner uses the LLM, a library of independently learned skills, and a geometric feasibility planner [1] to compute feasible plans
A feasible plan is executed iff the inferred symbolic goals hold in the dynamics-predicted final state of the plan
Result. When Text2Motion returns a plan, it both a) satisfies the provided natural language instruction and b) is geometrically feasible. This entire process occurs prior to plan execution.
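The three-step pipeline above can be summarized in a few lines of pseudocode. The sketch below is illustrative only: the callables for goal inference, hybrid planning, dynamics prediction, and goal checking are hypothetical stand-ins for the paper's components, not the released API.

```python
from typing import Callable, Optional, Sequence, Set

def text2motion(
    instruction: str,
    scene: str,
    infer_goals: Callable[[str, str], Sequence[str]],        # Step 1: LLM infers symbolic goals
    hybrid_plan: Callable[[str, str, Sequence[str]], Optional[Sequence[str]]],  # Step 2: hybrid planner
    predict_final_state: Callable[[Sequence[str], str], Set[str]],  # learned dynamics rollout
    goal_holds: Callable[[str, Set[str]], bool],              # symbolic goal check
) -> Optional[Sequence[str]]:
    """Return a verified, feasible skill sequence, or None if planning fails."""
    goals = infer_goals(instruction, scene)                   # symbolic goals the plan must reach
    plan = hybrid_plan(instruction, scene, goals)             # feasibility-aware skill sequencing
    if plan is None:
        return None
    final_state = predict_final_state(plan, scene)            # Step 3: verify before execution
    return plan if all(goal_holds(g, final_state) for g in goals) else None
```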
Planning with Text2Motion
Evaluation Tasks
Text2Motion and our language-planning baselines are evaluated on challenging TAMP-like tasks that contain one or more of the following properties:
Long Horizon (LH): Tasks that require executing 6+ consecutive skills to solve
Lifted Goals (LG): Instructions are not specified over concrete object instances
Partial Affordance Perception (PAP): Skill affordances are difficult for the LLM to determine from the textual scene description
Hybrid Planning
(Text2Motion)
Text2Motion synergistically combines shooting- and search-based planning by optimistically invoking Shooting at each iteration, and falling back to a Greedy Search step if Shooting fails.
Our method considers the feasibility of entire skill sequences using STAP [1] and is thus equipped for tasks with long-horizon geometric dependencies.
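As a rough illustration of this loop (hypothetical helper functions, not the released implementation): `shoot` attempts to propose and verify a complete plan extending the current skill prefix, while `greedy_step` returns the single most feasible next skill; both would score geometric feasibility with a planner such as STAP [1].

```python
from typing import Callable, List, Optional

def hybrid_plan(
    shoot: Callable[[List[str]], Optional[List[str]]],  # try to complete the plan in one go
    greedy_step: Callable[[List[str]], Optional[str]],  # best feasible next skill for the prefix
    max_steps: int = 10,
) -> Optional[List[str]]:
    prefix: List[str] = []                               # skills committed so far
    for _ in range(max_steps):
        full_plan = shoot(prefix)                        # optimistically invoke Shooting
        if full_plan is not None:
            return full_plan                             # feasible, goal-reaching plan found
        next_skill = greedy_step(prefix)                 # fall back to one Greedy Search step
        if next_skill is None:
            return None                                  # cannot feasibly extend the prefix
        prefix.append(next_skill)
    return None
```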
Reactive Execution
(SayCan & Inner Monologue)
SayCan [2] and Inner Monologue [3] do not perform explicit look-ahead planning. Instead, they execute the most useful (LLM) and feasible (Value Function) skill at the current timestep.
This myopic strategy suffices for simpler tasks, but the complex tasks considered here demand long-horizon geometric feasibility planning with skills.
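For contrast, a simplified sketch of this reactive strategy (the scoring callables are stand-ins for the LLM usefulness and value-function affordance scores of those systems): at each timestep the skill maximizing usefulness times feasibility is executed immediately, with no look-ahead over the remaining skill sequence.

```python
from typing import Callable, List

def reactive_step(
    candidate_skills: List[str],
    llm_usefulness: Callable[[str], float],    # LLM score: how useful is this skill now?
    value_feasibility: Callable[[str], float], # value function score in the current state
) -> str:
    """Greedily pick the skill with the highest usefulness x feasibility score."""
    return max(
        candidate_skills,
        key=lambda skill: llm_usefulness(skill) * value_feasibility(skill),
    )
```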
Design Features
Search-Based vs. Shooting-Based Planning?
We compare Text2Motion's constituent planners:
Shooting, which queries the LLM only once to generate full skill sequences, and
Greedy Search, which queries the LLM numerous times to generate candidate skills.
Shooting is efficient and can consider a diverse set of plans, while Greedy Search is more reliable when it is difficult to guess feasible plans in "one go."
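The two planners can be contrasted with the sketch below (hypothetical helpers, not the released code; sequence feasibility is assumed to be the geometric feasibility score returned by a planner such as STAP [1] after optimizing the actions of the whole skill sequence).

```python
from typing import Callable, List, Optional

def shooting(
    propose_sequences: Callable[[], List[List[str]]],      # one LLM query -> several full skill sequences
    sequence_feasibility: Callable[[List[str]], float],    # geometric feasibility of a whole sequence
    threshold: float = 0.5,
) -> Optional[List[str]]:
    """Query the LLM once and keep the most feasible full plan, if any clears the threshold."""
    candidates = propose_sequences()
    best = max(candidates, key=sequence_feasibility, default=None)
    if best is None or sequence_feasibility(best) < threshold:
        return None
    return best

def greedy_search_step(
    prefix: List[str],                                      # skills selected so far
    propose_next_skills: Callable[[List[str]], List[str]],  # LLM queried at every step
    sequence_feasibility: Callable[[List[str]], float],
) -> Optional[str]:
    """Pick the next skill whose extension of the prefix remains most feasible."""
    candidates = propose_next_skills(prefix)
    if not candidates:
        return None
    return max(candidates, key=lambda skill: sequence_feasibility(prefix + [skill]))
```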
Task and Motion Planning (TAMP)?
Like TAMP [4], Text2Motion can solve tasks that feature rich symbolic and geometric complexities. We highlight some key differences below.
Text2Motion: Planning in the Real World
Task 1: "How would you pick and place all of the boxes onto the rack?"
Task properties: (Long Horizon)
Text2Motion (Ours)
Text2Motion performs geometric feasibility planning to ensure that earlier rack placements enable successful downstream placements.
Inner Monologue
Inner Monologue reactively executes feasible skills. In this case, the correct place(*, rack) skill gets a low value score under the action produced by an imperfect policy, causing objects to be placed back on the table.
Shooting
Shooting succeeds when high-level plans can be predicted from the instruction and scene description, since it also performs geometric feasibility planning.
Task 4: "How would you put one box onto the rack (hint you may use the hook)?"
Task properties: (Lifted Goals + Partial Affordance Perception)
Text2Motion (Ours)
Text2Motion uses its Greedy Search planner to ensure that pick(hook) is executed in a way that enables the downstream pull(cyan_box, hook) skill.
Inner Monologue
Inner Monologue first executes pick(hook) in a manner that prevents a follow-up pull skill. On second attempt, the agent luckily grasps the hook near the end of the handle, enabling a successful pull skill.
Shooting
No plan was found by Shooting on this PAP task. Despite the hint, the LLM fails to deduce the use of the hook in any predicted skill sequence, and instead attempts to directly pick an object beyond the robot's workspace.
Task 6: "How would you put two primary-colored boxes onto the rack?"
Task properties: (Long Horizon + Lifted Goals + Partial Affordance Perception)
Text2Motion (Ours)
Similar to the above task, Text2Motion is able to contend with long-horizon geometric dependencies even when the instruction contains Lifted Goals (LG).
Inner Monologue
Inner Monologue correctly determines the hook's use for pulling the blue object closer. But this time, it attempts the pull skill instead of re-grasping and fails.
Shooting
No plan was found by Shooting due to the Long Horizon (LH), Lifted Goals (LG), and Partial Affordance Perception (PAP) properties of this challenging task.
Citation
If you found this work interesting, please consider citing:
@article{LinAgiaEtAl2023,
title={Text2Motion: from natural language instructions to feasible plans},
author={Lin, Kevin and Agia, Christopher and Migimatsu, Toki and Pavone, Marco and Bohg, Jeannette},
journal={Autonomous Robots},
year={2023},
month={Nov},
day={14},
issn={1573-7527},
doi={10.1007/s10514-023-10131-7},
url={https://doi.org/10.1007/s10514-023-10131-7}
}
Acknowledgements
References
[1] Agia, C., Migimatsu, T., Wu, J., & Bohg, J. (2023, May). STAP: Sequencing task-agnostic policies. In 2023 IEEE International Conference on Robotics and Automation (ICRA) (pp. 7951-7958). IEEE.
[2] Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., ... & Yan, M. (2022). Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.
[3] Huang, W., Xia, F., Xiao, T., Chan, H., Liang, J., Florence, P., ... & Ichter, B. (2022). Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608.
[4] Garrett, C. R., Chitnis, R., Holladay, R., Kim, B., Silver, T., Kaelbling, L. P., & Lozano-Pérez, T. (2021). Integrated task and motion planning. Annual review of control, robotics, and autonomous systems, 4, 265-293.