Text2Motion: From Natural Language Instructions to Feasible Plans

Abstract

We propose Text2Motion, a language-based planning framework enabling robots to solve sequential manipulation tasks that require long-horizon reasoning. Given a natural language instruction, our framework constructs both a task- and motion-level plan that is verified to reach inferred symbolic goals. Text2Motion uses feasibility heuristics encoded in Q-functions of a library of skills to guide task planning with Large Language Models. Whereas previous language-based planners only consider the feasibility of individual skills, Text2Motion actively resolves geometric dependencies spanning skill sequences by performing geometric feasibility planning during its search. We evaluate our method on a suite of problems that require long-horizon reasoning, interpretation of abstract goals, and handling of partial affordance perception. Our experiments show that Text2Motion can solve these challenging problems with a success rate of 82%, while prior state-of-the-art language-based planning methods only achieve 13%. Text2Motion thus provides promising generalization characteristics to semantically diverse sequential manipulation tasks with geometric dependencies between skills.

Quick Links

Journal (link): Autonomous Robots. Special Issue: Large Language Models in Robotics, 2023

arXiv: https://arxiv.org/abs/2303.12153

Code: Coming soon.

Related research:

text2motion-supplementary.mp4

Text2Motion: At a glance

Problem statement. Large language models (LLMs) can readily convert instructions into high-level plans, but should we trust robots to execute these plans without verifying that they a) satisfy the instructions and b) are feasible in the real world?

Text2Motion is a language-based planner capable of solving long-horizon manipulation problems that require symbolic and geometric reasoning, all from natural language instructions. Planning with Text2Motion is a three-step process: an LLM first infers symbolic goals from the instruction and scene description; the planner then searches for a skill sequence that reaches those goals, guided by feasibility heuristics encoded in the skills' Q-functions; and geometric feasibility planning verifies that the skill sequence can be executed in the real world.

Result. When Text2Motion returns a plan, it both a) satisfies the provided natural language instruction and b) is geometrically feasible. This entire process occurs prior to plan execution.
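To make this contract concrete, below is a minimal Python sketch of the plan-before-execute loop. It is an illustration only, not the released API: the helper callables (infer_goal, plan_to_goal, execute) are hypothetical stand-ins for the LLM goal inference, the hybrid planner, and the robot executor described on this page.

```python
from typing import Callable, List, Optional

# Hypothetical type aliases for illustration only.
Skill = str             # e.g. "pick(hook)" or "place(red_box, rack)"
Plan = List[Skill]

def plan_then_execute(
    instruction: str,
    scene: str,
    infer_goal: Callable[[str, str], str],                    # LLM: (instruction, scene) -> symbolic goal
    plan_to_goal: Callable[[str, str, str], Optional[Plan]],  # planner: verified, feasible plan or None
    execute: Callable[[Plan], None],                          # robot executor
) -> Optional[Plan]:
    """Only a plan that is verified to reach the inferred goal is ever executed."""
    goal = infer_goal(instruction, scene)          # 1) infer symbolic goal(s) from language
    plan = plan_to_goal(instruction, scene, goal)  # 2) plan a geometrically feasible skill sequence
    if plan is None:
        return None                                # no verified plan -> nothing is executed
    execute(plan)                                  # 3) execution happens only after verification
    return plan
```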

Planning with Text2Motion

Evaluation Tasks

Text2Motion and our language-planning baselines are evaluated on challenging TAMP-like tasks that contain one or more of the following properties: Long Horizon (LH), Lifted Goals (LG), and Partial Affordance Perception (PAP).

Hybrid Planning
(Text2Motion)

Text2Motion synergistically combines shooting- and search-based planning by optimistically invoking Shooting at each iteration, and falling back to a Greedy Search step if Shooting fails. 

Our method considers the feasibility of entire skill sequences using STAP [1] and is thus equipped for tasks with long-horizon geometric dependencies.
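A minimal sketch of this loop is shown below, with hypothetical shoot, greedy_step, and reaches_goal callables; in Text2Motion, Shooting and Greedy Search are guided by LLM scores and STAP value functions, but none of the names here come from the released code.

```python
from typing import Callable, List, Optional

Skill = str
Plan = List[Skill]

def hybrid_plan(
    goal: str,
    shoot: Callable[[str, Plan], Optional[Plan]],         # propose & verify a full completion of the prefix
    greedy_step: Callable[[str, Plan], Optional[Skill]],  # best next skill under LLM score x value function
    reaches_goal: Callable[[Plan, str], bool],            # symbolic goal check
    max_iters: int = 10,
) -> Optional[Plan]:
    """Optimistically try Shooting each iteration; fall back to one Greedy Search step if it fails."""
    prefix: Plan = []
    for _ in range(max_iters):
        suffix = shoot(goal, prefix)            # try to complete the plan in "one go"
        if suffix is not None:
            return prefix + suffix              # Shooting found a geometrically feasible completion
        next_skill = greedy_step(goal, prefix)  # otherwise, commit to the best single next skill
        if next_skill is None:
            return None                         # no feasible skill extends the current prefix
        prefix.append(next_skill)
        if reaches_goal(prefix, goal):
            return prefix
    return None
```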

Reactive Execution
(SayCan & Inner Monologue)

SayCan [2] and Inner Monologue [3] do not perform explicit look-ahead planning. Instead, at each timestep they execute the skill deemed most useful by the LLM and most feasible by the value function.

This myopic strategy suffices for simpler tasks, but the complex tasks considered here demand long-horizon geometric feasibility planning with skills.
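For contrast, here is a minimal sketch of this reactive strategy, assuming hypothetical llm_usefulness and value_fn callables (the usefulness-times-feasibility product follows the SayCan formulation); it selects a single skill per timestep and performs no look-ahead.

```python
from typing import Callable, Sequence

Skill = str

def reactive_step(
    instruction: str,
    observation: object,                            # current observation of the scene
    candidate_skills: Sequence[Skill],              # assumed non-empty
    llm_usefulness: Callable[[str, Skill], float],  # how useful the LLM deems the skill
    value_fn: Callable[[object, Skill], float],     # how feasible the skill looks right now
) -> Skill:
    """Execute the skill maximizing usefulness x feasibility at the current timestep.

    There is no look-ahead over future skills, which is why this strategy can fail
    on tasks with long-horizon geometric dependencies.
    """
    return max(
        candidate_skills,
        key=lambda skill: llm_usefulness(instruction, skill) * value_fn(observation, skill),
    )
```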

Design Features

Search-Based vs. Shooting-Based Planning?
We compare Text2Motion's constituent planners:

Shooting is efficient and can consider a diverse set of plans, while Greedy Search is more reliable when it is difficult to guess feasible plans in "one go." 
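To make the trade-off concrete, here is a minimal sketch of the two planners in isolation. The callables (propose_full_plans, sequence_feasibility, llm_usefulness, prefix_feasibility) and the thresholds are hypothetical placeholders; in Text2Motion, the feasibility scores come from STAP-style value functions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Skill = str
Plan = List[Skill]

def shooting(
    goal: str,
    propose_full_plans: Callable[[str, int], Sequence[Plan]],  # LLM guesses complete skill sequences
    sequence_feasibility: Callable[[Plan], float],             # whole-sequence feasibility score
    num_samples: int = 5,
    threshold: float = 0.5,
) -> Optional[Plan]:
    """Guess full plans in one go and keep the most feasible one, if any clears the bar."""
    candidates = list(propose_full_plans(goal, num_samples))
    if not candidates:
        return None
    best = max(candidates, key=sequence_feasibility)
    return best if sequence_feasibility(best) >= threshold else None

def greedy_step(
    goal: str,
    prefix: Plan,
    candidate_skills: Sequence[Skill],
    llm_usefulness: Callable[[str, Plan, Skill], float],  # LLM: how useful is this skill next?
    prefix_feasibility: Callable[[Plan], float],          # feasibility of the extended prefix
    min_score: float = 0.1,
) -> Optional[Skill]:
    """Commit to one skill at a time; slower, but robust when full plans are hard to guess."""
    scored: List[Tuple[float, Skill]] = [
        (llm_usefulness(goal, prefix, skill) * prefix_feasibility(prefix + [skill]), skill)
        for skill in candidate_skills
    ]
    if not scored:
        return None
    best_score, best_skill = max(scored, key=lambda pair: pair[0])
    return best_skill if best_score >= min_score else None
```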

Task and Motion Planning (TAMP)?
Like TAMP [4], Text2Motion can solve tasks that feature rich symbolic and geometric complexities. We highlight some key differences below.

Text2Motion: Planning in the Real World

Task 1: "How would you pick and place all of the boxes onto the rack?" 
Task properties: (Long Horizon)

3box-t2m.mp4

Text2Motion (Ours)

Text2Motion performs geometric feasibility planning to ensure that earlier rack placements enable successful downstream placements.

3box-IM.mp4

Inner Monologue

Inner Monologue reactively executes feasible skills. In this case, the correct place(*, rack) skill gets a low value score under the action produced by an imperfect policy, causing objects to be placed back on the table. 

3box-H.mp4

Shooting

Shooting succeeds when high-level plans can be predicted from the instruction and scene description, since it also performs geometric feasibility planning.

Task 4: "How would you put one box onto the rack (hint you may use the hook)?" 
Task properties: (Lifted Goals + Partial Affordance Perception)

1box-t2m.mp4

Text2Motion (Ours)

Text2Motion uses its Greedy Search planner to ensure that pick(hook) is executed in a way that enables the downstream pull(cyan_box, hook) skill.

1box-IM.mp4

Inner Monologue

Inner Monologue first executes pick(hook) in a manner that prevents a follow-up pull skill. On its second attempt, the agent luckily grasps the hook near the end of the handle, enabling a successful pull skill.

Shooting

No plan was found by Shooting (PAP task). Despite the hint, the LLM fails to deduce the use of the hook in any predicted skill sequence, and instead attempts to directly pick an object beyond the robot's workspace.

Task 6: "How would you put two primary-colored boxes onto the rack?" 
Task properties: (Long Horizon + Lifted Goals + Partial Affordance Perception)

2color-t2m.mp4

Text2Motion (Ours)

Similar to the above task, Text2Motion is able to contend with long-horizon geometric dependencies even when the instruction contains Lifted Goals (LG).

2color-IM.mp4

Inner Monologue

Inner Monologue correctly determines that the hook can be used to pull the blue object closer, but this time it attempts the pull skill instead of re-grasping and fails.

Shooting

No plan was found by Shooting due to the Long Horizon (LH), Lifted Goals (LG), and Partial Affordance Perception (PAP) properties of this challenging task.

Citation

If you found this work interesting, please consider citing:

@article{Lin2023,
  title={Text2Motion: from natural language instructions to feasible plans},
  author={Lin, Kevin and Agia, Christopher and Migimatsu, Toki and Pavone, Marco and Bohg, Jeannette},
  journal={Autonomous Robots},
  year={2023},
  month={Nov},
  day={14},
  issn={1573-7527},
  doi={10.1007/s10514-023-10131-7},
  url={https://doi.org/10.1007/s10514-023-10131-7}
}

Acknowledgements

References

[1] Agia, C., Migimatsu, T., Wu, J., & Bohg, J. (2022). STAP: Sequencing Task-Agnostic Policies. arXiv preprint arXiv:2210.12250. 

[2] Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., ... & Yan, M. (2022). Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.

[3] Huang, W., Xia, F., Xiao, T., Chan, H., Liang, J., Florence, P., ... & Ichter, B. (2022). Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608.

[4] Garrett, C. R., Chitnis, R., Holladay, R., Kim, B., Silver, T., Kaelbling, L. P., & Lozano-Pérez, T. (2021). Integrated task and motion planning. Annual Review of Control, Robotics, and Autonomous Systems, 4, 265-293.