PLATO: Planning with LLMs and Affordances for Tool Manipulation
Arvind Car¹, Sai Sravan Yarlagadda¹, Alison Bartsch¹, Abraham George¹, Amir Barati Farimani¹
¹Department of Mechanical Engineering, Carnegie Mellon University
As robotic systems become increasingly integrated into complex real-world environments, there is a growing need for approaches that enable robots to understand and act upon natural language instructions without relying on extensive pre-programmed knowledge of their surroundings. This paper presents PLATO, an innovative system that addresses this challenge by leveraging specialized large language model agents to process natural language inputs, understand the environment, predict tool affordances, and generate executable actions for robotic systems. Unlike traditional systems that depend on hard-coded environmental information, PLATO employs a modular architecture of specialized agents and operates without any initial knowledge of the environment. These agents identify objects and their locations within the scene, generate a comprehensive high-level plan, translate this plan into a series of low-level actions, and verify the completion of each step. The system is evaluated in particular on challenging tool-use tasks, which involve handling diverse objects and require long-horizon planning. PLATO’s design allows it to adapt to dynamic and unstructured settings, significantly enhancing its flexibility and robustness. By evaluating the system across a variety of complex scenarios, we demonstrate its capability to tackle a diverse range of tasks and offer a novel solution for integrating LLMs with robotic platforms, advancing the state of the art in autonomous robotic task execution.
Overall Pipeline
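The abstract describes four agent roles: scene understanding, high-level planning, step planning, and verification. Below is a minimal sketch of how these roles could be chained; the function names, data structures, and control flow are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of a PLATO-style agent pipeline. Each "agent" is passed in
# as a plain callable; in the real system these would be separate LLM calls and a
# vision model. None of these names are taken from the paper's codebase.

from dataclasses import dataclass
from typing import Callable

@dataclass
class SceneObject:
    name: str
    position: tuple  # (x, y, z) in the robot base frame

def run_pipeline(
    instruction: str,
    detect_objects: Callable[[], list],                 # vision agent (e.g. SAM-based)
    plan_subtasks: Callable[[str, list], list],         # high-level planner agent
    plan_steps: Callable[[str, list], list],            # step planner agent
    execute: Callable[[dict], None],                    # robot controller
    verify: Callable[[str], bool],                      # verification agent
) -> bool:
    objects = detect_objects()                          # 1. identify objects and locations
    subtasks = plan_subtasks(instruction, objects)      # 2. generate the high-level plan
    for subtask in subtasks:
        for action in plan_steps(subtask, objects):     # 3. low-level actions (go-to, rotate, grasp)
            execute(action)
        if not verify(subtask):                         # 4. verify completion of each step
            return False                                # replan or report failure
    return True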
Tool Affordance Model
The tool affordance model contains an LLM agent tasked with matching query tools (seen at runtime) to database tools (which have predefined graspable regions). This correspondence allows the affordance model to map the database mask (the graspable region on the database tool) onto the corresponding region of the query tool; a rough sketch of this mapping appears after the tool lists below.
Query tools (seen at runtime): Flattener, Scoop, Whisk, Fork
Database tools (predefined graspable regions): Pan, Spoon, Chisel, Hammer, Screwdriver, Marker, Pliers
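As a rough illustration of the mask transfer described above: once the LLM agent has matched a query tool to a database tool, the database tool's graspable-region mask must be mapped onto the query tool. The sketch below does this with a keypoint-based homography in OpenCV; the helper name, the keypoints, and the use of a homography are assumptions for illustration rather than the paper's exact method.

# Hedged sketch: transfer a graspable-region mask from a database tool image to a
# query tool image using corresponding keypoints. The homography here is an
# illustrative assumption; the paper's affordance model may map regions differently.

import numpy as np
import cv2

def transfer_grasp_mask(db_mask: np.ndarray,
                        db_keypoints: np.ndarray,     # (N, 2) points on the database tool
                        query_keypoints: np.ndarray,  # (N, 2) matching points on the query tool
                        query_shape: tuple) -> np.ndarray:
    """Warp the database tool's graspable-region mask onto the query tool image."""
    H, _ = cv2.findHomography(db_keypoints, query_keypoints, cv2.RANSAC)
    query_mask = cv2.warpPerspective(db_mask.astype(np.uint8), H,
                                     (query_shape[1], query_shape[0]))
    return query_mask > 0  # boolean mask of the graspable region on the query tool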
Single-Task Grasping
These trials verify the functioning of the SAM vision module and the spatial reasoning capabilities of the framework. Objects and their positions were varied between trials.
"Place the plastic strawberry next to the plastic broccoli"
"Pack a lunch of toy broccoli and red sausages"
Single-Task Tool Use
These trials test the ability of the framework to decompose high-level actions into low-level robot commands. Object positions were varied between trials.
"Flatten the pink ball of dough using a flattening tool"
"Scoop up the candy"
"Perform a whisking action inside the bowl"
Multi-Task Tool Use
These trials test the ability of the framework to generate compound action sequences and execute them. Object positions were varied between trials. An example decomposition is sketched after the prompts below.
"Scoop the candy and put it inside the bowl"
"Flatten the dough and poke holes in it"
"Flatten the dough and pour candy onto it"
Limitations
The vision module is sensitive to the exact prompt used: "candy pile" works much better than just "candy".
Since the affordance model operates zero-shot, it is prone to choosing poor grasps; it performs much better in one-shot settings.
The Step Planner's discrete action space (go-to, rotate, grasp) makes it difficult to represent complex commands like "whisk."
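To make the last point concrete, one workaround is to approximate a continuous motion such as whisking with a dense sequence of go-to waypoints. The sketch below traces circular waypoints inside the bowl; the primitive name, waypoint scheme, and dimensions are illustrative assumptions rather than the system's actual behavior.

# Illustrative approximation of a "whisk" motion using only discrete go-to primitives:
# visit waypoints on a circle inside the bowl. Names and parameters are hypothetical.

import math

def whisk_waypoints(bowl_center, radius=0.03, height=0.05, n_points=12, n_loops=3):
    """Return go-to actions tracing circles around the bowl center (units: meters)."""
    cx, cy, cz = bowl_center
    actions = []
    for _ in range(n_loops):
        for i in range(n_points):
            theta = 2 * math.pi * i / n_points
            actions.append({
                "primitive": "go-to",
                "position": (cx + radius * math.cos(theta),
                             cy + radius * math.sin(theta),
                             cz + height),
            })
    return actions

# Example: 3 loops of 12 waypoints around a bowl centered at (0.5, 0.0, 0.02).
whisk_plan = whisk_waypoints((0.5, 0.0, 0.02))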
BibTeX
@misc{car2024platoplanningllmsaffordances,
      title={PLATO: Planning with LLMs and Affordances for Tool Manipulation},
      author={Arvind Car and Sai Sravan Yarlagadda and Alison Bartsch and Abraham George and Amir Barati Farimani},
      year={2024},
      eprint={2409.11580},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2409.11580},
}