PLATO: Planning with LLMs and Affordances for Tool Manipulation
Arvind Car¹, Sai Sravan Yarlagadda¹, Alison Bartsch¹, Abraham George¹, Amir Barati Farimani¹
¹Department of Mechanical Engineering, Carnegie Mellon University
As robotic systems become increasingly integrated into complex real-world environments, there is a growing need for approaches that enable robots to understand and act upon natural language instructions without relying on extensive pre-programmed knowledge of their surroundings. This paper presents PLATO, an innovative system that addresses this challenge by leveraging specialized large language model agents to process natural language inputs, understand the environment, predict tool affordances, and generate executable actions for robotic systems. Unlike traditional systems that depend on hard-coded environmental information, PLATO employs a modular architecture of specialized agents and operates without any initial knowledge of the environment. These agents identify objects and their locations within the scene, generate a comprehensive high-level plan, translate this plan into a series of low-level actions, and verify the completion of each step. The system is evaluated in particular on challenging tool-use tasks, which involve handling diverse objects and require long-horizon planning. PLATO’s design allows it to adapt to dynamic and unstructured settings, significantly enhancing its flexibility and robustness. By evaluating the system across a variety of complex scenarios, we demonstrate its capability to tackle a diverse range of tasks and offer a novel solution for integrating LLMs with robotic platforms, advancing the state of the art in autonomous robotic task execution.
Overall Pipeline
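The abstract describes four agent roles: scene understanding, high-level planning, step planning, and verification. Below is a minimal sketch of how these roles could be chained; the function names, data structures, and control flow are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of a PLATO-style agent pipeline. Each "agent" is passed in
# as a plain callable; in the real system these would be separate LLM calls and a
# vision model. None of these names are taken from the paper's codebase.

from dataclasses import dataclass
from typing import Callable

@dataclass
class SceneObject:
    name: str
    position: tuple  # (x, y, z) in the robot base frame

def run_pipeline(
    instruction: str,
    detect_objects: Callable[[], list],                 # vision agent (e.g. SAM-based)
    plan_subtasks: Callable[[str, list], list],         # high-level planner agent
    plan_steps: Callable[[str, list], list],            # step planner agent
    execute: Callable[[dict], None],                    # robot controller
    verify: Callable[[str], bool],                      # verification agent
) -> bool:
    objects = detect_objects()                          # 1. identify objects and locations
    subtasks = plan_subtasks(instruction, objects)      # 2. generate the high-level plan
    for subtask in subtasks:
        for action in plan_steps(subtask, objects):     # 3. low-level actions (go-to, rotate, grasp)
            execute(action)
        if not verify(subtask):                         # 4. verify completion of each step
            return False                                # replan or report failure
    return True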
Tool Affordance Model
The tool affordance model contains an LLM agent tasked with matching query tools (seen at runtime) to database tools (which have predefined graspable regions). This correspondence allows the affordance model to map the database mask (the graspable region on the database tool) onto the corresponding region of the query tool; a rough sketch of this mapping appears after the tool lists below.
Query tools (seen at runtime): Flattener, Scoop, Whisk, Fork
Database tools (predefined graspable regions): Pan, Spoon, Chisel, Hammer, Screwdriver, Marker, Pliers
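As a rough illustration of the mask transfer described above: once the LLM agent has matched a query tool to a database tool, the database tool's graspable-region mask must be mapped onto the query tool. The sketch below does this with a keypoint-based homography in OpenCV; the helper name, the keypoints, and the use of a homography are assumptions for illustration rather than the paper's exact method.

# Hedged sketch: transfer a graspable-region mask from a database tool image to a
# query tool image using corresponding keypoints. The homography here is an
# illustrative assumption; the paper's affordance model may map regions differently.

import numpy as np
import cv2

def transfer_grasp_mask(db_mask: np.ndarray,
                        db_keypoints: np.ndarray,     # (N, 2) points on the database tool
                        query_keypoints: np.ndarray,  # (N, 2) matching points on the query tool
                        query_shape: tuple) -> np.ndarray:
    """Warp the database tool's graspable-region mask onto the query tool image."""
    H, _ = cv2.findHomography(db_keypoints, query_keypoints, cv2.RANSAC)
    query_mask = cv2.warpPerspective(db_mask.astype(np.uint8), H,
                                     (query_shape[1], query_shape[0]))
    return query_mask > 0  # boolean mask of the graspable region on the query tool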
Single-Task Grasping
These trials verify the functioning of the SAM vision module and the spatial reasoning capabilities of the framework. Objects and their positions were varied between trials.
"Place the plastic strawberry next to the plastic broccoli"
"Pack a lunch of toy broccoli and red sausages"
Single-Task Tool Use
These trials test the ability of the framework to decompose high-level actions into low-level robot commands. Object positions were varied between trials.
"Flatten the pink ball of dough using a flattening tool"
"Scoop up the candy"
"Perform a whisking action inside the bowl"
Multi-Task Tool Use
These trials test the ability of the framework to generate compound action sequences and execute them. Object positions were varied between trials. An example decomposition is sketched after the prompts below.
"Scoop the candy and put it inside the bowl"
"Flatten the dough and poke holes in it"
"Flatten the dough and pour candy onto it"
Limitations
The vision module is sensitive to the exact prompt used: "candy pile" works much better than just "candy".
Since the affordance model operates zero-shot, it is prone to choosing poor grasps; it performs much better in one-shot settings.
The Step Planner's discrete action space (go-to, rotate, grasp) makes it difficult to represent complex commands like "whisk."
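To make the last point concrete, one workaround is to approximate a continuous motion such as whisking with a dense sequence of go-to waypoints. The sketch below traces circular waypoints inside the bowl; the primitive name, waypoint scheme, and dimensions are illustrative assumptions rather than the system's actual behavior.

# Illustrative approximation of a "whisk" motion using only discrete go-to primitives:
# visit waypoints on a circle inside the bowl. Names and parameters are hypothetical.

import math

def whisk_waypoints(bowl_center, radius=0.03, height=0.05, n_points=12, n_loops=3):
    """Return go-to actions tracing circles around the bowl center (units: meters)."""
    cx, cy, cz = bowl_center
    actions = []
    for _ in range(n_loops):
        for i in range(n_points):
            theta = 2 * math.pi * i / n_points
            actions.append({
                "primitive": "go-to",
                "position": (cx + radius * math.cos(theta),
                             cy + radius * math.sin(theta),
                             cz + height),
            })
    return actions

# Example: 3 loops of 12 waypoints around a bowl centered at (0.5, 0.0, 0.02).
whisk_plan = whisk_waypoints((0.5, 0.0, 0.02))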
BibTeX
@misc{car2024platoplanningllmsaffordances,
      title={PLATO: Planning with LLMs and Affordances for Tool Manipulation},
      author={Arvind Car and Sai Sravan Yarlagadda and Alison Bartsch and Abraham George and Amir Barati Farimani},
      year={2024},
      eprint={2409.11580},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2409.11580},
}