LLM-Craft: Robotic Crafting of Elasto-Plastic Objects with Large Language Models
Alison Bartsch¹ Amir Barati Farimani¹
¹Carnegie Mellon University Mechanical Engineering
NOTE: the letter goals are defined in the robot coordinate frame, meaning they are flipped with respect to this video frame, so the non-symmetrical shapes appear upside down and mirrored about the y-axis.
When humans create sculptures, we are able to reason about how we need to geometrically alter the clay state to reach our target goal. We are not computing point-wise similarity metrics or reasoning about the low-level positioning of our tools, but instead determining the higher-level changes that need to be made. In this work, we propose LLM-Craft, a novel pipeline that leverages large language models (LLMs) to iteratively reason about and generate deformation-based crafting action sequences. We simplify and couple the state and action representations to further encourage shape-based reasoning. To the best of our knowledge, LLM-Craft is the first system to successfully leverage LLMs for complex deformable object interactions. Through our experiments, we demonstrate that with the LLM-Craft framework, LLMs are able to successfully create a set of simple letter shapes. We explore a variety of rollout strategies, and compare the performance of LLM-Craft variants with and without an explicit goal shape image.
The LLM-Craft system captures a top-down image of the clay with a wrist-mounted camera as the state observation. A grid is overlaid on the image to delineate the discrete regions of the clay, giving the LLM a reference frame in which to reason about where to grasp. The LLM is then prompted with the gridded state and goal images as well as an action prompt. The LLM selects a sequence of grasps to apply to the clay, and the robot executes the first one. A new state observation is collected and passed back to the LLM along with the goal image and the termination prompt to determine if the goal has been reached. If it has not, the LLM is queried again with the state and goal images as well as the action prompt. This process repeats until the goal has been reached.
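To make the control flow concrete, the loop below sketches this pipeline in Python. All helper names (capture_topdown_image, overlay_grid, query_llm, execute_grasp) are hypothetical placeholders standing in for the camera, grid-overlay, LLM, and robot interfaces; this is a sketch of the loop described above, not the authors' implementation.

```python
# Sketch of the LLM-Craft iterative crafting loop. The helpers
# capture_topdown_image, overlay_grid, query_llm, and execute_grasp
# are hypothetical placeholders, not the authors' API.

MAX_STEPS = 10  # assumed cap on crafting iterations


def craft(goal_image, action_prompt, termination_prompt):
    goal = overlay_grid(goal_image)                # gridded goal image
    state = overlay_grid(capture_topdown_image())  # gridded state observation
    for _ in range(MAX_STEPS):
        # Query the LLM for a sequence of grasps, but execute only the
        # first one so later grasps are replanned from fresh observations.
        grasps = query_llm(images=[state, goal], prompt=action_prompt)
        execute_grasp(grasps[0])
        # Re-observe the clay and ask the LLM whether the goal is reached.
        state = overlay_grid(capture_topdown_image())
        reply = query_llm(images=[state, goal], prompt=termination_prompt)
        if reply.strip().lower().startswith("yes"):
            break
```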
A visualization of different method deployment strategies. We present chain of thought as our iterative base method, while self-consistency (SC) and tree of thought (ToT) are rollout strategies that require multiple LLM queries per action trajectory generation.
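As an illustration of the SC strategy, one can sample several action trajectories from the LLM and keep the first grasp that the samples most often agree on. The sketch below reuses the hypothetical query_llm helper from above and assumes each grasp is a hashable grid-region descriptor; it is one plausible reading of self-consistency, not the authors' implementation.

```python
from collections import Counter

N_SAMPLES = 5  # assumed number of independent LLM queries per step


def self_consistent_grasp(state, goal, action_prompt):
    # Sample several candidate grasp sequences from the LLM.
    proposals = [query_llm(images=[state, goal], prompt=action_prompt)
                 for _ in range(N_SAMPLES)]
    # Majority-vote over the first grasp of each sampled sequence.
    votes = Counter(proposal[0] for proposal in proposals)
    return votes.most_common(1)[0][0]
```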
We compare various LLM rollout strategies (no replanning, iterative, SC, ToT), a variant that controls the squeeze strength, semantic sculpting versus an explicit ground-truth goal image, and a human oracle.
The confusion matrix of the human evaluations for all methods. A score of 1.0 indicates all human respondents classified the clay image as that letter.
We compare the performance of LLM-Craft to a human baseline across variable grid sizes for the 'X' shape task.
The performance of the long-horizon, iterative system on the X shape task as we remove different components of the prompt. For full details of the prompt variations, please see the Supplemental Materials. For each prompt variation, we conducted 5 hardware experimental runs. We report the mean human quality rating on a scale from 1 to 10, with the black bar indicating the standard deviation across experiments.
Given a semantic goal, i.e. a qualitative description of how to change the state without an explicit goal image, LLM-Craft is able to successfully modify the starting state accordingly.
For every current and target shape, the LLM assigns a semantically meaningful label to describe the rough shape. For example, in the first row the goal shape is called a dolphin and the right side of the clay is referred to as the tail. While these shape descriptors are very specific, the general shapes of the clay do roughly match, and the reasoning about where to interact with the clay is consistent as well.
@article{bartsch2024llmcraft,
  title={LLM-Craft: Robotic Crafting of Elasto-Plastic Objects with Large Language Models},
  author={Bartsch, Alison and Barati Farimani, Amir},
  journal={arXiv preprint arXiv:2406.08648},
  year={2024}
}