Planning and Reasoning with 3D Deformable Objects for Hierarchical Text-to-3D Robotic Shaping
Alison Bartsch¹ Amir Barati Farimani¹
¹Carnegie Mellon University Mechanical Engineering
Deformable object manipulation remains a key challenge in developing autonomous robotic systems that can be successfully deployed in real-world scenarios. In this work, we explore the challenges of deformable object manipulation through the task of sculpting clay into 3D shapes. We propose the first coarse-to-fine autonomous sculpting system, in which the sculpting agent first selects how many discrete chunks of clay to place in the workspace, and where, to create a coarse shape, and then iteratively refines that shape with sequences of deformation actions. We leverage large language models for sub-goal generation, and train a point cloud region-based action model to predict robot actions from the desired point cloud sub-goals. Our method is also the first autonomous sculpting system to realize a real-world text-to-3D shaping pipeline without any explicit 3D goals or sub-goals provided to the system. We demonstrate that our method successfully creates a set of simple shapes solely from text-based prompting. Furthermore, we rigorously explore how best to quantify success for the text-to-3D sculpting task, comparing existing text-image and text-point cloud similarity metrics against human evaluations.
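The coarse-to-fine pipeline described above can be outlined in a short sketch. This is a minimal, hypothetical rendering of the control loop, assuming duck-typed `llm`, `action_model`, `robot`, and `camera` interfaces; none of these names are the paper's actual APIs.

```python
def sculpt(prompt, llm, action_model, robot, camera, max_iters=10):
    """Hypothetical coarse-to-fine text-to-3D sculpting loop (illustrative only)."""
    # Stage 1 (coarse): the LLM decides how many clay chunks to place, and where.
    for placement in llm.plan_coarse_placements(prompt):
        robot.place_clay_chunk(placement)
    # Stage 2 (fine): iterative refinement toward LLM-proposed point cloud sub-goals.
    for _ in range(max_iters):
        observation = camera.capture_point_cloud()
        subgoal = llm.propose_subgoal(prompt, observation)
        if subgoal is None:  # the LLM judges the sculpture complete
            break
        robot.execute(action_model.predict(observation, subgoal))
```

The key structural point is that the LLM appears twice: once as a coarse placement planner and once as a sub-goal generator inside the refinement loop, while the learned action model translates each sub-goal into a robot action.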
The discrete dough placement planner is able to create the general 'X' shape well. The action refinement module then selects a sequence of actions that straightens the arms more uniformly and removes some of the bumps left by the discrete placement of the clay balls.
The discrete dough placement planner generally creates a quality airplane with a fuselage and wings, but the tail region is not very visible. The refinement LLM agent chooses to modify the tail region, making it more pronounced to fix this issue.
The discrete dough placement planner creates a line with some curves and variations in the width. The refinement LLM agent chooses modifications to straighten the line and improve uniformity.
The human oracle is required to follow the same process of coarse-to-fine sculpting using their hands. The choice of camera orientation for each shape was to best visualize the full sculpture (i.e. top-down versus isometric viewpoint).
By semantically tuning the prompt, our proposed system is able to adequately adapt the final sculpture it creates.
A visualization of the sculpting sequence for our proposed text-to-3D shaping method. Our pipeline first creates a coarse shape in the scene with discrete chunks of clay, and then iteratively refines the shape with deformation-based actions.
The point cloud processing pipeline first captures a dense point cloud of the robot's workspace (a); with position and color thresholding we isolate the clay point cloud (b); the point cloud is then clustered into 10 regional geometric patches (c); and finally it is uniformly down-sampled so that each cluster contains an equal number of points.
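The clustering and down-sampling steps of the processing pipeline can be sketched as follows. This is a simplified sketch, not the paper's implementation: it assumes k-means for the regional clustering and uniform random down-sampling, and the function names (`kmeans_labels`, `cluster_and_downsample`) are placeholders.

```python
import numpy as np

def kmeans_labels(points, k, iters=25, seed=0):
    """Plain k-means over 3D points; returns a cluster label per point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def cluster_and_downsample(points, n_clusters=10, points_per_cluster=64, seed=0):
    """Cluster the clay cloud into regional patches, then uniformly down-sample
    so every patch contains the same number of points."""
    labels = kmeans_labels(points, n_clusters, seed=seed)
    rng = np.random.default_rng(seed)
    patches = []
    for j in range(n_clusters):
        cluster = points[labels == j]
        if len(cluster) == 0:   # degenerate empty cluster: fall back to full cloud
            cluster = points
        replace = len(cluster) < points_per_cluster
        patches.append(cluster[rng.choice(len(cluster), points_per_cluster, replace=replace)])
    return np.stack(patches)    # shape: (n_clusters, points_per_cluster, 3)
```

Equal-sized patches matter downstream: the action model consumes a fixed-shape `(10, points_per_cluster, 3)` tensor, so every cluster must contribute the same point count.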
a) The full direct action model pipeline, showcasing the cluster-based observation space, the siamese-style PointNet embedding network, and the action network to predict grasp actions from a latent observation embedding. b) The synthetic pre-training pipeline with the objective to predict the weighted CD and EMD difference between clusters. c) The real-world action model training pipeline in which the weights of the point cloud embedding module are frozen.
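The synthetic pre-training objective in (b) is built on cluster-wise shape distances. A minimal sketch of the Chamfer distance (CD) term is below; the paper's weighting scheme and the EMD term are omitted, and `per_cluster_cd_targets` is a hypothetical stand-in for how per-patch targets could be formed.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances
    return float(dists.min(axis=1).mean() + dists.min(axis=0).mean())

def per_cluster_cd_targets(state_clusters, goal_clusters):
    """One CD value per regional patch: an illustrative stand-in for the
    per-cluster regression targets used during synthetic pre-training."""
    return np.array([chamfer_distance(s, g)
                     for s, g in zip(state_clusters, goal_clusters)])
```

Predicting these per-cluster distances forces the frozen point cloud embedding to encode local geometric differences between state and goal, which is what the action network later relies on.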
Scatter plot with line of best fit for the CLIP and PointCLIP-v2 cosine similarity of text and image/point cloud embeddings of 10 human trajectories creating each shape in clay. For each shape and prompt, the slope of the line of best fit indicates how well the CLIP or PointCLIP-v2 score correlates with the human oracle-created shapes across varying prompts.
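The two quantities plotted above are simple to compute once the embeddings exist. A minimal sketch, assuming the text and image/point cloud embeddings are already-extracted vectors (the embedding models themselves are not shown):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between a text embedding and an image/point cloud embedding."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def best_fit_slope(x, y):
    """Slope of the least-squares line through (x, y) pairs, e.g. trajectory
    progress vs. similarity score."""
    slope, _intercept = np.polyfit(x, y, 1)
    return float(slope)
```

A strongly positive slope would mean the similarity metric rises as the human-sculpted shape approaches the prompt, i.e. the metric tracks sculpting progress.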
@article{bartsch2024,
  title={Planning and Reasoning with 3D Deformable Objects for Hierarchical Text-to-3D Robotic Shaping},
  author={Bartsch, Alison and Farimani, Amir Barati},
  journal={arXiv preprint arXiv:2412.01765},
  year={2024}
}