SculptBot: Pre-Trained Models for 3D Deformable Object Manipulation

Alison Bartsch¹ Charlotte Avra¹ Amir Barati Farimani¹

¹Carnegie Mellon University Mechanical Engineering

[Paper]

Abstract

Deformable object manipulation presents a unique set of challenges in robotic manipulation because deformable objects exhibit high degrees of freedom and severe self-occlusion. State representation for materials that exhibit plastic behavior, like modeling clay or bread dough, is also difficult because they permanently deform under stress and are constantly changing shape. In this work, we investigate each of these challenges using the task of robotic sculpting with a parallel gripper. We propose a system that uses point clouds as the state representation and leverages a pre-trained point cloud reconstruction Transformer to learn a latent dynamics model that predicts material deformations given a grasp action. We design a novel action sampling algorithm that reasons about geometrical differences between point clouds to further improve the efficiency of model-based planners. All data are collected, and all experiments are conducted, entirely in the real world. Our experiments show that the proposed system successfully captures the dynamics of clay and can create a variety of simple shapes.

Target Shape: X

Target Shape: Cylinder

Target Shape: Square

Target Shape: Line

Dynamics Prediction Pipeline

The full dynamics prediction pipeline. We first use farthest point sampling and k-nearest neighbors to group the original point cloud into 64 clusters. The centroids of these 64 clusters form a much smaller and sparser point cloud. We then use the pre-trained dVAE from Point-BERT to tokenize each cluster. The centroid point cloud is passed through a simple physics-based dynamics approximator to predict the next-state centroid point cloud given the grasp action. This predicted next-state centroid point cloud is passed to the point token predictor dynamics model along with the tokens of the current-state clusters and the grasp action. The point token predictor predicts a token for each next-state centroid; each token represents the geometrical structure of the points within that region of the cloud. The predicted tokens and the predicted centroid point cloud are then passed through the dVAE decoder to reconstruct the full, dense predicted next-state point cloud.
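The grouping step at the start of the pipeline can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `farthest_point_sampling` and `group_knn` are hypothetical, and the group size `k=32` is an assumption (the page only states that there are 64 clusters).

```python
import numpy as np


def farthest_point_sampling(points, n_samples, seed=0):
    """Pick n_samples indices so that the chosen points are spread out.

    points: (N, 3) array. Returns an (n_samples,) array of indices.
    """
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    selected = np.empty(n_samples, dtype=np.int64)
    selected[0] = rng.integers(n)          # arbitrary starting point
    min_dist = np.full(n, np.inf)          # squared distance to nearest chosen point
    for i in range(1, n_samples):
        diff = points - points[selected[i - 1]]
        dist = np.einsum("ij,ij->i", diff, diff)
        min_dist = np.minimum(min_dist, dist)
        selected[i] = int(np.argmax(min_dist))  # farthest from everything chosen so far
    return selected


def group_knn(points, center_idx, k=32):
    """Gather the k nearest neighbors of each center (k is an assumed value).

    Returns the centers (G, 3) and the neighbor coordinates expressed
    relative to their center, shape (G, k, 3).
    """
    centers = points[center_idx]
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # (G, N)
    nn_idx = np.argsort(d2, axis=1)[:, :k]
    groups = points[nn_idx]
    return centers, groups - centers[:, None, :]
```

Under these assumptions, `centers` plays the role of the sparse centroid point cloud, and each row of the second return value is the local patch that a tokenizer such as the Point-BERT dVAE would encode into a single token.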

Vision Processing

The full preprocessing pipeline for the point clouds. a) The original point cloud of the scene. b) The scene after position-based cropping to eliminate the elevated stage. c) The point cloud after color-based thresholding. d) The point cloud after removing statistical outliers and adding a base plane. e) The point cloud downsampled to 2048 points.
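Steps (b) through (e) can be sketched in plain NumPy as below. This is an illustration under stated assumptions rather than the authors' code: all function names, the crop bounds, the color range, the outlier parameters (`k`, `std_ratio`), the way the base plane is generated (a grid at the lowest z value), and the use of uniform random downsampling are hypothetical choices that the page does not specify.

```python
import numpy as np


def crop_box(xyz, rgb, lo, hi):
    """(b) Keep points whose position lies inside an axis-aligned box."""
    keep = np.all((xyz >= lo) & (xyz <= hi), axis=1)
    return xyz[keep], rgb[keep]


def color_threshold(xyz, rgb, rgb_lo, rgb_hi):
    """(c) Keep points whose color lies inside an RGB range (values in [0, 1])."""
    keep = np.all((rgb >= rgb_lo) & (rgb <= rgb_hi), axis=1)
    return xyz[keep], rgb[keep]


def remove_statistical_outliers(xyz, k=20, std_ratio=2.0):
    """(d) Drop points whose mean distance to their k nearest neighbors is
    more than std_ratio standard deviations above the average.
    Builds a dense N x N distance matrix, so it suits a few thousand points."""
    sq = (xyz ** 2).sum(1)
    d2 = np.clip(sq[:, None] + sq[None, :] - 2.0 * xyz @ xyz.T, 0.0, None)
    knn = np.sqrt(np.sort(d2, axis=1)[:, 1:k + 1])  # column 0 is the point itself
    mean_d = knn.mean(axis=1)
    keep = mean_d <= mean_d.mean() + std_ratio * mean_d.std()
    return xyz[keep]


def add_base_plane(xyz, spacing=0.005):
    """(d) Append a flat grid of points at the lowest z value (assumed scheme)."""
    lo, hi = xyz.min(0), xyz.max(0)
    gx, gy = np.meshgrid(np.arange(lo[0], hi[0], spacing),
                         np.arange(lo[1], hi[1], spacing))
    plane = np.stack([gx.ravel(), gy.ravel(), np.full(gx.size, lo[2])], axis=1)
    return np.vstack([xyz, plane])


def resample(xyz, n=2048, seed=0):
    """(e) Randomly select exactly n points (with repeats only if fewer exist)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(xyz), size=n, replace=len(xyz) < n)
    return xyz[idx]
```

Chaining the functions in the order `crop_box`, `color_threshold`, `remove_statistical_outliers`, `add_base_plane`, `resample` mirrors panels (b) to (e). In practice a point cloud library would normally replace the dense distance matrix with a spatial index.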

Point-BERT Pre-Trained Model

The dVAE from Point-BERT produces high-quality reconstructions of the real-world clay point clouds without requiring any fine-tuning on our dataset. This allows us to train a latent dynamics model that predicts the material deformation in the Point-BERT embedding space.

Model Next State Predictions

A visualization of next-state predictions on the test set by the latent dynamics model trained on the human demonstration dataset. The model captures and predicts the large geometric changes caused by various grasp actions. Some finer details are not captured, however. This is likely a limitation of the shape reconstruction rather than of the dynamics model, since the missing detail resembles the detail lost during reconstruction alone. This loss is a side effect of leveraging the pre-trained model from Point-BERT, but it is not severe enough to justify training our own point cloud encoder, which would require substantially more data.

Sculpting Results

The shapes a human was able to create when guiding the robot and its parallel gripper (Left) compared to the shapes created by the dynamics model trained on human demonstrations, combined with MPC and geometric sampling (Right). Visually, it is clear that our system does not outperform the human oracle, but the reconstruction metrics show that it successfully recreates the key structural aspects of the shapes.
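The page does not name the reconstruction metrics it uses. As an illustration only, the sketch below computes the Chamfer distance, one common way to compare a sculpted point cloud against a target shape; choosing it here is an assumption, and the function name `chamfer_distance` is hypothetical.

```python
import numpy as np


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point clouds a (N, 3) and b (M, 3).

    For each point, take the Euclidean distance to its nearest neighbor in
    the other cloud, average over each cloud, and sum the two averages.
    """
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # (N, M) squared distances
    a_to_b = np.sqrt(d2.min(axis=1)).mean()
    b_to_a = np.sqrt(d2.min(axis=0)).mean()
    return float(a_to_b + b_to_a)
```

A value of zero means every point in each cloud coincides with a point in the other, and lower values indicate a closer match to the target shape.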