ToolFlowNet: Robotic Manipulation with Tools via

Predicting Tool Flow from Point Clouds

Daniel Seita, Yufei Wang*, Sarthak Shetty*, Edward Li*, Zackory Erickson, David Held

*Equal second-author contribution.

Robotics Institute, Carnegie Mellon University

Conference on Robot Learning (CoRL), 2022


Abstract

Point clouds are a widely available and canonical data modality that conveys the 3D geometry of a scene. Despite significant progress in classification and segmentation from point clouds, policy learning from such a modality remains challenging, and most prior works in imitation learning focus on learning policies from images or state information. In this paper, we propose a novel framework for learning policies from point clouds for robotic manipulation with tools. We use a new neural network, ToolFlowNet, which predicts dense per-point flow on the tool that the robot controls, and then uses the flow to derive the transformation that the robot should execute. We apply this framework to imitation learning of challenging deformable object manipulation tasks with continuous movement of tools, including scooping and pouring, and demonstrate significantly improved performance over baselines which do not use flow. We perform 50 physical scooping experiments with ToolFlowNet and attain 82% scooping success.

CoRL 2022 Presentation

Below is our 1-minute presentation video for CoRL 2022.

CoRL_2022_ToolFlowNet_Presentation_v03.mp4

You can find the slides here, which also contain the transcript (in the slide notes).

You can also find additional slides here for our "poster" session.

ToolFlowNet

Here's a visualization of how ToolFlowNet works, using the ScoopBall example where a segmented point cloud is the input, and the output is the tool transformation (i.e., an action). In the segmented point cloud, we color the ladle using black points and the ball using yellow points (see below for some animations of the segmented point cloud).

ToolFlowNet works by passing the segmented point cloud through a segmentation PointNet++ backbone. After obtaining per-point outputs, which we interpret as "flow," we extract only the tool flow points (and ignore the ones corresponding to the ball). A differentiable SVD layer converts this flow to a transformation which consists of a change in translation and rotation for the tool. In ScoopBall, the rotation center is the tip of the ladle's stick, and in PourWater (the other main environment we use) it is the center of the bottom of the controlled cup.
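As a rough illustration of the flow-to-transformation step, the sketch below fits a rigid transform to tool points and their per-point flow with an SVD (the standard Kabsch/Procrustes solution). It is a minimal NumPy sketch, not the authors' layer: it is not differentiable, and the function name `flow_to_rigid_transform` and the optional `center` argument are hypothetical names introduced here.

```python
import numpy as np

def flow_to_rigid_transform(tool_points, tool_flow, center=None):
    """Fit R, t so that R @ p + t is close to p + flow for every tool point p.

    If `center` is given (an assumed rotation center, e.g. the ladle tip),
    the returned translation is the displacement of that center instead.
    """
    src = np.asarray(tool_points, dtype=float)       # (N, 3) tool points
    dst = src + np.asarray(tool_flow, dtype=float)   # (N, 3) points after flow
    src_mean = src.mean(axis=0)
    dst_mean = dst.mean(axis=0)

    # 3x3 cross-covariance between the centered point sets.
    H = (src - src_mean).T @ (dst - dst_mean)
    U, _, Vt = np.linalg.svd(H)

    # Guard against a reflection so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) > 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    t = dst_mean - R @ src_mean                      # world-frame translation
    if center is not None:
        c = np.asarray(center, dtype=float)
        t = R @ c + t - c                            # displacement of the center
    return R, t
```

Because the fit is a least-squares one over all tool points, noisy per-point flow predictions are averaged into a single rotation and translation.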

To clarify: we are not training a network to do segmentation. We call it a segmentation PointNet++ because the architecture is normally used for segmenting point clouds.

Physical Experiments

Success Cases

The videos below show test-time trials of the trained ToolFlowNet model on the physical setup. Before each trial, a human arbitrarily tosses the target ping-pong ball onto the water. For a given trial to be classified as a success, the robot has to raise the target to above the inner, translucent box without causing collisions with the rest of the physical setup. These videos are shown at 5X speed.

vertical.mp4

Failure Cases

We present two instances of ToolFlowNet failures with the physical experiments. A given trial is classified as a failure if it ends up colliding with the rest of the physical setup or if the robot fails to locate and raise the ball to the appropriate height. These videos are shown at 5X speed.

fail_side_by_side_1.mp4

Simulation Experiments

The Environments

We test two simulation tasks, PourWater and ScoopBall. Furthermore, to study the benefits of ToolFlowNet under different action spaces, we test 3DoF and 6DoF actions for PourWater and 4DoF and 6DoF actions for ScoopBall. For simulation, we build on SoftGym, which uses the NVIDIA FleX physics engine. All of these environments use continuous control with a time horizon of 100 actions.

The 3DoF variant of PourWater is from SoftGym while the 6DoF variant is a more challenging version which allows the full range of translations and rotations. For the 6DoF variant, the starting cup with water is randomized to start in more complex configurations. The center of rotation is the bottom center of the cup (which starts with water). A success is when the agent gets 75% of the water particles into the target container.
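The PourWater success criterion can be sketched as a simple check on particle positions. This is an illustrative sketch only: it assumes the target container can be approximated by an axis-aligned box and that "75%" means "at least 75%"; the function name and arguments are hypothetical and are not taken from SoftGym.

```python
import numpy as np

def pour_water_success(particles, box_min, box_max, threshold=0.75):
    """Return (success, fraction) for water particles inside a target box.

    particles: (N, 3) particle positions.
    box_min, box_max: opposite corners of an assumed axis-aligned target box.
    """
    particles = np.asarray(particles, dtype=float)
    inside = np.all((particles >= box_min) & (particles <= box_max), axis=1)
    fraction = float(inside.mean())
    return fraction >= threshold, fraction
```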

The 4DoF variant of ScoopBall limits the tool to one rotation dimension (about the tip of the ladle's stick) and enables all translations. The 6DoF variant allows for full rotations of the ladle. The center of rotation is the tip of the ladle stick. Note that for the 6DoF variant, we use a different ladle model with a hole at the bottom to allow water to drain through. A success is when the agent brings the ball above a height threshold and maintains it for 10 consecutive time steps.
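The ScoopBall success criterion amounts to finding a run of consecutive time steps with the ball above the threshold. A minimal sketch, assuming the ball height is logged once per time step; the function name and the strict `>` comparison are assumptions made here for illustration.

```python
def scoop_ball_success(ball_heights, height_threshold, required_steps=10):
    """True if the ball stays above the threshold for `required_steps` steps in a row."""
    streak = 0
    for height in ball_heights:
        # Extend the streak while the ball is above the threshold, else reset it.
        streak = streak + 1 if height > height_threshold else 0
        if streak >= required_steps:
            return True
    return False
```

Resetting the streak on any dip below the threshold means that a ball which is lifted briefly and then falls out of the ladle does not count as a success.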

Above, we visualize the two action spaces for PourWater (left) and ScoopBall (right). To indicate the translation, we show the 3 coordinate axes at the tool frame; we do not show one of the three coordinate axes for 3DoF PourWater since it only translates in a 2D plane. To indicate the rotation, we show a curved arrow about any coordinate axis vector if the action space supports a corresponding rotation.

For simulation, we show GIFs of both demonstration data and learned policies below, where each single GIF represents one "episode" of 100 time steps. Note that while we show RGB frames in GIFs for visualization purposes, ToolFlowNet only gets segmented point clouds as input.

Demonstration Data and Flow Visualizations

For each of the environments and action spaces, we implement scripted algorithmic demonstrators. Visualizations of these are shown below. When collecting data, we support extracting both RGBD images and point cloud observational data. We also extract tool flow from point clouds.
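One plausible way to obtain ground-truth tool flow, sketched below, is to apply the demonstrator's rigid action to every tool point and subtract the original positions. This assumes the action is available as a rotation matrix and translation about a known rotation center; the function name is hypothetical, and the actual data-collection code may compute flow differently.

```python
import numpy as np

def tool_flow_from_action(tool_points, rotation, translation, center):
    """Per-point flow induced by rotating about `center` and then translating.

    tool_points: (N, 3) points on the tool.
    rotation: (3, 3) rotation matrix; translation, center: length-3 vectors.
    """
    pts = np.asarray(tool_points, dtype=float)
    c = np.asarray(center, dtype=float)
    R = np.asarray(rotation, dtype=float)
    moved = (pts - c) @ R.T + c + np.asarray(translation, dtype=float)
    return moved - pts  # (N, 3) flow vectors
```

A point located exactly at the rotation center moves only by the translation, while points farther from the center receive larger flow under the same rotation.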

PourWater (3DoF and 6DoF)

Here are representative demonstration rollouts for PourWater, for 3DoF (top row) and 6DoF (bottom row) action spaces. For both action spaces, the algorithmic demonstrator moves the box to the target, lifts, and pours. The 3DoF action space limits the agent to moving the cup (that starts with water) in a 2D plane and 1D rotation, while the 6DoF action space allows for 3D translations and 3D rotations. In the 6DoF case, the demonstrator first rotates the starting cup so that it is aligned with the target, and then moves and pours. 

This video shows an example PourWater (6DoF) demonstration. It has three frames side by side at each time step showing: (1) RGB images, (2) segmented point clouds, and (3) tool flow. The three frames are all time-aligned, though the camera angles differ slightly. For the segmented point cloud, the black points represent the tool, the yellow points represent the target box, and the red points represent the water. For the tool flow, the blue points represent the tool, and the red vectors represent the ground-truth tool flow (slightly enlarged for visual clarity).

PourWater6D.mp4

ScoopBall (4DoF and 6DoF)

Here are representative demonstration rollouts for ScoopBall, for 4DoF (top row) and 6DoF (bottom row) action spaces. For the 4DoF action space, the algorithmic demonstrator lowers the ladle, rotates while continually translating to get the ball, and then rotates back to the neutral position. The rotation is designed to make sure the ladle's "bowl" is facing the direction of the ball. (There is sometimes an initial movement of the ladle, which we implemented mainly to increase the success rate of the demonstrator.) For the 6DoF action space, we script a full 3DoF rotation to rotate the ladle and then have it translate towards the ball. The ladle then rotates back to near its starting rotation while lifting upwards. We use a ladle with a hole in it to let water drain, which greatly improved the success rate of the algorithmic demonstrator because water particles in FleX simulation can often push balls out unnaturally.

This video shows an example ScoopBall (6DoF) demonstration. It has three frames side by side at each time step showing: (1) RGB images, (2) segmented point clouds, and (3) tool flow. The three frames are all time-aligned, though the camera angles differ slightly. For the segmented point cloud, the black points represent the tool and the yellow points represent the target ball to scoop. We do not use water in the point clouds in this environment. For the tool flow, the blue points represent the tool, and the red vectors represent the ground-truth tool flow (slightly enlarged for visual clarity).

ScoopBall6D.mp4

ToolFlowNet: Learned Policy Rollouts

We present GIFs of test-time rollouts of the ToolFlowNet policy, trained with Behavioral Cloning. These are on held-out starting configurations (unseen in training).

PourWater (3DoF and 6DoF)

Left: A representative set of 25 rollouts of the learned policy from ToolFlowNet after 500 epochs of Behavioral Cloning on PourWater (3DoF). This particular set of 25 evaluation episodes shows 17/25 successes. The most common failure case, as shown below, is when the policy does not lift the box with water.

Right: A representative set of 25 rollouts of the learned policy from ToolFlowNet after 500 epochs of Behavioral Cloning on PourWater (6DoF) with training on 100 demonstrations. This particular set of 25 evaluation episodes shows 16/25 successes. Failures can happen due to imprecision with pouring by missing the target, or (in rarer cases) the policy being unable to pour at all. Bigger cups/boxes can potentially cause difficulties as well.

Below, we show an example successful ToolFlowNet rollout at test time for PourWater (6DoF), showing the RGB images (left), segmented point cloud inputs (middle) and the tool points with predicted flow vectors in red (right), where the red flow vectors are enlarged for visual clarity. See this page for additional test-time videos.

ep_00_ALL.mp4

ScoopBall (4DoF and 6DoF)

Left: A representative set of 25 rollouts of the learned policy from ToolFlowNet after 500 epochs of Behavioral Cloning on ScoopBall (4DoF) with training on 100 demonstrations. This particular set of 25 evaluation episodes shows 15/25 successes. The typical failure cases include (1) the ball being pushed underwater due to FleX physics artifacts, and (2) the ball falling out of the ladle as the ladle moves upwards, due to water particles "pushing" it away.

Right: A representative set of 25 rollouts of the learned policy from ToolFlowNet after 500 epochs of Behavioral Cloning on ScoopBall (6DoF) with training on just 25 demonstrations. This particular set of 25 evaluation episodes shows 25/25 successes.

Below, we show an example successful ToolFlowNet rollout at test time for ScoopBall (6DoF), showing the RGB images (left), segmented point cloud inputs (middle) and the tool points with predicted flow vectors in red (right), where the red flow vectors are enlarged for visual clarity. See this page for additional test-time videos. (To make the video easier to follow, we keep the coordinate frame the same, which cuts off part of the tool point cloud and flow when the tool exceeds the coordinate range of the plot.)

ep_23_ALL_succeed_True.mp4

Baseline: Reinforcement Learning (SAC-CURL)

As a comparison, we run reinforcement learning using SAC and CURL from RGB images on the same two simulation environments (PourWater and ScoopBall) and for both of the supported action spaces within each environment. We test reinforcement learning to highlight the differences in how policies learn with imitation learning versus reinforcement learning, in terms of sample efficiency (number of environment interactions) as well as qualitative and quantitative performance of the final learned policies.


PourWater (3DoF and 6DoF)

The plot to the right shows binary success rates for PourWater (both action variants) for SAC-CURL over 1 million training steps. At the end of training, the maximum success rate along the curves for SAC-CURL with 3DoF and 6DoF actions is just 0.031 and 0.001, respectively. For reference, we overlay the raw success rate of ToolFlowNet from Behavioral Cloning.

Below are test-time rollout examples of PourWater for 3DoF (left) and 6DoF (right) action spaces after 1 million training steps.

The learned policies show some jerky and unnatural pouring behavior, and particularly with the 6DoF action space, the policy appears unable to get even a small fraction of the water particles into the target. This is backed up by the quantitative results shown in the plot, indicating that SAC-CURL has significantly worse policy performance compared to imitation learning. Furthermore, the policies from imitation learning rely on a dataset of 100 demonstrations with 100 time steps each, which is just 10,000 total (observation, action) pairs.

Note: for the 3DoF action space, the SAC-CURL results are consistent with those reported in the CoRL 2020 SoftGym paper, which used the 3DoF PourWater environment. The CoRL 2020 paper reported the percentage of water particles in the target cup, and the final performance was about 35-40%. This is significantly lower than the binary success cutoff we use in this paper, which requires 75% of the water particles to be in the target, and this explains the 3DoF 0.031 success rate we see for SAC-CURL.

ScoopBall (4DoF and 6DoF)

The plot to the right shows binary success rates for ScoopBall (both action variants) for SAC-CURL over 1 million training steps. At the end of training, the maximum success rate along the curves for SAC-CURL with 4DoF and 6DoF actions is 0.891 and 0.788, respectively. For reference, we overlay the raw success rate of ToolFlowNet from Behavioral Cloning.

Below are test-time rollout examples of ScoopBall for 4DoF (left) and 6DoF (right) action spaces after 1 million training steps.

Unlike with PourWater, here the learned RL policies can attain high success rates by scooping the ball. However, the policies continue to exhibit jerky and rapid behavior (which can knock balls out of control, as shown in some of the failures below). With the 6DoF action space, the policy even after 1 million steps has lower performance (0.750 success rate) than ToolFlowNet, which trains from just 25 demonstrations of 100 time steps each, or 2,500 (observation, action) pairs, and achieves a 0.952 success rate.

In the 4DoF scooping case, the SAC-CURL policy is able to attain reasonably good performance. However, this is because the policy learns to avoid scooping the water (which can push the ball away when scooping). In addition, the purpose of our 4DoF scooping experiments was less about attaining high success rates and more about whether the learned policy could accurately imitate the scripted demonstrator. Furthermore, ToolFlowNet uses just 10,000 offline (observation, action) pairs; after the same number of training steps, the RL policy was still attaining a test-time success rate of 0.000, and for 4DoF scooping it requires more than 400,000 steps to match the performance of ToolFlowNet.
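The data budgets quoted in this section can be checked with a few lines of arithmetic. The snippet below is only a back-of-the-envelope comparison using the numbers stated above; the helper name `num_pairs` is hypothetical. Note that offline demonstration pairs and online RL environment steps are different kinds of data, so the ratios are indicative rather than an exact like-for-like measure.

```python
def num_pairs(num_demos, steps_per_demo=100):
    """Number of (observation, action) pairs in a demonstration dataset."""
    return num_demos * steps_per_demo

RL_STEPS = 1_000_000  # SAC-CURL training steps reported above

pairs_100_demos = num_pairs(100)  # 100 demonstrations (e.g., PourWater)
pairs_25_demos = num_pairs(25)    # 25 demonstrations (ScoopBall 6DoF)

print(pairs_100_demos, RL_STEPS // pairs_100_demos)  # 10000 100
print(pairs_25_demos, RL_STEPS // pairs_25_demos)    # 2500 400
```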

BibTeX

@inproceedings{Seita2022toolflownet,
    title={{ToolFlowNet: Robotic Manipulation with Tools via Predicting Tool Flow from Point Clouds}},
    author={Seita, Daniel and Wang, Yufei and Shetty, Sarthak and Li, Edward and Erickson, Zackory and Held, David},
    booktitle={Conference on Robot Learning (CoRL)},
    year={2022}
}

Acknowledgments

This work was supported by LG Electronics and by NSF CAREER grant IIS-2046491. We thank Brian Okorn and Chu Er Pan for assistance with the differentiable SVD layer, and Mansi Agrawal, Sashank Tirumala, and Thomas Weng for paper writing feedback.