Visual Haptic Reasoning: Estimating Contact Forces by Observing Deformable Object Interactions

Yufei Wang, David Held, Zackory Erickson

IEEE Robotics and Automation Letters (RA-L), with presentation at IROS 2022

Paper | Code (coming soon)

Abstract
Robotic manipulation of highly deformable cloth presents a promising opportunity to assist people with several daily tasks, such as washing dishes; folding laundry; or dressing, bathing, and hygiene assistance for individuals with severe motor impairments. In this work, we introduce a formulation that enables a collaborative robot to perform visual haptic reasoning with cloth---the act of inferring the location and magnitude of applied forces during physical interaction. We present two distinct model representations, trained in physics simulation, that enable haptic perspective-taking using only visual and robot kinematic observations. We conducted quantitative evaluations of these models in simulation for robot-assisted dressing, bathing, and dish washing tasks, and demonstrate that the trained models can generalize across different tasks with varying interactions, human body sizes, and object shapes. We also present results with a real-world mobile manipulator, which used our simulation-trained models to estimate applied contact forces while performing physically assistive tasks with cloth.

Summary Video

Visual Haptic Reasoning for Robot-Assisted Dressing

For real-world videos, the left side shows the task observation and the right side shows the visual haptic model predictions.
For simulation videos, the left side shows the visual haptic model predictions and the right side shows the ground truth.

Color indicates the magnitude of the contact normal forces; red indicates larger forces.

output (27).mp4

Dressing trajectory 1 in real world

output (4).mp4

Dressing trajectory 2 in real world

output (5).mp4

Dressing trajectory 3 in real world

output (6).mp4

Dressing trajectory 4 in real world

output (17).mp4

Dressing trajectory 1 in simulation

output (15).mp4

Dressing trajectory 2 in simulation

output (16).mp4

Dressing trajectory 3 in simulation

output (19).mp4

Dressing trajectory 4 in simulation

Visual Haptic Reasoning for Robot-Assisted Bathing

output (1).mp4

Bathing trajectory 1 in real world

output (2).mp4

Bathing trajectory 2 in real world

output (28).mp4

Bathing trajectory 3 in real world

output (9).mp4

Bathing trajectory 1 in simulation

output (11).mp4

Bathing trajectory 2 in simulation

output (10).mp4

Bathing trajectory 3 in simulation

Visual Haptic Reasoning for Robot Dish Washing

plate-GNN-S-small-contact-threshold.mp4

Robot dish washing in real world

plate-camera-ready-1-pn++-a.mp4

Robot dish washing in simulation

Failure Case

bathing-2-PointNet++-A.mp4

This is a failure case: when the washcloth is lifted above the manikin limb and is not in contact with it, our contact model incorrectly predicts contact before the washcloth actually touches the limb.

Detailed statistics of the collected datasets

The exact number of data points in each collected dataset is summarized in the table above.

The average numbers of object points and cloth points input to our model are also reported in the table above.

Implementation Details

Detailed Model architectures

For PointNet++, we use the standard segmentation architecture from the original paper, which predicts per-point values, i.e., contact probability and force magnitude in our case. For both the force and contact models, we use 3 set abstraction layers, followed by global max pooling, and then 3 feature propagation layers. The set abstraction radii are [0.05, 0.1, 0.2], and the sampling ratios are [0.4, 0.5, 0.6]. The numbers of nearest neighbors for the feature interpolation layers are [1, 3, 3].
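
As a concrete illustration, the sketch below (PyTorch Geometric, not the released code) shows one way the set abstraction and feature propagation layers described above could be assembled. The radii, sampling ratios, and interpolation k values follow the text; the MLP widths, the input feature dimension, and the way the globally max-pooled feature is appended to the coarsest level are assumptions.

```python
import torch
from torch_geometric.nn import MLP, PointNetConv, fps, radius, knn_interpolate, global_max_pool


class SAModule(torch.nn.Module):
    """Set abstraction: farthest-point sampling, radius grouping, PointNetConv."""
    def __init__(self, ratio, r, nn):
        super().__init__()
        self.ratio, self.r = ratio, r
        self.conv = PointNetConv(nn, add_self_loops=False)

    def forward(self, x, pos, batch):
        idx = fps(pos, batch, ratio=self.ratio)
        row, col = radius(pos, pos[idx], self.r, batch, batch[idx], max_num_neighbors=64)
        edge_index = torch.stack([col, row], dim=0)
        x = self.conv((x, None if x is None else x[idx]), (pos, pos[idx]), edge_index)
        return x, pos[idx], batch[idx]


class FPModule(torch.nn.Module):
    """Feature propagation: kNN-interpolate coarse features back to denser points."""
    def __init__(self, k, nn):
        super().__init__()
        self.k, self.nn = k, nn

    def forward(self, x, pos, batch, x_skip, pos_skip, batch_skip):
        x = knn_interpolate(x, pos, pos_skip, batch, batch_skip, k=self.k)
        if x_skip is not None:
            x = torch.cat([x, x_skip], dim=1)
        return self.nn(x), pos_skip, batch_skip


class PointNet2Seg(torch.nn.Module):
    """Per-point prediction head (contact probability or force magnitude)."""
    def __init__(self, in_dim=3, out_dim=1):
        super().__init__()
        # 3 set abstraction layers with radii [0.05, 0.1, 0.2] and ratios [0.4, 0.5, 0.6].
        self.sa1 = SAModule(0.4, 0.05, MLP([in_dim + 3, 64, 64]))
        self.sa2 = SAModule(0.5, 0.1, MLP([64 + 3, 128, 128]))
        self.sa3 = SAModule(0.6, 0.2, MLP([128 + 3, 256, 256]))
        # 3 feature propagation layers with k = [1, 3, 3] nearest neighbors.
        self.fp3 = FPModule(1, MLP([256 + 256 + 128, 256, 256]))
        self.fp2 = FPModule(3, MLP([256 + 64, 128, 128]))
        self.fp1 = FPModule(3, MLP([128 + in_dim, 128, 128]))
        self.head = MLP([128, 128, out_dim], norm=None)

    def forward(self, x, pos, batch):
        # x: (N, in_dim) per-point input features (assumed present), pos: (N, 3), batch: (N,)
        sa0 = (x, pos, batch)
        sa1 = self.sa1(*sa0)
        sa2 = self.sa2(*sa1)
        sa3 = self.sa3(*sa2)
        # Global max pooling over the coarsest level, appended to its per-point features.
        g = global_max_pool(sa3[0], sa3[2])
        x3 = torch.cat([sa3[0], g[sa3[2]]], dim=1)
        x, pos, batch = self.fp3(x3, sa3[1], sa3[2], *sa2)
        x, pos, batch = self.fp2(x, pos, batch, *sa1)
        x, pos, batch = self.fp1(x, pos, batch, *sa0)
        return self.head(x)
```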

For GNN, we use the standard GNS (Graph Network-based Simulator) architecture. It consists of a node encoder and an edge encoder that map the node and edge features into latent embeddings, a processor that performs message passing over the latent embeddings, and a decoder that decodes the latent embeddings into the prediction targets, i.e., the contact probability or the force magnitude. For both the contact and force GNN models, the node and edge encoders are Multi-Layer Perceptrons (MLPs) with [128, 128] neurons and ReLU activations. The processor consists of 4 GN blocks, where each GN block contains an edge update module and a node update module for performing message passing. Both the edge and node update modules are MLPs with [128, 128] neurons and ReLU activations. The decoder is an MLP with [128, 128] neurons and ReLU activations.
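
A rough sketch of this encode-process-decode structure in plain PyTorch is given below (an assumed re-implementation, not the released code). The [128, 128] MLP widths and the 4 message-passing blocks follow the text; the input feature dimensions, the residual connections, and the sum aggregation of incoming edge messages are assumptions.

```python
import torch
import torch.nn as nn


def mlp(in_dim, hidden=(128, 128), out_dim=128):
    """MLP with [128, 128] hidden neurons and ReLU activations, linear output."""
    layers, d = [], in_dim
    for h in hidden:
        layers += [nn.Linear(d, h), nn.ReLU()]
        d = h
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)


class GNBlock(nn.Module):
    """One message-passing step: update edges from their endpoints, then nodes from aggregated edges."""
    def __init__(self, dim=128):
        super().__init__()
        self.edge_mlp = mlp(3 * dim, out_dim=dim)   # input: [sender, receiver, edge]
        self.node_mlp = mlp(2 * dim, out_dim=dim)   # input: [node, aggregated incoming edges]

    def forward(self, h, e, edge_index):
        src, dst = edge_index                       # edge_index: (2, num_edges)
        e = e + self.edge_mlp(torch.cat([h[src], h[dst], e], dim=-1))   # residual edge update
        agg = torch.zeros_like(h).index_add_(0, dst, e)                 # sum incoming messages
        h = h + self.node_mlp(torch.cat([h, agg], dim=-1))              # residual node update
        return h, e


class GNSNet(nn.Module):
    def __init__(self, node_in, edge_in, out_dim=1, dim=128, num_blocks=4):
        super().__init__()
        self.node_enc = mlp(node_in, out_dim=dim)
        self.edge_enc = mlp(edge_in, out_dim=dim)
        self.blocks = nn.ModuleList(GNBlock(dim) for _ in range(num_blocks))
        self.decoder = mlp(dim, out_dim=out_dim)    # per-node contact probability or force magnitude

    def forward(self, x, edge_attr, edge_index):
        h, e = self.node_enc(x), self.edge_enc(edge_attr)
        for block in self.blocks:
            h, e = block(h, e, edge_index)
        return self.decoder(h)
```

The graph connectivity, i.e., which cloth and object points become nodes and how the edges between them are built, follows the dataset construction described in the paper and is not shown in this sketch.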


FleX Simulation Parameters

For the NVIDIA FleX simulator, the particle radius is 0.625 cm. Because the tasks use cloth of different sizes, we tune the stiffness of the stretch, bending, and shear constraints to ensure stable cloth simulation behaviour for each task. The stiffnesses of these constraints are set to [1.7, 1.7, 1.7] for assistive dressing and [1.0, 0.9, 0.8] for the other three tasks. We set the contact threshold to 5 mm.
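
For reference, these parameters can be summarized compactly as below; the dictionary keys are hypothetical and do not correspond to an actual FleX or SoftGym API.

```python
# Illustrative summary of the simulation parameters above; keys are hypothetical.
FLEX_CLOTH_PARAMS = {
    "particle_radius": 0.00625,        # meters (0.625 cm)
    "contact_threshold": 0.005,        # meters (5 mm)
    # [stretch, bend, shear] constraint stiffnesses per task
    "cloth_stiffness": {
        "assistive_dressing": [1.7, 1.7, 1.7],
        "assistive_bathing":  [1.0, 0.9, 0.8],
        "dish_washing":       [1.0, 0.9, 0.8],
        "primitive_shapes":   [1.0, 0.9, 0.8],
    },
}
```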


Detailed Mesh Sizes

The hospital gown mesh used for the assistive dressing task consists of 10436 particles, which form 20564 triangle faces. The washcloth used in assistive bathing and dish washing is a square towel represented by a 20x20 grid of particles. The rectangular towel used in the primitive shapes task is represented by a 40x80 grid of particles. The SMPL-X human mesh has 10475 vertices and 20908 triangle faces. The three plates from ShapeNet have 7550, 5841, and 2544 vertices and 15096, 11678, and 5084 triangle faces, respectively.


Model performance under noise in the point cloud

The figure above shows the model's performance under different levels of noise injected into the point cloud. The noise is modeled as an independent Gaussian distribution for each point in the point cloud, and we vary the standard deviation of the Gaussian to represent different noise levels.

The left plot shows the force prediction mean absolute error under different noise levels, and the right plot shows the contact prediction F1 score under different noise levels. The x axis is in units of meters. We note that the force model is robust to small amounts of noise, while the contact model is robust to larger amounts of noise.

Note that the models are trained without injecting any noise. We could potentially make the models more robust to such noise by injecting noise into the point cloud at training time, as sketched below.
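
A minimal sketch of this per-point Gaussian noise model is given below; the function name and the standard deviation in the usage example are illustrative, not the exact settings evaluated in the figure.

```python
import numpy as np


def add_point_cloud_noise(points, std, rng=None):
    """Add i.i.d. zero-mean Gaussian noise (std in meters) to each point independently."""
    rng = np.random.default_rng() if rng is None else rng
    return points + rng.normal(scale=std, size=points.shape)


# Example: perturb an (N, 3) point cloud with 5 mm standard deviation noise.
noisy_points = add_point_cloud_noise(np.random.rand(1024, 3), std=0.005)
```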

More experimental results

We also report the percentage error, i.e., the ratio between the force prediction MAE and the mean ground-truth force of the dataset, for the best method on each dataset. For assistive dressing, PointNet++-S has a percentage error of 33.1%; for assistive bathing, GNN-S has 34.0%; for dish washing, GNN-A has 43.2%; and for primitive shapes, GNN-S has 68%, due to the large variation in this task.
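
For clarity, the percentage error above is computed as follows (a hypothetical helper, not from the released code):

```python
def percentage_error(force_mae, mean_gt_force):
    """Ratio between the force prediction MAE and the mean ground-truth force, in percent."""
    return 100.0 * force_mae / mean_gt_force
```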

On an NVIDIA 3090 GPU, the inference time is ~10 ms for both the GNN force and contact models, and ~20 ms for both the PointNet++ force and contact models.