S3K: Self-Supervised Semantic Keypoints for Robotic Manipulation via Multi-View Consistency

Overview

3D keypoints are a useful representation for solving robotics tasks. However, training a deep model to detect them accurately can be cumbersome because it requires a lot of labelled data. In this work we investigate the effect of adding a self-supervised loss to decrease the amount of labels needed. We show that, by using multi-view geometry across 4 calibrated cameras, unlabelled data can be used almost as efficiently as labelled data in certain circumstances - provided there is enough labelling to ground the semantics of each keypoint. The advantage over texture-based methods is that the keypoints do not need to be fixed to a surface location. Indeed, they do not need to be fixed to any physical location at all, as long as the location is consistent across views and episodes.

In this paper, we investigate this in a simple simulated environment and then demonstrate how these keypoints can be used to script a grasp of a highly deformable object - a plush fox. Further, we show how these keypoints can be used to solve a flexible cable insertion problem, using reinforcement learning to learn the controller.

Paper link: arxiv

Paper Abstract

A robot’s ability to act is fundamentally constrained by what it can perceive. Many existing approaches to visual representation learning utilize general-purpose training criteria, e.g. image reconstruction, smoothness in latent space, or usefulness for control, or else make use of large datasets annotated with specific features (bounding boxes, segmentations, etc.). However, both approaches often struggle to capture the fine detail required for precision tasks on specific objects, e.g. grasping and mating a plug and socket. We argue that these difficulties arise from a lack of geometric structure in these models. In this work we advocate semantic 3D keypoints as a visual representation, and present a semi-supervised training objective that can allow instance or category-level keypoints to be trained to 1-5 millimeter-accuracy with minimal supervision. Furthermore, unlike local texture-based approaches, our model integrates contextual information from a large area and is therefore robust to occlusion, noise, and lack of discernible texture. We demonstrate that this ability to locate semantic keypoints enables high level scripting of human understandable behaviours. Finally, we show that these keypoints provide a good way to define reward functions for reinforcement learning and are a good representation for training agents.

Model Overview

Our model predicts location heatmaps for each camera and keypoint. For the supervised path we require annotations in at least 2 out of 4 cameras. For unlabelled data, we instead use the coordinates and uncertainties derived from the model's own heatmaps to estimate the 3D keypoint locations. These 3D locations are then projected back to the camera frames and provide a learning target for the model via a KL-loss.

Supervised Path

This part is easy -- given 2D labels from two or more views we compute the corresponding 3D point via ray-intersection. Once we have this point we project it back to all available cameras (thanks to their calibration) to obtain a 2D point in each image where we expect to see the point. The supervised loss is just the KL-divergence between the model's output (softmaxed to produce a heatmap) and a Gaussian at this point.
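
A minimal numpy sketch of this triangulate-then-reproject step, assuming standard 3x4 camera projection matrices (the function names and the DLT-style solver are illustrative, not the paper's implementation):

    import numpy as np

    def triangulate(points_2d, proj_mats):
        """Least-squares ray intersection from two or more labelled views.

        points_2d: list of (u, v) pixel labels, one per labelled camera.
        proj_mats: matching list of 3x4 camera projection matrices.
        """
        rows = []
        for (u, v), P in zip(points_2d, proj_mats):
            # Each view contributes two linear constraints on the homogeneous 3D point.
            rows.append(u * P[2] - P[0])
            rows.append(v * P[2] - P[1])
        _, _, vt = np.linalg.svd(np.stack(rows))
        x_h = vt[-1]              # right singular vector with the smallest singular value
        return x_h[:3] / x_h[3]

    def reproject(point_3d, proj_mat):
        """Project the 3D point back into a camera to get its expected 2D location."""
        u_h = proj_mat @ np.append(point_3d, 1.0)
        return u_h[:2] / u_h[2]

The reprojected points then serve as the centres of the Gaussian targets described next.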


Gaussian Heatmaps:
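
A plausible form for this target, assuming an isotropic Gaussian with a fixed spread \sigma centred on the reprojected 2D label \hat{u}_{c,k} (a sketch; the exact parameterization in the paper may differ):

    G_{c,k}(u) \propto \exp\left( -\frac{\lVert u - \hat{u}_{c,k} \rVert^{2}}{2\sigma^{2}} \right)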

where the subscripts indicate the camera c and keypoint k.

Supervised Loss:
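
Writing H_{c,k} for the model's softmaxed heatmap and G_{c,k} for the Gaussian target above, the loss can be sketched as a sum of per-camera, per-keypoint KL terms (the argument order of the KL is an assumption):

    \mathcal{L}_{\mathrm{sup}} = \sum_{c} \sum_{k} D_{\mathrm{KL}}\left( G_{c,k} \,\Vert\, H_{c,k} \right)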

Unsupervised Path

What do we do if there are no 2D annotations for a given time-point? The same thing (almost)! Specifically, we take the mean of each camera's heatmap as a proxy for the label, and then repeat the estimation & re-projection step from the supervised loss. This amounts to enforcing view-consistency, which is a general-purpose but powerful bias towards geometrically meaningful representations.

Small tweak: it turned out to be handy to add an uncertainty estimate to each 2D prediction while solving for the 3D point, which corresponds to a weighted least-squares problem. We obtained these uncertainties directly from the heatmap variance.
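
A minimal numpy sketch of how the 2D pseudo-label and its uncertainty can be read off a softmaxed heatmap (the exact variance estimate in the paper may differ); the mean serves as the measurement and the inverse variance as its weight in the least-squares problem below:

    import numpy as np

    def heatmap_moments(heatmap):
        """Spatial mean and variance of a heatmap whose entries sum to 1.

        heatmap: (H, W) array of non-negative weights (spatial softmax output).
        """
        h, w = heatmap.shape
        vs, us = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
        coords = np.stack([us, vs], axis=-1).astype(np.float64)   # (H, W, 2) pixel coordinates
        mean = (heatmap[..., None] * coords).sum(axis=(0, 1))     # expected (u, v) location
        sq_dist = ((coords - mean) ** 2).sum(axis=-1)
        var = (heatmap * sq_dist).sum()                           # spread around the mean
        return mean, var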

Solving for the estimated 3D location (weighted least-squares):
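
One way to write this objective, with \mu_{c,k} and \sigma^{2}_{c,k} the mean and variance of camera c's heatmap for keypoint k, and \pi_{c} the calibrated projection into camera c (notation assumed, not taken from the paper):

    \hat{x}_{k} = \arg\min_{x} \sum_{c} \frac{1}{\sigma^{2}_{c,k}} \left\lVert \pi_{c}(x) - \mu_{c,k} \right\rVert^{2}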

Unsupervised Loss:
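
This is the same KL objective as the supervised loss, but with the Gaussian target centred on the reprojection of the self-estimated 3D point \hat{x}_{k} (again a sketch of the form rather than the paper's exact notation):

    \mathcal{L}_{\mathrm{unsup}} = \sum_{c} \sum_{k} D_{\mathrm{KL}}\left( G_{c,k}\left( \cdot \,;\, \pi_{c}(\hat{x}_{k}) \right) \,\Vert\, H_{c,k} \right)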

Network Architecture

We used a U-Net architecture to allow high-level features to be learned while preserving spatial resolution. We added 3 resnet-like blocks at the lowest resolution to provide extra network capacity.
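
An illustrative miniature of such an architecture in PyTorch, with a single downsampling level for brevity (channel counts, depth, and skip structure here are assumptions; the actual model is larger):

    import torch
    import torch.nn as nn

    class ResBlock(nn.Module):
        def __init__(self, ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(ch, ch, 3, padding=1))

        def forward(self, x):
            return torch.relu(x + self.body(x))  # residual connection

    class KeypointUNet(nn.Module):
        """Tiny U-Net: downsample, residual bottleneck, upsample with a skip."""

        def __init__(self, num_keypoints, ch=32):
            super().__init__()
            self.down1 = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
            self.down2 = nn.Sequential(nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1), nn.ReLU())
            self.bottleneck = nn.Sequential(*[ResBlock(2 * ch) for _ in range(3)])
            self.up = nn.ConvTranspose2d(2 * ch, ch, 2, stride=2)
            self.head = nn.Conv2d(2 * ch, num_keypoints, 1)   # per-keypoint heatmap logits

        def forward(self, x):
            s1 = self.down1(x)                           # full-resolution features
            b = self.bottleneck(self.down2(s1))          # low-resolution residual blocks
            u = self.up(b)                               # back to full resolution
            return self.head(torch.cat([u, s1], dim=1))  # (B, K, H, W) logits, softmaxed spatially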

The Value of Unlabeled Data

We use a simple simulated environment in which we learn to detect 3 corners of a textured box. The background of the scene is similar to that in our physical robotics experiments.

Error vs. Labeled Examples

Each curve corresponds to a number of unlabelled data-points. Given 10k unlabelled images, 10-20 labeled samples are sufficient to train an accurate keypoint detector.

Error vs. Overall Dataset Size

This figure contains the same data, but expressed on a single plot with colors indicating the ratio of labeled to unlabelled data. Beyond a minimum number, labeled and unlabeled data are equally valuable in driving down model errors.

Keypoints for Reinforcement Learning

Model Training

The goal of this task is to insert a cable that is non-rigidly grasped by the robot (i.e. the robot holds it by the flexible cable rather than by the plug). To solve it we trained S3K to detect 3 keypoints -- two on the plug tip and one on the socket (see below).

This task makes full use of the self-supervised pathway in the model. In fact, it was this setting that initially motivated our approach, since we can't utilize the label-propagation trick on non-static scenes, and it's tedious to label 10,000 images!

Cable Keypoint Annotation

Note we annotated the plug tip (red) to be at the socket after insertion starts, when the tip is no longer visible.

Cable Keypoint Detection

Model detections for two points in time. Each row is a single time-step.

S3K for Reinforcement Learning

We compared agent performance using the S3K model to a 10-dimensional VAE embedding of the scene from all 4 cameras. Both models were trained on the same dataset, but the S3K model additionally had access to 120 labeled images. S3K performed significantly better than the VAE. We believe this is due to the much higher accuracy of its task-relevant features.

The core RL algorithm was DPGfD, and the training process was similar to the approach described in the EDRIAD paper.

Learning curves for VAE-based and keypoint-based agents. The keypoint agent was significantly faster at solving the task.

Related Work Links

For convenience we provide links to some of the work related to this paper:

Dense object nets: paper, video
kPAM: site
Keypose: paper, site
Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation: paper
EDRIAD: paper, site

Cite as

@inproceedings{vecerik2020s3k,
  title={S3K: Self-Supervised Semantic Keypoints for Robotic Manipulation via Multi-View Consistency},
  author={Mel Vecerik and Jean-Baptiste Regli and Oleg Sushkov and David Barker and Rugile Pevceviciute and Thomas Rothörl and Christopher Schuster and Raia Hadsell and Lourdes Agapito and Jonathan Scholz},
  booktitle={Conference on Robot Learning (CoRL)},
  year={2020},
}