TACK: Few-Shot Keypoint Detection as Task Adaptation via Latent Embeddings

Mel Vecerik1,2, Jackie Kay1,2, Raia Hadsell2, Lourdes Agapito1, Jon Scholz2

1University College London, UK, 2DeepMind, UK

Abstract

Dense object tracking, the ability to localize specific object points with pixel-level accuracy, is an important computer vision task with numerous downstream applications in robotics. Existing approaches either compute dense keypoint embeddings in a single forward pass, meaning the model is trained to track everything at once, or allocate their full capacity to a sparse predefined set of points, trading generality for accuracy. In this paper we explore a middle ground based on the observation that the number of relevant points at a given time is typically quite small, e.g. grasp points on a target object. Our main contribution is a novel architecture, inspired by few-shot task adaptation, which allows a sparse-style network to condition on a keypoint embedding that indicates which point to track. Our central finding is that this approach provides the generality of dense-embedding models while offering accuracy significantly closer to sparse-keypoint approaches. We present results illustrating this capacity vs. accuracy trade-off, and demonstrate the ability to zero-shot transfer to new object instances (within class) using a real-robot pick-and-place task.

Paper link: arxiv

Our approach, TACK, can take a single annotation (blue) and infer a latent identity embedding. This embedding can be used to detect that point from novel views, even when parts of the object are occluded.

Combining Task Adaptation and Conditional Approaches

TACK is trained with a combination of task-adaptation and conditional autoencoder losses. Both split detecting a point into two steps: 1) identify which point is currently being detected, i.e. infer the task; 2) detect this point in an image. The first step is performed by an encoder model (blue below), which maps images and their annotations to low-dimensional (4-dim) point identity vectors, c𝜏. These are then combined with an image by the decoder model (red below) to detect the point location. All weights are shared between the encoder and decoder models.
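To make this split concrete, below is a minimal PyTorch sketch of the encoder/decoder structure. Only the 4-dimensional identity embedding comes from the text above; all layer sizes and the annotation encoding (a single Gaussian heatmap channel) are illustrative assumptions, and the paper's actual architecture differs.

```python
# Minimal sketch of the TACK-style encoder/decoder split (hypothetical
# layer sizes; only the 4-dim identity embedding is taken from the text).
import torch
import torch.nn as nn

EMBED_DIM = 4  # dimensionality of the point identity vector c_tau

class Encoder(nn.Module):
    """Maps an image plus a keypoint annotation to the identity vector c_tau."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            # 3 RGB channels + 1 annotation channel (e.g. a Gaussian heatmap)
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, EMBED_DIM)

    def forward(self, image, annotation):
        # image: (B, 3, H, W); annotation: (B, 1, H, W)
        x = torch.cat([image, annotation], dim=1)
        return self.fc(self.conv(x).flatten(1))  # (B, EMBED_DIM)

class Decoder(nn.Module):
    """Detects the point described by c_tau in a (possibly different) image."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3 + EMBED_DIM, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),  # per-pixel detection logits
        )

    def forward(self, image, c_tau):
        # Broadcast c_tau over the spatial grid and concatenate with the image.
        b, _, h, w = image.shape
        c_map = c_tau[:, :, None, None].expand(b, EMBED_DIM, h, w)
        return self.conv(torch.cat([image, c_map], dim=1))  # (B, 1, H, W)
```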

Task adaptation loss diagram

Conditional autoencoder loss diagram

In principle, either the adaptation or the autoencoder loss alone could be sufficient to learn a conditioned detector; however, as we show below, both losses are required for stable and efficient training.
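A hypothetical training step combining the two losses might look as follows, reusing the Encoder/Decoder sketch above. The pairing of two views and the per-pixel cross-entropy on Gaussian target heatmaps are assumptions for illustration, not necessarily the paper's exact objective.

```python
# Sketch of a combined training step; the loss form and view pairing are
# illustrative assumptions.
import torch.nn.functional as F

def training_step(encoder, decoder, img_a, kp_a, img_b, kp_b):
    # img_a, img_b: two views of the same scene, (B, 3, H, W).
    # kp_a, kp_b: the same physical point annotated in each view,
    # rendered as (B, 1, H, W) target heatmaps.
    c_tau = encoder(img_a, kp_a)  # infer the task (which point to track)

    # Task-adaptation loss: detect the inferred point in the *other* view.
    adapt_loss = F.binary_cross_entropy_with_logits(decoder(img_b, c_tau), kp_b)

    # Conditional autoencoder loss: reconstruct the annotation in the
    # *same* view it was inferred from.
    auto_loss = F.binary_cross_entropy_with_logits(decoder(img_a, c_tau), kp_a)

    return adapt_loss + auto_loss
```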

In this plot we show the RMS pixel error for 3 different training regimes, each using only a subset of the losses. Left: evaluated in the task adaptation setting. Right: evaluated in the conditional autoencoder setting. In both cases, training with a combination of the adaptation and autoencoder losses is beneficial. Further details about this experiment are available in the paper.

Class Generalization Visualisation by Latent Space Interpolation

To analyze the structure of the point identity latent space, we pick 7 embedding pairs based on random points on the first shoe and interpolate linearly between the pairs in the embedding space. The same embeddings correspond to similar locations on each shoe, showing that TACK has learned a consistent mapping across instances. This is an interesting property, as no cross-instance labels were provided during training: the model has learned a cross-instance-consistent embedding as a fully emergent property. The shoes presented are from a held-out set unseen during training.
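The interpolation itself is simply a linear blend in the 4-dimensional identity space; a minimal sketch (the decoder call and variable names are hypothetical):

```python
# Linearly blend two point identity embeddings, then decode each blend on a
# new shoe image to see where it lands.
import torch

def interpolate_embeddings(c_start, c_end, steps=8):
    """Return `steps` embeddings linearly spaced between c_start and c_end."""
    alphas = torch.linspace(0.0, 1.0, steps)
    return [(1 - a) * c_start + a * c_end for a in alphas]

# Usage sketch:
# for c_tau in interpolate_embeddings(c_a, c_b):
#     heatmap = decoder(shoe_image, c_tau)  # detect the blended identity
```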

Supplementary Video with Real Robot Experiments

In this video we present a summary of the approach as well as examples of real-world performance.

Saliency analysis

To further understand the features our model learns, we performed a saliency analysis of the decoder model on real data. This shows that the model does not focus purely on the detected point, but uses the entire geometry of the tracked object, which is a critical property for a robust detector. Further information is provided in the paper.
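As an illustration, a standard gradient-based saliency map for the decoder can be computed as below; this is a generic technique, and the paper's exact saliency method may differ.

```python
# Generic gradient-saliency sketch for the decoder (not necessarily the
# method used in the paper).
import torch

def decoder_saliency(decoder, image, c_tau):
    """Gradient magnitude of the peak detection logit w.r.t. input pixels."""
    image = image.clone().requires_grad_(True)
    logits = decoder(image, c_tau)                  # (B, 1, H, W)
    logits.flatten(1).max(dim=1).values.sum().backward()
    return image.grad.abs().sum(dim=1)              # (B, H, W) saliency map
```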