Joint-task Self-supervised Learning

for Temporal Correspondence

1Univerisity of California, Merced 2Nvidia 3Carnegie Mellon University

Neural Information Processing Systems (NeurIPS), 2019

Abstract

This paper proposes to learn reliable dense correspondence from videos in a self-supervised manner. Our learning process integrates two highly related tasks: tracking large image regions and establishing fine-grained pixel-level associations between consecutive video frames. We exploit the synergy between both tasks through a shared inter-frame affinity matrix, which simultaneously models transitions between video frames at both the region- and pixel-levels. While region-level localization helps reduce ambiguities in fine-grained matching by narrowing down search regions; fine-grained matching provides bottom-up features to facilitate region-level localization. Our method outperforms the state-of-the-art self-supervised methods on a variety of visual correspondence tasks, including video-object and part-segmentation propagation, keypoint tracking, and object tracking. Our self-supervised method even surpasses the fully-supervised affinity feature representation obtained from a ResNet-18 pre-trained on the ImageNet.

Results

Paper & Code

Pytorch Code: [link]

Paper: [link]

Bibtex

@inproceedings{uvc_2019,
    author = {Li, Xueting and Liu, Sifei and De Mellow, Shalini and Wang, Xiaolong and Kautz, Jan and Yang, Ming-Hsuan},
    title = {Joint-task Self-supervised Learning for Temporal Correspondence},
    booktitle = {NeurIPS},
    year = {2019}
}

Poster [link]