Learning From Unlabeled Videos

CVPR 2021 Workshop

June 20th, 1:50pm (EDT)

News & Updates

  • June 13, 2021: We have published the workshop schedule below.

  • Mar 28, 2021: We are extending the paper submission deadline to April 9, 2021.

  • Mar 4, 2021: Call for Papers has been released! https://groups.google.com/g/ml-news/c/CaEc7WQ78Ew

  • Jan 11, 2021: Site under construction. Please check back soon for more information.


Deep neural networks trained with a large number of labeled images have recently led to breakthroughs in computer vision. However, we have yet to see a similar level of breakthrough in the video domain. Why is this? Should we invest more in supervised learning, or do we need a different learning paradigm?

Unlike images, videos contain extra dimensions of information such as motion and sound. Recent approaches leverage such signals to tackle various challenging tasks in an unsupervised/self-supervised setting, e.g., learning to predict certain representations of future time steps in a video (RGB frames, semantic segmentation maps, optical flow, camera motion, and corresponding sound), learning spatio-temporal progression from image sequences, and learning audio-visual correspondences.
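
Many of these self-supervised approaches share a common ingredient: a contrastive objective that pulls embeddings of related clips (e.g., two augmented views of the same video, or a video clip and its audio) together while pushing unrelated pairs apart. As a rough illustration only, here is a minimal NumPy sketch of an InfoNCE-style loss; the batch size, embedding dimension, and temperature are arbitrary assumptions, not values from any of the works below.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE contrastive loss: each anchor embedding should match its
    positive (e.g., another clip from the same video) against all other
    positives in the batch, treated as an N-way classification problem."""
    # L2-normalize so the dot product is cosine similarity
    anchors = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    positives = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    # Pairwise similarity matrix, scaled by temperature
    logits = anchors @ positives.T / temperature
    # Numerically stable log-softmax over each row
    m = logits.max(axis=1, keepdims=True)
    log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    # The matching pair sits on the diagonal
    return -np.mean(np.diag(log_probs))

# Toy batch: 4 clips with 8-dim embeddings (illustrative sizes)
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
# Loss is near zero when anchor and positive embeddings coincide...
aligned = info_nce_loss(z, z, temperature=0.05)
# ...and higher for unrelated embedding pairs
random_pairs = info_nce_loss(z, rng.normal(size=(4, 8)))
print(f"aligned={aligned:.4f}, random={random_pairs:.4f}")
```

In practice the embeddings would come from a video encoder (and, for audio-visual correspondence, an audio encoder), but the objective itself takes this simple form.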

This workshop aims to promote comprehensive discussion around this emerging topic. We invite researchers to share their experiences and knowledge in learning from unlabeled videos, and to brainstorm brave new ideas that will potentially generate the next breakthrough in computer vision.


Schedule

(Eastern Time Zone, UTC−04:00)

  • 1:50-2:00 Welcome

  • 2:00-2:30 Invited Speaker 1: Andrew Zisserman. Self-Supervised Video Representation Learning and Beyond. [abstract]

  • 2:30-3:00 Invited Speaker 2: Angjoo Kanazawa. Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image. [abstract]

  • 3:00-3:30 Oral Session 1

                • 3:00 - 3:10 Learning to Segment Actions from Visual and Language Instructions via Differentiable Weak Sequence Alignment. Yuhan Shen, Lu Wang, Ehsan Elhamifar. [video][paper]

                • 3:10 - 3:20 Unsupervised object-centric video generation and decomposition in 3D. Paul Henderson, Christoph H Lampert. [video][paper]

                • 3:20 - 3:30 Object Priors for Classifying and Localizing Unseen Actions. Pascal Mettes, William Thong, Cees Snoek. [video][paper]

  • 3:30 - 4:00 Oral Session 2

                • Spatiotemporal Contrastive Video Representation Learning. Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, Yin Cui. [video][paper]

                • Unsupervised Action Segmentation for Instructional Videos. AJ Piergiovanni, Anelia Angelova, Michael S Ryoo, Irfan Essa. [video][paper]

                • Action Segmentation via Transcript-Aware Union-of-Subspaces Learning. Zijia Lu, Ehsan Elhamifar. [video][paper]

  • 4:00 - 4:30 Invited Speaker 3: Bryan Russell. On Being A Couch Potato. [abstract]

  • 4:30 - 5:00 Invited Speaker 4: Shuran Song. In LUV with Interactions.

  • 5:00 - 5:30 Invited Speaker 5: Deepak Pathak. Unifying Perception and Control through Video. [abstract]

  • 5:30 - 6:00 Oral Session 3

                • Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning. Sangho Lee, Jiwan Chung, Youngjae Yu, Gunhee Kim, Thomas Breuel, Gal Chechik, Yale Song. [video][paper]

                • Contrastive Learning of Global and Local Audio-Visual Representations. Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song. [video][paper]

                • Learning by Aligning Videos in Time. Sanjay Haresh, Sateesh Kumar, Huseyin Coskun, Shahram Najam Syed, Andrey Konin, Zeeshan Zia, Quoc-Huy Tran. [video][paper]

                • Multi-Object Tracking with Hallucinated and Unlabeled Videos. Daniel McKee, Bing Shuai, Andrew G Berneshawi, Manchen Wang, Davide Modolo, Svetlana Lazebnik, Joseph Tighe. [video][paper]

  • 6:00 - 6:10 Closing Remarks

Invited Speakers

Andrew Zisserman

University of Oxford

Angjoo Kanazawa

UC Berkeley

Bryan Russell

Adobe Research

Shuran Song

Columbia University

Deepak Pathak

Carnegie Mellon University

Organizers

Yale Song

Microsoft Research

Mia Chiquier

Columbia University

Sachit Menon

Columbia University

Carl Vondrick

Columbia University

Anelia Angelova

Google Research

Honglak Lee

University of Michigan/LG AI Research

Kristen Grauman

UT Austin/FAIR