Structured Representations for Video Understanding

ICCV 2021 workshop

News and updates

November 26th 2021: The presentations, keynotes, and the panel are on the website now.

October 16th 2021: The papers are on the website, see you at the workshop!

September 24th 2021: The paper decisions have been made and the schedule will be finalized soon.

July 20th 2021: The submission deadline is August 27th. Check submission instructions here.

April 12th 2021: The workshop has been accepted for ICCV 2021.

Workshop schedule

9:00 - 9:10 | Opening

9:10 - 9:40 | Session 1: Learning person and object relations in videos

9:40 - 10:10 | Keynote 1: Trevor Darrell [video]

10:10 - 11:00 | Session 2: Self-supervised and unsupervised video representation learning

11:00 - 11:30 | Keynote 2: Kristen Grauman [video]

11:30 - 12:00 | Session 3: Advances in tracking and segmentation

12:00 - 13:50 | Lunch and discussion with authors

13:50 - 14:20 | Keynote 3: Josef Sivic [video]

14:20 - 14:40 | Session 4: Video understanding at small scales

14:40 - 15:10 | Keynote 4: Deva Ramanan [video]

15:10 - 15:30 | Session 5: Video understanding with multiple modalities

15:30 - 16:30 | Keynote panel with Deva Ramanan and Josef Sivic [video]

Overview

Finding the intrinsic structure of a video is an open research problem. Research in video understanding has found a great deal of success through convolutional solutions trained on large-scale datasets. Building on this success, a growing body of literature has extended convolutional approaches with additional structure for enhanced or more general understanding. Examples include discovering the scene graph of a video, embedding videos on non-Euclidean manifolds, learning representations from unlabeled videos, and incorporating prior knowledge in video representation learning to better infer seen and unseen action labels. This workshop seeks to open up the discussion on how to learn and impose structure in video understanding. Embedding structure will not only increase our fundamental understanding of videos but also benefit a wide range of downstream applications, from action recognition to precise localization and long-term reasoning or forecasting.

Invited speakers

Kristen Grauman

University of Texas at Austin

Deva Ramanan

Carnegie Mellon University

Josef Sivic

INRIA / Czech Technical University

Trevor Darrell

UC Berkeley

Organizers

Pascal Mettes

University of Amsterdam

Carl Vondrick

Columbia University

Dídac Surís

Columbia University

Hazel Doughty

University of Amsterdam

Mike Shou

NUS Singapore

Shih-Fu Chang

Columbia University

Cordelia Schmid

INRIA / Google