Workshop papers

Session I (9:10 - 9:40)

Learning person and object relations in videos

  1. Spatio-Temporal Dynamic Inference Network for Group Activity Recognition
     Hangjie Yuan (Zhejiang University), Dong Ni (Zhejiang University), Mang Wang (Alibaba Group)
    [paper] [slides] [video]

  2. Learning Higher-order Object Interactions for Keypoint-based Video Understanding
     Yi Huang (Virginia Tech), Asim Kadav (NEC Labs), Farley Lai (NEC Laboratories America, Inc.), Deep Patel (NEC Laboratories America), Hans Peter Graf (NEC Labs)
    [paper] [slides] [video]

  3. Social Fabric: Tubelet Compositions for Video Relation Detection
     Shuo Chen (University of Amsterdam), Zenglin Shi (University of Amsterdam), Pascal Mettes (University of Amsterdam), Cees Snoek (University of Amsterdam)
    [paper] [slides] [video]

Session II (10:10 - 11:00)

Self-supervised and unsupervised learning in videos

  1. Video Autoencoder: self-supervised disentanglement of static 3D structure and motion
     Zihang Lai (CMU), Sifei Liu (NVIDIA), Alexei Efros (UC Berkeley), Xiaolong Wang (UCSD)
    [paper] [slides] [video]

  2. Compositional Video Synthesis with Action Graphs
     Amir Bar (Tel Aviv University), Roei Herzig (Tel Aviv University), Xiaolong Wang (Carnegie Mellon University), Anna Rohrbach (UC Berkeley), Gal Chechik (Bar Ilan University), Trevor Darrell (UC Berkeley), Amir Globerson (Tel Aviv University, Google)
    [paper] [slides] [video]

  3. Breaking Shortcut: Exploring Fully Convolutional Cycle-Consistency for Video Correspondence Learning
     Yansong Tang (University of Oxford), Zhenyu Jiang (University of Texas at Austin), Zhenda Xie (Tsinghua University), Yue Cao (Microsoft Research Asia), Zheng Zhang (MSRA, Huazhong University of Science and Technology), Philip Torr (University of Oxford), Han Hu (Microsoft Research Asia)
    [paper] [slides] [video]

  4. Composable Augmentation Encoding for Video Representation Learning
     Chen Sun (Google)
    [paper] [slides] [video]

  5. TCLR: Temporal Contrastive Learning for Video Representations
     Ishan Rajendrakumar Dave (University of Central Florida), Rohit Gupta (University of Central Florida), Mamshad Nayeem Rizve (University of Central Florida), Mubarak Shah (University of Central Florida)
    [paper] [slides] [video]

Session III (11:30 - 12:00)

Advances in video tracking and segmentation

  1. Rethinking Self-Supervised Correspondence Learning: A Video Frame-level Similarity Perspective
     Jiarui Xu (University of California San Diego), Xiaolong Wang (UCSD)
    [paper] [slides] [video]

  2. Learning event representations for temporal segmentation of image sequences by dynamic graph embedding
     Mariella Dimiccoli (CSIC-UPC), Herwig Wendt (IRIT-ENSEEIHT, CNRS, University of Toulouse)
    [paper] [slides] [video]

  3. SFTrack++: A Fast Learnable Spectral Segmentation Approach for Space-Time Consistent Tracking
     Elena Burceanu (Bitdefender, University of Bucharest), Marius Leordeanu (University "Politehnica" of Bucharest)
    [paper] [slides] [video]

Session IV (14:20 - 14:40)

Video representation learning at small scales

  1. Spatio-Temporal Video Representation Learning for AI Based Video Playback Style Prediction
     Gaurav Ramola (Samsung Research Institute, Bangalore), Rishubh Parihar (Indian Institute of Technology, Delhi), Ranajit Saha (Microsoft Corporation, Hyderabad, India), Raviprasad Kini (Samsung India), Aniket Rege (Samsung India), Sudha Velusamy (Samsung Research Institute, Bangalore)
    [paper] [slides] [video]

  2. Motion-Augmented Self-Training for Video Recognition at Smaller Scale
     Kirill Gavrilyuk (University of Amsterdam), Mihir Jain (Qualcomm AI Research), Ilia Karmanov (Qualcomm Research), Cees Snoek (University of Amsterdam)
    [paper] [slides] [video]

Session V (15:10 - 15:30)

Understanding videos with multiple modalities

  1. Semantic Role Aware Correlation Transformer For Text To Video Retrieval
     Burak Satar (NTU), Hongyuan Zhu (Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore), Xavier Bresson (NUS), Joo-Hwee Lim (Institute for Infocomm Research)
    [paper] [slides] [video]

  2. Vision-Guided Forecasting - Visual Context for Multi-Horizon Time Series Forecasting
     Eitan Kosman (Technion), Dotan Di Castro (Bosch AI)
    [paper] [slides] [video]