The following page provides talk titles and abstracts from the 2019 AI Video Summit. Some speakers elected not to provide an abstract, so descriptions of those talks are not included below.
Philipp Krähenbühl, Assistant Professor, Department of Computer Science, University of Texas at Austin
Title: On modeling time
Abstract: In this talk, I'll present our most recent work on long-term temporal modeling for video action recognition. Our model is able to reason about minutes to hours of video at once and, until recently, matched the state of the art on several video recognition tasks. Towards the end of the talk, I'll dive into the technical details of our model and show how little use it (and most state-of-the-art models) actually makes of time. I'll present some simple variants of state-of-the-art models that are completely oblivious to time (specifically, to temporal order) and lose no performance.
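To make the "oblivious to temporal order" point concrete, here is a minimal sketch in Python (PyTorch) of the kind of ablation one can run on any clip-level classifier; the ToyClassifier and all shapes are hypothetical stand-ins, not the speaker's models:

import torch
import torch.nn as nn

class ToyClassifier(nn.Module):
    """Stand-in for a clip-level video model; it averages over time,
    so it is oblivious to frame order by construction."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.backbone = nn.Conv3d(3, 16, kernel_size=3, padding=1)
        self.head = nn.Linear(16, num_classes)

    def forward(self, clip):                                 # clip: (B, 3, T, H, W)
        feats = self.backbone(clip).mean(dim=(2, 3, 4))      # global average pool
        return self.head(feats)

def temporal_ablation(model, clip):
    """Compare predictions on the original, time-reversed, and shuffled clip."""
    with torch.no_grad():
        orig = model(clip)
        rev = model(clip.flip(dims=[2]))                     # reverse the time axis
        perm = torch.randperm(clip.shape[2])
        shuf = model(clip[:, :, perm])                       # shuffle the time axis
    return orig, rev, shuf

model = ToyClassifier()
clip = torch.randn(1, 3, 16, 32, 32)
o, r, s = temporal_ablation(model, clip)
# Near-zero differences indicate the model ignores temporal order.
print((o - r).abs().max().item(), (o - s).abs().max().item())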
Lubomir Bourdev, Co-founder and CEO, WaveOne, Inc.
Title: Learned Video Compression
Abstract: Video compression and analytics are of critical importance, as video content comprises more than 75% of internet traffic and is projected to grow rapidly. Today's video compression algorithms are based on representations manually designed by experts and hard-coded in hardware. We believe ML-derived video representations will be key to revolutionizing compression. I will present our work on video compression learned for the low-latency mode (where the encoder has no access to future frames). Ours is the first end-to-end ML-based method for the low-latency mode, and it outperforms the standard codecs by a large margin on almost the entire rate-distortion (RD) curve.
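For readers unfamiliar with the low-latency constraint, the sketch below (my own toy layers, not WaveOne's codec) shows what it means operationally: each frame is coded conditioned only on previously reconstructed frames, never on future ones.

import torch
import torch.nn as nn

class TinyCodec(nn.Module):
    """Toy frame codec conditioned on the previous reconstruction."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(6, 8, kernel_size=4, stride=4)       # current frame + previous recon
        self.decoder = nn.ConvTranspose2d(8, 3, kernel_size=4, stride=4)

    def forward(self, frame, prev_recon):
        code = self.encoder(torch.cat([frame, prev_recon], dim=1))    # latent to be entropy-coded
        return self.decoder(code), code

codec = TinyCodec()
video = torch.randn(5, 1, 3, 64, 64)              # T frames of a toy video
prev = torch.zeros(1, 3, 64, 64)
for t, frame in enumerate(video):                 # causal loop: no access to future frames
    recon, code = codec(frame, prev)
    prev = recon.detach()                         # decoder state carried forward in time
    print(t, code.numel())                        # size of the latent for this frame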
Cordelia Schmid, Research Director, INRIA; Research Scientist, Google AI
Title: High-level structured representations for action recognition
Vladlen Koltun, Senior Principal Researcher and Director of the Intelligent Systems Lab, Intel
Title: Does Computer Vision Matter for Action?
Abstract: Computer vision produces representations of scene content. Much computer vision research is predicated on the assumption that these intermediate representations are useful for action. Recent work at the intersection of machine learning and robotics calls this assumption into question by training sensorimotor systems directly for the task at hand, from pixels to actions, with no explicit intermediate representations. Thus the central question of our work: Does computer vision matter for action? We probe this question and its offshoots via immersive simulation, which allows us to conduct controlled reproducible experiments at scale. We instrument immersive three-dimensional environments to simulate challenges such as urban driving, off-road trail traversal, and battle. Our main finding is that computer vision does matter. Models equipped with intermediate representations train faster, achieve higher task performance, and generalize better to previously unseen environments.
Kristen Grauman, Research Scientist, Facebook AI; Professor, Department of Computer Science, University of Texas at Austin
Title: Eyes and Ears: Learning to Disentangle Sounds in Unlabeled Video
Abstract: Understanding scenes and events is inherently a multi-modal experience: we perceive the world by both looking and listening. In this talk, I will present our recent work on learning audio-visual models from unlabeled video. A key challenge is that typical videos capture object sounds not as separate entities, but as a single audio channel that mixes all their frequencies together and obscures their spatial layout. Considering audio as a source of both semantic and spatial information, we explore learning multi-modal models from real-world video containing multiple sound sources. In particular, I will introduce new methods for visually guided audio source separation and “2.5D visual sound”, which lifts monaural audio into its immersive binaural counterpart via the visual video stream.
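As an illustration of the "2.5D visual sound" idea, here is a minimal sketch in Python; all layer sizes, shapes, and the mono = (left + right) / 2 convention are my own assumptions, not the talk's architecture. A visual embedding and the mono mixture are fused to predict a left/right difference signal, from which the two binaural channels are reconstructed.

import torch
import torch.nn as nn

class MonoToBinaural(nn.Module):
    """Toy mono-to-binaural predictor; not the speaker's model."""
    def __init__(self, visual_dim=128):
        super().__init__()
        self.audio_net = nn.Conv1d(1, 16, kernel_size=9, padding=4)
        self.fuse = nn.Conv1d(16 + visual_dim, 1, kernel_size=9, padding=4)

    def forward(self, mono, visual_feat):
        # mono: (B, 1, T) waveform; visual_feat: (B, visual_dim) per-clip visual embedding
        a = self.audio_net(mono)
        v = visual_feat[:, :, None].expand(-1, -1, a.shape[-1])   # broadcast over time
        diff = self.fuse(torch.cat([a, v], dim=1))                # predicted left-minus-right signal
        left = mono + 0.5 * diff                                  # assumes mono = (left + right) / 2
        right = mono - 0.5 * diff
        return left, right

model = MonoToBinaural()
left, right = model(torch.randn(2, 1, 16000), torch.randn(2, 128))
print(left.shape, right.shape)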
Andrew Zisserman, Professor, Department of Engineering Science, University of Oxford
Title: Multi-modal self-supervised learning
Abstract: The talk will review the various methods of audio-visual multi-modal self-supervision - correspondence and synchrony - in video, and describe how these can be employed for learning to localize the source of a sound in a video, and for fine-grained categorization.
Louis-Philippe Morency, Leonardo Associate Professor, Department of Computer Science, Carnegie Mellon University
Title: Multimodal AI: Understanding Human Behaviors
Abstract: Human face-to-face communication is a little like a dance, in that participants continuously adjust their behaviors based on verbal and nonverbal cues from the social context. Today's computers and interactive devices still lack many of the human-like abilities needed to hold fluid and natural interactions. Leveraging recent advances in machine learning, audio-visual signal processing and computational linguistics, my research focuses on creating computational technologies able to analyze, recognize and predict subtle human communicative behaviors in social context. Central to this research effort is the introduction of new probabilistic models able to learn the temporal and fine-grained latent dependencies across behaviors, modalities and interlocutors. In this talk, I will present some of our recent achievements in modeling multiple aspects of human communication dynamics, motivated by applications in healthcare (depression, PTSD, suicide, autism), education (learning analytics), business (negotiation, interpersonal skills) and social multimedia (opinion mining, social influence).
Andrew Owens, Post-Doctoral Fellow, University of California, Berkeley
Title: Learning Sight from Sound
Abstract: Today's visual learning methods require extensive supervision from human teachers. A major goal of the research community has been to remove the need for this supervision by creating methods that, instead, teach themselves by analyzing unlabeled images. In this talk, I will argue that this focus on learning from vision alone, without the use of other sensory modalities, is making the perception problem unnecessarily difficult. I will present computer vision methods for learning from co-occurring audio and visual signals. First, I'll discuss our work on using self-supervision to learn multisensory video representations. Second, I'll talk about our work on predicting conversational gestures, in the form of arm/hand motion, from speech audio.
Carl Vondrick, Assistant Professor, Department of Computer Science, Columbia University
Title: Tracking Emerges from Video Colorization
Abstract: We use large amounts of unlabeled video to learn models for visual tracking without manual human supervision. We leverage the natural temporal coherency of color to create a model that learns to colorize gray-scale videos by copying colors from a reference frame. Quantitative and qualitative experiments suggest that this task causes the model to automatically learn to track visual regions. Although the model is trained without any ground-truth labels, our method learns to track well enough to outperform the latest methods based on optical flow. Moreover, our results suggest that failures to track are correlated with failures to colorize, indicating that advancing video colorization may further improve self-supervised visual tracking.
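A minimal sketch of the color-copying mechanism described above, with assumed shapes and names rather than the paper's code: colors for a gray target frame are copied from a reference frame using attention weights computed from learned embeddings, and the same weights can later propagate labels, i.e. track.

import torch
import torch.nn.functional as F

def copy_colors(ref_emb, tgt_emb, ref_color, temperature=0.5):
    """ref_emb, tgt_emb: (N, D) per-pixel embeddings; ref_color: (N, C) reference colors."""
    logits = tgt_emb @ ref_emb.t() / temperature   # (N_tgt, N_ref) embedding similarities
    attn = F.softmax(logits, dim=1)                # where each target pixel copies from
    return attn @ ref_color                        # predicted colors; at test time the same
                                                   # attention propagates segmentation labels

ref_emb = torch.randn(100, 64)     # e.g. a 10x10 reference frame with 64-dim embeddings
tgt_emb = torch.randn(100, 64)     # embeddings of the gray target frame
ref_color = torch.rand(100, 2)     # e.g. quantized ab color channels of the reference
pred = copy_colors(ref_emb, tgt_emb, ref_color)
print(pred.shape)                  # torch.Size([100, 2])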
Alexei A. Efros, Professor, Electrical Engineering and Computer Science, University of California, Berkeley
Title: Learning Correspondence from the Cycle-Consistency of Time
Abstract: We introduce a self-supervised method for learning visual correspondence from unlabeled video. The main idea is to use cycle-consistency in time as a free supervisory signal for learning visual representations from scratch. At training time, our model optimizes a spatial feature representation to be useful for performing cycle-consistent tracking. At test time, we use the acquired representation to find nearest neighbors across space and time.
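A minimal sketch of the cycle-consistency signal, using a toy differentiable tracker (all names, shapes, and the soft-argmax tracker itself are my own assumptions): track a query backward through a short clip, then forward again, and penalize the distance between where you started and where you land.

import torch

def soft_argmax_track(query, frame_feats):
    """Toy differentiable 'tracker': attention over a feature grid returns the
    expected (x, y) location of the query plus the feature sampled there.
    frame_feats: (H*W, D) per-frame feature grid; query: (D,)."""
    H = W = int(frame_feats.shape[0] ** 0.5)
    attn = torch.softmax(frame_feats @ query, dim=0)                     # (H*W,)
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    coords = torch.stack([xs.reshape(-1), ys.reshape(-1)], dim=1)        # (H*W, 2)
    loc = attn @ coords                                                  # expected location
    feat = attn @ frame_feats                                            # feature at that location
    return loc, feat

# Walk backward through a short clip, then forward, and measure the cycle error.
feats = [torch.randn(64, 32, requires_grad=True) for _ in range(4)]      # per-frame 8x8 grids
query = torch.randn(32)
start_loc, f = soft_argmax_track(query, feats[-1])
for grid in reversed(feats[:-1]):          # backward in time
    _, f = soft_argmax_track(f, grid)
for grid in feats[1:]:                     # forward again
    loc, f = soft_argmax_track(f, grid)
cycle_loss = (loc - start_loc).pow(2).sum()   # free supervisory signal, no labels needed
cycle_loss.backward()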
Michael Black, Founding Director, Max Planck Institute; Distinguished Amazon Scholar, Amazon
Title: Expressive human body models for communication and interaction
Abstract: Human bodies in computer vision have often been an afterthought. Human pose is typically represented by 10-12 body joints in 2D or 3D. The joints of the body, however, do not capture all that we need to understand human behavior. In our work we have focused on 3D body shape, represented as a triangulated mesh. Shape gives us information about a person related to their health, age, fitness, and clothing size. But shape is also useful because our body surface is critical for our physical interactions with the world: we cannot interpenetrate objects, and we have to make contact to manipulate the world. Consequently, we developed the SMPL 3D body model, which is widely used in academia and industry. It is compact, posable, and compatible with most graphics packages. It is also differentiable and easy to integrate into optimization or deep learning methods. While popular, SMPL has drawbacks for representing human actions and interactions: specifically, the face does not move and the hands are rigid. To facilitate the analysis of human actions, interactions and emotions, we have trained a new expressive 3D model of the human body, SMPL-X, with fully articulated hands and an expressive face, using thousands of 3D scans. We estimate the parameters of SMPL-X directly from images. Specifically, we estimate 2D image features bottom-up and then optimize the SMPL-X model parameters to fit the features top-down. In related work, we address hand-object interaction by training a neural network to simultaneously regress hand and object pose and shape. A key novelty is a loss function that enforces physical constraints on contact and penetration. These methods represent a step towards automatic expressive human capture from monocular RGB data.
Angjoo Kanazawa, Post Doctoral Fellow, Berkeley AI Research, University of California, Berkeley
Title: Perceiving Humans in the 3D World from Video
Abstract: As we approach a society where intelligent systems and humans coexist, it is increasingly important that these systems are able to interact with humans through rich visual sensors such as cameras. Interaction requires visual understanding of people: where they are, what they have been doing, and what they will do. In this short talk, I will go over our recent approach to recovering a 3D human mesh model from video. I will discuss how this opens up new directions, such as predicting the future 3D motion of a human from a single image or a video. Recovering motion is also useful for visual imitation, where 3D perception of human video allows us to train a simulated character to learn to act by watching YouTube videos.
Hanbyul Joo, Research Scientist, Facebook AI
Title: Measuring and Modeling Social Behaviors for Machines
Abstract: Humans convey their thoughts, emotions, and intentions through a concert of social displays: voice, facial expressions, hand gestures, and body posture, collectively referred to as social signals. Despite advances in machine perception, machines are unable to discern the subtle and momentary nuances that carry so much of the information and context of human communication. A major obstacle to scientific progress in this direction is the inability to sense and measure the broad spectrum of behavioral cues in groups of interacting individuals, which hinders applying computational methods to model and understand social signals. In this talk, I will share our efforts in measuring and modeling nonverbal social behaviors. I will introduce our new 3D motion capture dataset, capturing hundreds of triadic interactions in the CMU Panoptic Studio, and discuss the social signal prediction problem as a way to model social behaviors. I will also talk about our monocular total-body motion capture method, which aims to measure social behaviors in Internet videos.
Cees Snoek, Professor, Computer Science, University of Amsterdam
Title: Two-in-One Stream Action Detection
Abstract: The two-stream network based on RGB and flow provides robust action detection and classification accuracy at the expense of a large model size and heavy computation. We propose to embed RGB and optical flow into a single two-in-one stream network with new layers. A motion condition layer extracts motion information from flow images, which is leveraged by the motion modulation layer to generate transformation parameters for modulating the low-level RGB features. The method is easily embedded in existing appearance- or two-stream action detection and classification networks, and trained end-to-end. Experiments demonstrate that leveraging the motion condition to modulate RGB features improves accuracy. With only half the computation and parameters of state-of-the-art two-stream methods, our two-in-one stream still achieves impressive results. Joint work with Jiaojiao Zhao.
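A minimal sketch of the modulation idea (assumed layer shapes, not the authors' implementation): a motion-condition branch reads flow images and produces per-channel scale and shift parameters that modulate low-level RGB features inside a single stream.

import torch
import torch.nn as nn

class MotionModulatedBlock(nn.Module):
    """Toy single-stream block whose RGB features are modulated by flow."""
    def __init__(self, channels=32):
        super().__init__()
        self.rgb_conv = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        self.motion_condition = nn.Sequential(            # reads 2-channel optical flow
            nn.Conv2d(2, channels, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d(1))
        self.to_gamma = nn.Linear(channels, channels)     # per-channel scale
        self.to_beta = nn.Linear(channels, channels)      # per-channel shift

    def forward(self, rgb, flow):
        feats = self.rgb_conv(rgb)                                # (B, C, H, W)
        cond = self.motion_condition(flow).flatten(1)             # (B, C) motion condition
        gamma = self.to_gamma(cond)[:, :, None, None]
        beta = self.to_beta(cond)[:, :, None, None]
        return feats * (1 + gamma) + beta                         # motion-modulated RGB features

block = MotionModulatedBlock()
out = block(torch.randn(2, 3, 56, 56), torch.randn(2, 2, 56, 56))
print(out.shape)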
Juan Carlos Niebles, Senior Research Scientist, Stanford University; Associate Professor, Universidad del Norte
Title: Human Event Understanding: From Actions to Tasks
Svetlana Lazebnik, Associate Professor, Department of Computer Science, University of Illinois at Urbana-Champaign
Title: Visual Relationship Detection and Scene Graph Generation
Abstract: In this talk, I will address the problem of Visual Relationship Detection (VRD), which focuses on understanding interactions between pairs of object entities in the image. I will describe our new approach to this problem that achieves strong performance both on common and previously unseen relations, as well as the application of this approach to the recently proposed task of scene graph generation.
Bryan Russell, Research Scientist, Adobe Systems
Title: Bounce and Learn: Modeling Scene Dynamics with Real-World Bounces
Abstract: In this talk I will describe an approach to model surface properties governing bounces in everyday scenes. Our model learns end-to-end, starting from sensor inputs, to predict post-bounce trajectories and infer two underlying physical properties that govern bouncing: restitution and effective collision normals. Our model, Bounce and Learn, comprises two modules: a Physics Inference Module (PIM) and a Visual Inference Module (VIM). VIM learns to infer physical parameters for locations in a scene given a single still image, while PIM learns to model physical interactions for the prediction task given physical parameters and observed pre-collision 3D trajectories. To demonstrate our results, we introduce the Bounce Dataset, comprising 5K RGB-D videos of bouncing trajectories of a foam ball used to probe surfaces of varying shapes and materials in everyday scenes, including homes and offices.
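For context on the two physical properties mentioned above, the sketch below works through the standard rigid-body bounce relation they parameterize (textbook physics, not the paper's learned model): given a pre-collision velocity, an effective collision normal, and a restitution coefficient, the normal component of the velocity is reversed and damped while the tangential component is kept.

import numpy as np

def post_bounce_velocity(v, n, restitution):
    """v: pre-collision velocity; n: effective collision normal; restitution in [0, 1]."""
    n = n / np.linalg.norm(n)                     # unit collision normal
    v_normal = np.dot(v, n) * n                   # component into the surface
    v_tangent = v - v_normal                      # component along the surface
    return v_tangent - restitution * v_normal     # normal part reversed and damped

v_in = np.array([2.0, 0.0, -3.0])                 # falling while moving forward
n_up = np.array([0.0, 0.0, 1.0])                  # flat floor
print(post_bounce_velocity(v_in, n_up, restitution=0.7))   # [2.  0.  2.1]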
Ivan Laptev, Research Director, INRIA
Title: Towards Embodied Action Understanding
Abstract: Computer vision has come a long way towards automatic labeling of objects, scenes and human actions in visual data. While this recent progress already powers applications such as visual search and autonomous driving, visual scene understanding remains an open challenge beyond specific applications. In this talk I will outline limitations of human-defined labels and will argue for a task-driven approach to scene understanding. Towards this goal, I will describe our recent efforts on learning visual models from narrated instructional videos. I will present methods for automatic discovery of actions and object states associated with specific tasks such as changing a car tire or making coffee. Along these lines, I will describe a state-of-the-art method for text-based video search using our recent dataset of 100M automatically collected narrated videos. Finally, I will present our work on visual scene understanding for real robots, where we train agents to discover sequences of actions that achieve particular tasks.
Christoph Feichtenhofer, Research Scientist, Facebook AI
Title: SlowFast Networks for Video Recognition
Abstract: We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at a low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at a high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet it can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements can be pinpointed as contributions of our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks: Kinetics, Charades, and AVA.
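A minimal sketch of the two-pathway layout (toy layer sizes, not the paper's ResNet backbones): the Slow pathway sees few frames with many channels, the Fast pathway sees many frames with few channels, and a lateral connection fuses Fast into Slow.

import torch
import torch.nn as nn

class TinySlowFast(nn.Module):
    """Toy two-pathway network; alpha is the frame-rate ratio, beta the channel ratio."""
    def __init__(self, alpha=8, beta=8, num_classes=10):
        super().__init__()
        self.alpha = alpha
        self.slow = nn.Conv3d(3, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.fast = nn.Conv3d(3, 64 // beta, kernel_size=3, padding=1)
        self.lateral = nn.Conv3d(64 // beta, 64, kernel_size=(5, 1, 1),
                                 stride=(alpha, 1, 1), padding=(2, 0, 0))
        self.head = nn.Linear(64 + 64 // beta, num_classes)

    def forward(self, clip):                                  # clip: (B, 3, T, H, W)
        slow_in = clip[:, :, :: self.alpha]                   # temporally strided input
        s = self.slow(slow_in)                                # few frames, many channels
        f = self.fast(clip)                                   # all frames, few channels
        s = s + self.lateral(f)                               # fuse Fast into Slow
        pooled = torch.cat([s.mean(dim=(2, 3, 4)), f.mean(dim=(2, 3, 4))], dim=1)
        return self.head(pooled)

model = TinySlowFast()
print(model(torch.randn(1, 3, 32, 56, 56)).shape)             # torch.Size([1, 10])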
Joao Carreira, Research Scientist, Google DeepMind
Title: Kinetics-700, Visual Centrifuges and Action Transformers
Abstract: In this talk I will first present the latest version of the Kinetics dataset that has 700 classes. I will also discuss recent work that aims to improve the generalization and data efficiency of video models by factoring out nuisance factors such as transparency and reflections. Finally I will talk about our Video Action Transformer model which uses self-attention for spatiotemporal action localization.
Lorenzo Torresani, Research Scientist, Facebook AI; Associate Professor, Computer Science Department, Dartmouth
Title: Action Recognition with Channel Separation and Salient-Clip Sampling
Abstract: In this talk I will present two orthogonal approaches designed to improve both the efficiency and the accuracy of action recognition systems. The first hinges on a new architecture that interleaves channel-separated 3D convolutions with point-wise convolutions across channels. Channel separation dramatically reduces the number of parameters and the computational cost, while the point-wise convolutions efficiently restore the lost channel interactions. The second approach uses a lightweight “clip-sampling” model to efficiently identify the most salient clips within a long video. By invoking recognition only on these most salient clips, the computational cost of action classification on untrimmed videos is dramatically reduced and the accuracy is improved. Large gains can be obtained by combining these two approaches, e.g., a 30x speedup and an 8% accuracy boost over the state of the art on Sports1M.
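A minimal sketch of a channel-separated block of the kind the first approach describes (toy sizes, and the exact block ordering in the talk's architecture may differ): a depthwise 3D convolution handles space-time within each channel, a 1x1x1 point-wise convolution restores interactions across channels, and the parameter count is compared against a dense 3D convolution.

import torch
import torch.nn as nn

class ChannelSeparatedBlock(nn.Module):
    """Toy residual block with channel-separated space-time convolution."""
    def __init__(self, channels=64):
        super().__init__()
        self.depthwise = nn.Conv3d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels)   # one 3D filter per channel
        self.pointwise = nn.Conv3d(channels, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                      # x: (B, C, T, H, W)
        return self.relu(self.pointwise(self.depthwise(x)) + x)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

block = ChannelSeparatedBlock()
dense = nn.Conv3d(64, 64, kernel_size=3, padding=1)
print(n_params(block), "vs", n_params(dense))    # far fewer parameters than a dense 3D conv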
Chen Sun, Senior Research Scientist, Google
Title: Learning Temporal Representations for Long Videos
Abstract: We present two recent works on learning temporal representations for long videos without requiring additional human annotations. In particular, inspired by its recent success in language modeling, we build upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively. We apply this model to a number of tasks, including zero-shot action classification and video captioning. We then introduce a technique based on noise contrastive estimation which removes the quantization step, thus offering more accurate representations for fine-grained recognition tasks.
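A minimal sketch of the quantization step that produces visual tokens (hypothetical codebook and shapes, not the talk's model): continuous per-clip features are assigned to their nearest codebook centroid, yielding discrete ids that a BERT-style model can consume alongside speech-recognition tokens.

import torch

def quantize(features, codebook):
    """features: (N, D) clip embeddings; codebook: (K, D) centroids -> (N,) token ids."""
    dists = torch.cdist(features, codebook)      # pairwise L2 distances to centroids
    return dists.argmin(dim=1)                   # index of the nearest centroid

codebook = torch.randn(1024, 128)                # e.g. centroids obtained from k-means
features = torch.randn(6, 128)                   # one embedding per video clip
visual_tokens = quantize(features, codebook)
print(visual_tokens.tolist())                    # discrete "visual words", like word ids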
Bernard Ghanem, Associate Professor, King Abdullah University of Science and Technology
Title: The State of Temporal Activity Detection in Untrimmed Video
Abstract: In this talk, I will give a brief summary of the current state of methods and performance for the task of temporal human activity detection in untrimmed video. For this summary, I will specifically focus on insights and findings brought about by the ActivityNet benchmark and challenge that has been hosted at CVPR since 2016. I will also highlight some persisting issues that require our attention, and in doing so, shed light on directions for future research in activity detection.