Program

Day and Time:

20 June 2022, Room 224.

Poster boards: 227a-7b, Halls D-E Lobby


Program:

This is the tentative program and may be subject to change. Event times shown in the schedule are local times in New Orleans, LA (GMT-5).

Invited Talks

Naturally Limited Videos of Fine-Grained Actions - How to Collect and Learn from Long-Tailed Videos

In this talk, I’ll present the case for collecting unscripted video datasets in their native environments, introducing naturally long-tailed datasets. Using such a resource, I will present my group’s approaches to zero-shot action retrieval [ICCV 2019], few-shot recognition [CVPR 2020], domain adaptation [CVPR 2020, arXiv] and unsupervised learning [CVPR 2022].


Bio:

Dima Damen is a Professor of Computer Vision at the University of Bristol. Dima is currently an EPSRC Fellow (2020-2025), focusing her research on the automatic understanding of object interactions, actions and activities using wearable visual (and depth) sensors. She has contributed to novel research questions including assessing action completion, skill/expertise determination from video sequences, discovering task-relevant objects, dual-domain and dual-time learning, as well as multi-modal fusion using vision, audio and language. She is the project lead for EPIC-KITCHENS, the largest dataset in egocentric vision, with accompanying open challenges. She also leads the EPIC annual workshop series alongside major conferences (CVPR/ICCV/ECCV). Dima was a program chair for ICCV 2021 and is an associate editor of IJCV, IEEE TPAMI and Pattern Recognition. She was selected as a Nokia Research collaborator in 2016, and as an Outstanding Reviewer at CVPR 2021, CVPR 2020, ICCV 2017, CVPR 2013 and CVPR 2012. Dima received her PhD from the University of Leeds (2009), joined the University of Bristol as a Postdoctoral Researcher (2010-2012), became an Assistant Professor (2013-2018) and then an Associate Professor (2018-2021), and was appointed to a chair in August 2021. She supervises 9 PhD students and 5 postdoctoral researchers.

Modular generative neural scene representations

Current supervised visual detectors, though impressive within their training distribution, often fail to parse out-of-distribution scenes into their constituent entities. They are feedforward in nature, in stark contrast to the extensive top-down feedback connections in the visual cortex. We propose scene representations that encode images into 3-dimensional, slot-centric neural bottlenecks and decode back the same image, or the image from an alternative viewpoint. They are trained semi-supervised from object annotations together with autoencoding or view-prediction objectives. We show their generalization power and test-time adaptation abilities in amodal object segmentation and tracking.
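To make the slot-centric bottleneck idea concrete, here is a minimal sketch of a slot-based autoencoder: an image is encoded into a small set of slot vectors and decoded back, so the reconstruction objective itself encourages an entity-like decomposition. This is an illustrative 2D sketch under assumed names and sizes, not the speaker's actual model, and it omits the 3D/view-prediction part described in the abstract.

```python
# Minimal sketch (assumption, not the speaker's model): image -> K slot vectors -> image.
import torch
import torch.nn as nn

class SlotAutoencoder(nn.Module):
    def __init__(self, num_slots=6, slot_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(                       # image -> feature map
            nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, slot_dim, 5, stride=2, padding=2), nn.ReLU(),
        )
        self.slot_queries = nn.Parameter(torch.randn(num_slots, slot_dim))
        self.attn = nn.MultiheadAttention(slot_dim, num_heads=1, batch_first=True)
        self.decoder = nn.Sequential(                       # per-slot broadcast decoder
            nn.ConvTranspose2d(slot_dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 4, 4, stride=2, padding=1),  # RGB + mask logit
        )

    def forward(self, img):                                 # img: (B, 3, H, W)
        B = img.shape[0]
        feats = self.encoder(img)                           # (B, D, H/4, W/4)
        tokens = feats.flatten(2).transpose(1, 2)           # (B, H*W/16, D)
        queries = self.slot_queries.unsqueeze(0).expand(B, -1, -1)
        slots, _ = self.attn(queries, tokens, tokens)       # (B, K, D) slot bottleneck
        K, D = slots.shape[1], slots.shape[2]
        h, w = feats.shape[2], feats.shape[3]
        grid = slots.reshape(B * K, D, 1, 1).expand(B * K, D, h, w)
        out = self.decoder(grid).reshape(B, K, 4, img.shape[2], img.shape[3])
        rgb, mask = out[:, :, :3], out[:, :, 3:].softmax(dim=1)
        recon = (rgb * mask).sum(dim=1)                     # composite slots back to an image
        return recon, slots, mask

model = SlotAutoencoder()
img = torch.randn(2, 3, 64, 64)
recon, slots, masks = model(img)
loss = ((recon - img) ** 2).mean()                          # autoencoding objective
```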


Bio:

Katerina Fragkiadaki is an Assistant Professor in the Machine Learning Department at Carnegie Mellon University. She received her Ph.D. from the University of Pennsylvania and was subsequently a postdoctoral fellow at UC Berkeley and Google Research. Her work is on learning visual representations with little supervision and on combining spatial reasoning with deep visual learning. Her group develops algorithms for mobile computer vision, and for learning physics and common sense for agents that move around and interact with the world. Her work has been recognized with a best Ph.D. thesis award, an NSF CAREER award, an AFOSR Young Investigator award, a DARPA Young Investigator award, and Google, TRI, Amazon, UPMC and Sony faculty research awards.


Few-Shot Classification by Recycling Deep Learning

This talk will present recent work I've been involved in on few-shot learning. For this workshop, I'll be framing this work from the point of view of "recycling deep learning". Though not really a new paradigm, in that it overlaps with several others that already exist, I will try to convince you that approaching few-shot classification from this perspective can shift our thinking in interesting ways about what properties we want few-shot learning solutions to satisfy.


Bio:

Hugo Larochelle is a Research Scientist at Google Brain and lead of the Montreal Google Brain team. He is also a member of Yoshua Bengio's Mila and an Adjunct Professor at the Université de Montréal. He has made a number of contributions to the foundations of deep learning, including on unsupervised pre-training of deep networks, denoising autoencoders, zero-shot learning, domain adversarial networks and meta-learning for few-shot learning. He was program chair for ICLR 2015, 2016 and 2017, program chair for NeurIPS 2018 and 2019, and general chair for NeurIPS 2020. He is a member of the boards of NeurIPS and ICML and was a founding member of the board of ICLR. Recently, he co-founded the Transactions on Machine Learning Research.

Data Efficient Learning for Granular Scene Understanding

In this talk I will discuss two recent methods from my group that focus on data-efficient learning in the context of granular scene understanding, such as object detection, segmentation, and scene graph generation. Traditional methods for object detection and segmentation rely on large-scale instance-level annotations for training, which are difficult and time-consuming to collect. Efforts to alleviate this look at varying degrees and quality of supervision. Weakly-supervised approaches draw on image-level labels to build detectors/segmentors, while zero/few-shot methods assume abundant instance-level data for a set of base classes, and zero to a few examples for novel classes. In our recent work we aim to bridge this divide by proposing a simple and unified semi-supervised model that is applicable to a range of supervision: from zero to a few instance-level samples per novel class. For base classes, our model learns a mapping from weakly-supervised to fully-supervised detectors/segmentors. By learning and leveraging visual and lingual similarities between the novel and base classes, we transfer those mappings to obtain detectors/segmentors for novel classes, refining them with a few instance-level annotated samples of the novel classes, when and if available.
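The following is a rough sketch of the transfer idea in the abstract: learn a mapping T from weakly-supervised classifier weights to fully-supervised detector weights on base classes, then apply it to novel classes, combining it with a similarity-weighted average of base-class detectors. All names, shapes, and the fusion rule are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of weak->full detector-weight transfer under assumed shapes.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 512                                         # feature dimension (assumed)
T = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))

def train_mapping(weak_base, full_base, steps=1000, lr=1e-3):
    """weak_base, full_base: (num_base, D) detector weights for base classes."""
    opt = torch.optim.Adam(T.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(T(weak_base), full_base)    # regress weak -> full weights
        loss.backward()
        opt.step()

def novel_detectors(weak_novel, base_sim, full_base):
    """
    weak_novel: (num_novel, D) weakly-supervised weights for novel classes.
    base_sim:   (num_novel, num_base) visual/lingual similarity to base classes.
    Returns fully-supervised-style weights for the novel classes.
    """
    mapped = T(weak_novel)                            # direct transfer through T
    neighbors = base_sim.softmax(dim=1) @ full_base   # similarity-weighted base detectors
    return 0.5 * mapped + 0.5 * neighbors             # simple fusion (assumption)
```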


In a conceptually related work, but focusing on more holistic scene understanding, we propose the first, to our knowledge, framework for pixel-level segmentation-grounded scene graph generation. Our framework is agnostic to the underlying scene graph generation method and addresses the lack of segmentation annotations in target scene graph datasets (e.g., Visual Genome) through transfer and multi-task learning from, and with, an auxiliary dataset (e.g., MS COCO). Specifically, each target object being detected is endowed with a segmentation mask, which is expressed as a lingual-similarity weighted linear combination over categories that have annotations present in the auxiliary dataset. These inferred masks, along with a Gaussian masking mechanism which grounds the relations at a pixel level within the image, allow for improved relation prediction.
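The two ingredients above can be sketched very simply: an inferred mask as a lingual-similarity weighted combination of masks predicted for auxiliary-dataset categories, and a 2D Gaussian map placed around the subject and object that grounds a relation at the pixel level. Names, shapes, and the choice of box centers for the Gaussians are assumptions for illustration, not the released implementation.

```python
# Illustrative helpers (assumptions, not the paper's code) for the two mechanisms above.
import torch

def transferred_mask(aux_masks, lingual_sim):
    """
    aux_masks:   (C_aux, H, W) mask probabilities for auxiliary-dataset categories.
    lingual_sim: (C_aux,) similarity between the target class and each auxiliary class.
    """
    w = lingual_sim.softmax(dim=0)                       # similarities -> weights
    return (w[:, None, None] * aux_masks).sum(dim=0)     # (H, W) inferred mask

def gaussian_relation_map(subj_box, obj_box, H, W, sigma=10.0):
    """Place Gaussians at the subject/object box centers as a pixel-level relation prior."""
    ys = torch.arange(H).float()[:, None].expand(H, W)
    xs = torch.arange(W).float()[None, :].expand(H, W)
    def bump(box):                                       # box: (x1, y1, x2, y2)
        cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
        return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return bump(subj_box) + bump(obj_box)                # (H, W) spatial grounding map
```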


Bio:

Leonid Sigal is an Associate Professor at the University of British Columbia (UBC). He was appointed CIFAR AI Chair at the Vector Institute in 2019 and an NSERC Tier 2 Canada Research Chair in Computer Vision and Machine Learning in 2018. Prior to this, he was a Senior Research Scientist, and a group lead, at Disney Research. He completed his Ph.D. at Brown University in 2008; he received his B.Sc. degrees in Computer Science and Mathematics from Boston University in 1999, his M.A. from Boston University in 1999, and his M.S. from Brown University in 2003. He was a Postdoctoral Researcher at the University of Toronto between 2007 and 2009. Leonid's research interests lie in the areas of computer vision, machine learning, and computer graphics, with an emphasis on approaches for visual and multi-modal representation learning, recognition, understanding and analytics.

Imagination-Supervised Machines that can See, Create, Drive, and Feel

Most existing learning algorithms can be categorized into supervised, semi-supervised, and unsupervised methods. These approaches rely on defining empirical risks or losses on the provided labeled and/or unlabeled data. Beyond extracting learning signals from labeled/unlabeled training data, we will reflect in this talk on a class of methods that can leverage either raw or latent imaginary data. We refer to this class of techniques as imagination-supervised learning, and we will dive into how we developed several approaches to build machine learning methods that can See, Create, Drive, and Feel. See: recognize unseen visual concepts through imaginative learning signals, and how this may extend to a continual setting where seen and unseen classes change dynamically. Create: generate novel art and fashion via creativity losses. Drive: improve trajectory forecasting for autonomous driving by modeling hallucinative driving intents. Feel: generate emotional descriptions of visual art that are metaphoric and go beyond grounded descriptions. I will conclude by pointing out other related topics and connections that can be made to imaginative learning.


Bio:

Mohamed Elhoseiny is an Assistant Professor of Computer Science at KAUST. Since Fall 2021, he has been a senior member of IEEE and a member of the international Summit community. Previously, he was a Visiting Faculty at the Stanford Computer Science department (2019-2020), Visiting Faculty at Baidu Research (2019), and a Postdoctoral Researcher at Facebook AI Research (2016-2019). Dr. Elhoseiny received his Ph.D. in 2016 from Rutgers University, where he was part of the Art & AI lab, and spent time at SRI International in 2014 and at Adobe Research (2015-2016). His primary research interest is in computer vision, especially efficient multimodal learning with limited data in zero/few-shot learning and Vision & Language. He is also interested in Affective AI, and especially in understanding and generating novel visual content (e.g., art and fashion). He received an NSF Fellowship in 2014, the Doctoral Consortium award at CVPR'16, and the best paper award at ECCVW'18 on Fashion and Design. His zero-shot learning work was featured at the United Nations, and his creative AI work was featured in MIT Tech Review, New Scientist Magazine, Forbes Science, and HBO Silicon Valley. He has served as an Area Chair at major AI conferences including CVPR'21, ICCV'21, IJCAI'22 and ECCV'22, and organized the CLVL workshops at ICCV'15, ICCV'17, ICCV'19, and ICCV'21.


Bayesian Few-Shot Classification with One-vs-Each Pólya-Gamma Augmented Gaussian Processes

Few-shot classification, the task of adapting a classifier to unseen classes given a small labeled dataset, is an important step on the path toward human-like machine learning. Bayesian methods are well-suited to tackling the fundamental issue of overfitting in the few-shot scenario because they allow practitioners to specify prior beliefs and update those beliefs in light of observed data. Contemporary approaches to Bayesian few-shot classification maintain a posterior distribution over model parameters, which is slow and requires storage that scales with model size. Instead, we propose a Gaussian process classifier based on a novel combination of Pólya-gamma augmentation and the one-vs-each softmax approximation (Titsias, 2016) that allows us to efficiently marginalize over functions rather than model parameters. We demonstrate improved accuracy and uncertainty quantification on both standard few-shot classification benchmarks and few-shot domain transfer tasks. This is joint work with Rich Zemel.
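The one-vs-each approximation (Titsias, 2016) mentioned above bounds the softmax likelihood by a product of pairwise sigmoids, sigma(f_y - f_c) over c != y; each sigmoid factor has exactly the form that Pólya-gamma augmentation turns into a Gaussian likelihood in the latent function, which is what makes the GP marginalization tractable. The sketch below shows just that bound; the Pólya-gamma augmentation step and the GP posterior update are omitted, and the shapes are illustrative.

```python
# One-vs-each lower bound on the softmax log-likelihood (Titsias, 2016).
import torch

def one_vs_each_log_lik(f, y):
    """
    f: (N, C) latent function values (e.g. GP samples) for N points and C classes.
    y: (N,) integer class labels.
    Returns the one-vs-each lower bound on sum_n log p(y_n | f_n).
    """
    f_y = f.gather(1, y[:, None])                           # (N, 1) value at the true class
    log_sig = torch.nn.functional.logsigmoid(f_y - f)       # log sigma(f_y - f_c)
    mask = torch.ones_like(f).scatter_(1, y[:, None], 0.0)  # exclude the c == y term
    return (log_sig * mask).sum()

f = torch.randn(8, 5)                                       # toy latent values
y = torch.randint(0, 5, (8,))
print(one_vs_each_log_lik(f, y))
```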


Bio:

Jake Snell is a postdoctoral researcher at Princeton University, working with Thomas Griffiths. He received his Ph.D. in Computer Science from the University of Toronto under the supervision of Richard Zemel, where he also completed a postdoctoral fellowship. His recent research interests include nonparametric Bayesian approaches to few-shot learning and uncertainty quantification for reliable machine learning.

Understanding The Robustness in Vision Transformers

In this talk, I will present our work “Understanding the Robustness in Vision Transformers”, which will appear at ICML 2022. In the first part of the talk, I will give a brief survey of recent studies of the intriguing robustness properties of ViTs. In the second part, I will delve deeper into the trinity among self-attention, visual grouping and information bottleneck, and unveil its relation to ViTs’ robust generalization capabilities. On top of these observations, I will present Fully Attentional Networks (FAN), a family of general-purpose Vision Transformer backbones that are highly robust to unseen natural corruptions in various visual recognition tasks.
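For readers less familiar with ViTs, the block below is a plain ViT-style self-attention block, shown only as a reference point for the token self-attention whose grouping behaviour the talk analyzes; it is not the FAN architecture, and all dimensions are illustrative.

```python
# Standard ViT-style block (reference only, NOT the FAN design).
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    def __init__(self, dim=384, heads=6, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                               # x: (B, N_tokens, dim)
        h = self.norm1(x)
        attn_out, attn_weights = self.attn(h, h, h)     # attn_weights: (B, N, N)
        x = x + attn_out                                # token mixing via self-attention
        x = x + self.mlp(self.norm2(x))                 # channel mixing via MLP
        return x, attn_weights                          # weights can be inspected as a grouping signal

tokens = torch.randn(1, 196, 384)                       # e.g. 14x14 patch tokens
out, attn = ViTBlock()(tokens)
```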


Bio:

Zhiding Yu is a senior research scientist at NVIDIA Research. Before joining NVIDIA in 2018, he received a Ph.D. in ECE from Carnegie Mellon University in 2017 and an M.Phil. in ECE from the Hong Kong University of Science and Technology in 2012. His research interests mainly focus on deep representation learning, weakly/self-supervised learning, transfer learning, and deep structured prediction, with applications to vision and robotics problems. He is a winner of the Domain Adaptation for Semantic Segmentation Challenge at the Workshop on Autonomous Driving (WAD) at CVPR 2018. He is a co-author of the best student paper at ISCSLP 2014, and a recipient of the best paper awards at WACV 2015 and BMVC 2020. His work on deep facial expression recognition at Microsoft Research won first runner-up at the Emotion Recognition in The Wild (EmotiW) Challenge 2015 and was integrated into the Microsoft Emotion Recognition API under Microsoft Azure Cognitive Services. At NVIDIA, he has led numerous efforts to develop state-of-the-art Vision Transformer and general visual recognition models, with applications to auto-labeling, AV perception, and other scene understanding problems.