Day and Time:
June 19, 9:00 AM - 5:10 PM.
Program:
This program is tentative and subject to change. All times shown in the schedule are local times in Vancouver, BC.
Poster Session:
10:15 AM - Poster boards #60-92
3:10 PM - Poster boards #61-93
Location: West Exhibit Hall (note that the workshop itself takes place in the East building)
Social and Mixer:
We will be hosting a board games and social event directly after the workshop. For questions, please email issam.laradji@gmail.com.
Invited Talks
In this talk, I'll make the case for using limited labeled data to achieve a finer-grained understanding of video content. In particular, I'll focus on our recent work in domain adaptation [CVPR 2022, ArXiv] and self-supervision [ECCV 2022, ArXiv], where we further reduce the supervision required by making effective use of the data that is available.
Bio:
Hazel Doughty is a postdoctoral researcher at the University of Amsterdam working with Prof. Cees Snoek. She completed her PhD at the University of Bristol in the UK under the guidance of Prof. Dima Damen. During her PhD she also spent several months as a visiting researcher at Inria Paris. In September she will join the University of Leiden as an Assistant Professor. Her research focuses on video understanding, particularly fine-grained understanding and learning with incomplete supervision.
Despite their unprecedented performance when trained on large-scale labeled data, deep networks are seriously challenged when dealing with novel (unseen) tasks and/or limited labeled instances. This generalization challenge occurs in a breadth of real scenarios and applications. In contrast, humans can learn new tasks easily from a handful of examples, by leveraging prior experience and context. Few-shot learning attempts to bridge this gap, and has recently triggered substantial research efforts. This talk discusses some recent results within this subject. I will start by highlighting recent results, which seriously question the relevance of the abundant and popular meta-learning and episodic-training literature. Then, I will advocate transductive inference, which leverages the statistics of unlabeled data, as a promising avenue for few-shot learning. I will detail the information maximization (InfoMax) principle as an example of transductive inference. I will further show simple extensions of this principle, which competitively tackle difficult problems such as few-shot semantic segmentation.
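To make the InfoMax idea above concrete, here is a minimal sketch of a transductive information-maximization objective for few-shot classification: a cross-entropy term on the labeled support set combined with a mutual-information term (marginal entropy minus conditional entropy) computed on the unlabeled query predictions. This is an illustrative reconstruction under assumed names (infomax_loss, the weight alpha), not code from the speaker's work.

```python
# Hedged sketch of a transductive InfoMax objective for few-shot classification.
# Names and the weighting scheme are illustrative assumptions.
import torch
import torch.nn.functional as F

def infomax_loss(query_logits, support_logits, support_labels, alpha=1.0):
    """Approximates I(X; Y) ~= H(Y) - H(Y|X) on the query set, plus
    a supervised cross-entropy term on the labeled support set."""
    probs = query_logits.softmax(dim=-1)                        # (num_query, num_classes)

    # Conditional entropy H(Y|X): average per-sample prediction entropy (to minimize).
    cond_ent = -(probs * torch.log(probs + 1e-12)).sum(dim=-1).mean()

    # Marginal entropy H(Y): entropy of the averaged prediction (to maximize).
    marginal = probs.mean(dim=0)
    marg_ent = -(marginal * torch.log(marginal + 1e-12)).sum()

    # Supervised loss on the few labeled support examples.
    ce = F.cross_entropy(support_logits, support_labels)

    # Minimizing this objective maximizes mutual information on the query set.
    return ce + alpha * (cond_ent - marg_ent)
```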
Bio:
Ismail Ben Ayed is currently a full professor at ETS Montreal, where he holds a research chair on artificial intelligence in medical imaging. He is also affiliated with the University of Montreal Hospital Research Centre (CRCHUM). His interests are in computer vision, machine learning, optimization, and medical image analysis algorithms. He has published over 130 articles, with an h-index of 47, along with 2 technical books and 7 approved US patents. Most of his articles are published in the topmost venues in vision, learning, and medical imaging. He has given over 50 invited talks and 7 tutorials at flagship conferences. His team has received several international distinctions, such as the MIDL 2021 best-paper award. Ismail has served as Program Chair for MIDL 2020 and regularly serves as Area Chair for the MICCAI and MIDL conferences. He also serves regularly as a reviewer for the main journals of the field, and has been selected several times among the top reviewers of prestigious conferences (such as CVPR 2021, NeurIPS 2020, and CVPR 2015).
We address the task of weakly-supervised few-shot image classification and segmentation, by leveraging a Vision Transformer (ViT) pretrained with self-supervision. Our proposed method takes token representations from the self-supervised ViT and leverages their correlations, via self-attention, to produce classification and segmentation predictions through separate task heads. Our model is able to effectively learn to perform classification and segmentation in the absence of pixel-level labels during training, using only image-level labels. To do this it uses attention maps, created from tokens generated by the self-supervised ViT backbone, as pixel-level pseudo-labels. We also explore a practical setup with “mixed” supervision, where a small number of training images contains ground-truth pixel-level labels and the remaining images have only image-level labels. For this mixed setup, we propose to improve the pseudo-labels using a pseudo-label enhancer that was trained using the available ground-truth pixel-level labels. Experiments on Pascal-5i and COCO-20i demonstrate significant performance gains in a variety of supervision settings, and in particular when little-to-no pixel-level labels are available.
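As an illustration of the pseudo-labeling step described above, the sketch below converts a self-supervised ViT's [CLS]-to-patch attention into a coarse binary pixel mask by keeping the most-attended patches. It is a hedged approximation of the general idea, not the authors' implementation; the function name, the quantile threshold, and the 16-pixel patch size are assumptions.

```python
# Hedged sketch: turning [CLS] self-attention from a self-supervised ViT into a
# coarse pixel-level pseudo-mask. Names, threshold, and patch size are illustrative.
import torch
import torch.nn.functional as F

def attention_to_pseudo_mask(cls_attn, grid_size, keep_ratio=0.6):
    """cls_attn: (num_heads, num_patches) attention from the [CLS] token to patch tokens."""
    attn = cls_attn.mean(dim=0)                        # average over heads -> (num_patches,)

    # Keep the most-attended patches as foreground by thresholding at a quantile.
    thresh = torch.quantile(attn, 1.0 - keep_ratio)
    mask = (attn >= thresh).float()                    # (num_patches,) binary pseudo-label
    mask = mask.reshape(grid_size, grid_size)          # back to the 2D patch grid

    # Upsample the patch-level mask to pixel resolution (assuming 16x16 patches)
    # so it can supervise a segmentation head.
    return F.interpolate(mask[None, None], scale_factor=16, mode="nearest")[0, 0]
```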
Bio:
Naila Murray obtained a BSE in electrical engineering from Princeton University in 2007. In 2012, she received her Ph.D. from the Universitat Autonoma de Barcelona, in affiliation with the Computer Vision Center. She joined Xerox Research Centre Europe in 2013 as a research scientist in the computer vision team, working on topics including fine-grained visual categorization, image retrieval and visual attention. From 2015 to 2019 she led the computer vision team at Xerox Research Centre Europe, and continued to serve in this role after its acquisition and transition to becoming NAVER LABS Europe. In 2019, she became the director of science at NAVER LABS Europe. In 2020, she joined Meta AI’s FAIR team, where she is currently a senior research manager. She has served as area chair for ICLR 2018, ICCV 2019, ICLR 2019, CVPR 2020, ECCV 2020, and CVPR 2022, and program chair for ICLR 2021. Her current research interests include few-shot learning and domain adaptation.
This talk discusses recent research on self-supervised learning from video. Firstly, I will present a conceptually simple extension of Masked Autoencoders (MAE) for spatiotemporal representation learning from videos. Next, I will demonstrate how to learn self-supervised audio representations within the same framework and introduce an approach to training representations for integrated audio-visual perception, showcasing strong performance across various tasks. Lastly, I will discuss Hiera, a new architecture—a simple hierarchical vision transformer trained with MAE—which outperforms existing methods while eschewing unnecessary complexities.
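For readers unfamiliar with masked autoencoding on video, the sketch below shows the random spacetime masking at the core of such an approach: a large fraction of space-time patches is dropped and only the visible ones are passed to the encoder. It is a generic illustration under assumed shapes and a 90% mask ratio, not the specific recipe from the talk.

```python
# Hedged sketch of random spacetime masking for a video masked autoencoder.
# Shapes and the 90% mask ratio are illustrative assumptions.
import torch

def random_spacetime_mask(tokens, mask_ratio=0.9):
    """tokens: (batch, num_spacetime_patches, dim) patch embeddings of a video clip."""
    b, n, d = tokens.shape
    num_keep = int(n * (1.0 - mask_ratio))

    # Independently shuffle patch indices per clip and keep the first num_keep.
    noise = torch.rand(b, n, device=tokens.device)
    ids_shuffle = noise.argsort(dim=1)
    ids_keep = ids_shuffle[:, :num_keep]

    # Gather only the visible (unmasked) tokens; the encoder sees just these.
    visible = torch.gather(tokens, 1, ids_keep[..., None].expand(-1, -1, d))

    # A decoder would later reconstruct the masked patches from the encoded
    # visible tokens plus learnable mask tokens.
    return visible, ids_keep
```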
Bio:
Christoph Feichtenhofer is a Research Scientist Manager at Meta AI (FAIR). He received his BSc, MSc, and PhD degrees in computer science from TU Graz in 2011, 2013, and 2017, respectively, and spent time as a visiting researcher at York University, Toronto, as well as the University of Oxford. He is a recipient of a DOC Fellowship of the Austrian Academy of Sciences and was awarded the Award of Excellence for outstanding doctoral theses in Austria. His main areas of research include the development of effective representations for image and video understanding.
The power of low-level image segmentation has been rediscovered in computer vision, and in combination with neo-traditional deep end-to-end recognition pipelines it is leading to rapid advances in few-shot learning performance. I'll summarize recent progress on both image and video tasks at UC Berkeley, and discuss how hierarchical segmentation models provide necessary representational leverage. As time permits, I'll also present methods for language-guided model advice, where vision and language models can learn an appropriate target concept through natural language interaction. Taken together, these methods point to a dramatically different path for future computer vision systems, where large-scale labels are unnecessary and users specify concepts and refine models using sparse multimodal feedback.
Bio:
Up to 1 million species face extinction in the next several decades, with biodiversity loss a major factor. In this talk, I will describe my team's efforts to leverage advances in deep learning to monitor and improve biodiversity health. Our work is bolstered by the $24M University of Guelph-led BIOSCAN project, a global interdisciplinary effort to build a biodiversity observation system, and LIFEPLAN, a global biodiversity monitoring effort that collects data, including images, audio, and DNA samples, from around 100 sites worldwide. Manual analysis of the data collected in these massive international biodiversity efforts is resource-prohibitive, and their success will depend on automating the analysis of images, sets, sequences, and graphs.
Bio:
Graham Taylor is a Canada Research Chair and Professor of Engineering at the University of Guelph. He co-directs the University of Guelph Centre for Advancing Responsible and Ethical AI and is the Research Director of the Vector Institute for AI. He has co-organized the annual CIFAR Deep Learning Summer School, and trained more than 80 students and researchers on AI-related projects. In 2016 he was named as one of 18 inaugural CIFAR Azrieli Global Scholars. In 2018 he was honoured as one of Canada's Top 40 under 40. In 2019 he was named a Canada CIFAR AI Chair. He is the Academic Director of NextAI, a non-profit accelerator for AI-focused entrepreneurs.