Program
This is the tentative program and may be subject to change. Event times shown in the schedule are local times in Seattle, WA, USA.
Invited Talks
Daniela Massiceti
Microsoft Research
Can few-shot approaches be used to make large multi-modal models more inclusive?
In this talk, I will present our recent work which evaluates the performance of CLIP, a large multi-modal model (LMM), on data captured by people who are blind or low vision (BLV). LMMs like CLIP are powerful models that can process both visual and textual information, and they have the potential to enable new forms of automated visual assistance for BLV users. However, we show that CLIP is not well suited to the real-world scenarios faced by BLV users. We identify three main sources of performance disparity: 1) the content of BLV images, such as assistive objects, which are not well recognised by CLIP; 2) the quality of BLV images, including blur, orientation and lighting variation, to which CLIP is not robust; and 3) the content of BLV language, such as tactile or non-visual adjectives, which are not well understood by CLIP. I will discuss approaches to address these disparities and their implications, including zero-shot approaches that leverage LLMs and few-shot approaches that leverage data captured by BLV users. I will close the talk with a broader picture of ongoing research at Microsoft Research towards developing models that work for those “in the margins”.
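For readers less familiar with the few-shot setting, the sketch below shows the general recipe such approaches build on: a lightweight classifier trained on frozen image embeddings from a handful of labelled examples. The encoder stub, class counts, and shot counts are illustrative assumptions, not the setup from the talk.

```python
# Hypothetical sketch: few-shot adaptation of a frozen CLIP-style encoder via a
# linear probe. The encoder below is a random stand-in; in practice you would
# plug in CLIP's image encoder and real photos captured by BLV users.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def encode_image(images):
    """Placeholder for a frozen image encoder; returns one embedding per image."""
    return rng.normal(size=(len(images), 512))  # assumed 512-d embeddings

# Suppose 5 personal object classes, each with only 4 labelled photos ("shots").
num_classes, shots = 5, 4
support_images = np.zeros((num_classes * shots, 224, 224, 3))  # user-captured photos
support_labels = np.repeat(np.arange(num_classes), shots)

# Embed once with the frozen encoder, then fit a lightweight linear classifier.
probe = LogisticRegression(max_iter=1000).fit(encode_image(support_images), support_labels)

# New photos are embedded the same way and classified by the probe.
query_images = np.zeros((2, 224, 224, 3))
print(probe.predict(encode_image(query_images)))
```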
Bio: Daniela Massiceti is a senior machine learning researcher in the Teachable AI Experiences team at Microsoft Research, based in Sydney, Australia. She works at the intersection of ML and human-computer interaction and is primarily interested in the sociotechnical innovations we need to ensure AI systems work for those in the “tails” of the user distribution: from rethinking data collection and annotation pipelines, to ways of personalising large models to marginalised individuals and communities, to ensuring the evaluation frameworks we use drive research meaningfully forward.
Alaa El-Nouby
Apple
Scalable Pre-training of Large Autoregressive Image Models
This talk discusses our recent work (https://arxiv.org/abs/2401.08541) revisiting unsupervised autoregressive pre-training for vision. These models are inspired by their textual counterparts, i.e., Large Language Models (LLMs), and exhibit similar scaling properties. We highlight two key findings: (1) the performance of the visual features scales with both the model capacity and the quantity of data, and (2) the value of the objective function correlates with the performance of the model on downstream tasks. We illustrate the practical implications of these findings by pre-training a 7 billion parameter model on 2 billion images. Interestingly, even at this scale, we observe no sign of saturation in performance, suggesting that such models potentially represent a new frontier for training large-scale vision models.
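As a rough illustration of the autoregressive image-modelling objective described above, the sketch below splits an image into a raster-ordered sequence of patches and regresses each next patch from the preceding ones with a causally masked transformer. The patch size, tiny transformer, and pixel-regression head are simplifying assumptions, not AIM's actual architecture or training recipe.

```python
# Illustrative sketch of autoregressive pre-training on image patches (assumed
# shapes and a deliberately tiny model; not the configuration from the paper).
import torch
import torch.nn as nn

patch, dim, img = 16, 256, 224
num_patches = (img // patch) ** 2                      # 196 patches per image

to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # patch embedding
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2,
)
predict_pixels = nn.Linear(dim, 3 * patch * patch)     # regress raw pixels of the next patch

x = torch.randn(4, 3, img, img)                        # dummy batch of images
tokens = to_patches(x).flatten(2).transpose(1, 2)      # (B, 196, dim), raster order

# Additive causal mask: position t only attends to patches <= t.
mask = torch.full((num_patches, num_patches), float("-inf")).triu(diagonal=1)
hidden = decoder(tokens, mask=mask)

# Ground-truth pixel patches, flattened to match the prediction head.
targets = x.unfold(2, patch, patch).unfold(3, patch, patch)        # (B, 3, 14, 14, 16, 16)
targets = targets.permute(0, 2, 3, 1, 4, 5).reshape(4, num_patches, -1)

# Predict patch t+1 from the prefix up to t; simple L2 regression objective.
loss = nn.functional.mse_loss(predict_pixels(hidden[:, :-1]), targets[:, 1:])
loss.backward()
print(loss.item())
```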
Bio: Alaa El-Nouby is a research scientist on the Machine Learning Research (MLR) team at Apple. Previously, he worked at Meta AI (FAIR). He received his MSc from the University of Guelph in 2019 and his PhD from ENS and Inria Paris in 2023. His research focuses on designing scalable methods for learning multimodal and visual representations. His work includes methods like ImageBind, OmniMAE, DINOv2, and most recently AIM.
Eleni Triantafillou
Google DeepMind
On challenging distribution shifts in learning and unlearning tasks
The trend of building larger and more data-hungry deep learning models has generated many recent success stories. However, a consequence of this trend is that models are more expensive to train than ever before, leading practitioners to reuse previously trained models for a host of downstream applications. Unfortunately, while designing efficient and robust adaptation strategies is an active research area, we still don’t fully understand the failure modes of existing technology in coping with diverse distribution shifts. In this talk, I will highlight two challenging instances of this general problem, in learning and unlearning tasks, respectively. In the first, I will show that various “state-of-the-art” source-free domain adaptation methods fail to transfer well to challenging distribution shifts in bioacoustics, highlighting the need for emphasis on the robustness of adaptation algorithms and for more diverse evaluation benchmarks. In the second, I will discuss the problem of unlearning, a different type of “distribution shift” that aims to remove the influence of some training data from pretrained models. I will give an overview of progress and challenges in that direction, highlighting recent work that identifies what makes unlearning problems difficult, and directions for future work.
Bio: Eleni is a research scientist at Google DeepMind. Her research agenda is centered around designing efficient adaptation algorithms for learning and unlearning tasks. Previously, she obtained her PhD from the University of Toronto, on the topic of few-shot learning, under the supervision of Rich Zemel and Raquel Urtasun.
Cees G. M. Snoek
University of Amsterdam
What multimodal foundation models cannot perceive
Multimodal foundation models are a revolutionary class of AI models that can generate multimedia content from interactive prompts in a seemingly creative manner. These foundation models are often self-supervised transformer-based models pre-trained on large volumes of data, typically collected from the web. They already form the basis of all state-of-the-art systems in computer vision and natural language processing across a wide range of tasks and have shown impressive transfer learning abilities. Despite their immense potential, these foundation models face challenges in fundamental perception tasks such as spatial grounding and temporal reasoning, have difficulty operating in low-resource scenarios, and neglect the human alignment needed for ethical, legal, and societal acceptance. In this talk I will highlight recent work from my lab that identifies several of these challenges, as well as ways to address them by updating foundation models with limited labels, and to do so in a sustainable way, without the need to retrain from scratch.
Bio: Cees G.M. Snoek is a full professor in computer science at the University of Amsterdam, where he heads the Video & Image Sense Lab. He is also a director of three public-private AI research labs: QUVA Lab with Qualcomm, Atlas Lab with TomTom and AIM Lab with the Inception Institute of Artificial Intelligence. At the university spin-off Kepler Vision Technologies he acts as Chief Scientific Officer. Professor Snoek is also the director of the ELLIS Amsterdam Unit and scientific director of Amsterdam AI, a collaboration between government, academic, medical and other organisations in Amsterdam to study, develop and deploy responsible AI.
Raoul de Charette
Inria Paris
Scene understanding on the shoulders of foundational models
While foundational models are typically trained on image-level tasks, they exhibit both generalization capacity beyond their training domain and transferability to other tasks. Benefiting from the latter, we demonstrate that piggybacking on frozen foundational models improves scene understanding while reducing computational and data requirements. In this context, we will explore our recent work addressing various scene understanding tasks, from semantic segmentation to brain decoding. First, we will show how minimal feature tuning can rely on language guidance to boost domain-generalized segmentation. Second, relying on their formidable image generation capacity, we will demonstrate how prompt inversion can enable material estimation from a single image. Finally, we will briefly introduce how other modalities, like brain signals, can be aligned with multimodal spaces to decode neural activity, thereby paving the way for assistive devices for disabled individuals.
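To make the "piggyback on a frozen model" pattern concrete, here is a minimal sketch in its spirit: only a small head is trained on top of frozen pretrained features. The torchvision ResNet-50 backbone, the 1x1-convolution head, and all shapes are stand-in assumptions, not the models or tasks covered in the talk.

```python
# Minimal sketch: frozen pretrained backbone + tiny trainable segmentation head.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

backbone = resnet50(weights=ResNet50_Weights.DEFAULT)        # stand-in for a foundation model
backbone = nn.Sequential(*list(backbone.children())[:-2])    # keep the spatial feature map
for p in backbone.parameters():
    p.requires_grad = False                                  # features stay frozen

num_classes = 19                                             # e.g. a Cityscapes-style label set
head = nn.Conv2d(2048, num_classes, kernel_size=1)           # the only trainable part
optim = torch.optim.AdamW(head.parameters(), lr=1e-3)

images = torch.randn(2, 3, 512, 512)                         # dummy batch
labels = torch.randint(0, num_classes, (2, 16, 16))          # dummy low-resolution masks

with torch.no_grad():                                        # backbone runs inference-only
    feats = backbone(images)                                 # (2, 2048, 16, 16)

loss = nn.functional.cross_entropy(head(feats), labels)
loss.backward()
optim.step()
print(loss.item())
```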
Bio: Raoul de Charette is a researcher at Inria in the Astra team, where he leads the Astra-Vision group (https://astra-vision.github.io). His work focuses on interpretable and robust scene understanding with less supervision, leveraging priors such as common sense and physics. He received his PhD from Mines Paris and has also worked at Carnegie Mellon University and the University of Macedonia. He is a member of ELLIS and a Fellow of Prairie-PSAI.
Stella Yu
University of Michigan
Visual Intelligence Emergent from Grounding Recognition on Consistent Segmentations
Image segmentation has evolved such that it is routinely treated as an end task. For example, in autonomous driving we are interested in segmenting a road scene into cars, bikes, motorcycles, persons, trees, lamp-posts, traffic signs, curbs, etc.; to differentiate a person in different contexts, we label a person on a bike a bike-rider, a person on a curb a pedestrian, and a person on a horse a horse-rider; to understand the intent and action of a person, we want to segment the person into head, torso, arms, and legs. The Segment Anything Model (SAM) takes supervised segmentation to a large scale, giving the false impression that segmentation is now solved. My view is that segmentation underlies the generalization capability of visual intelligence and that supervised segmentation is simply the wrong approach. Segmentation should be treated not as an end goal in itself, but as an internal mid-level representation that serves the master of visual recognition. I will present our recent works in this direction, including unsupervised learning of objectness and visual context, and unsupervised discovery of visual semantic hierarchies and part-whole hierarchies.
Bio: Stella Yu received her Ph.D. from Carnegie Mellon University, where she studied robotics at the Robotics Institute and vision science at the Center for the Neural Basis of Cognition. Before joining the University of Michigan as a Professor of Electrical Engineering and Computer Science in Fall 2022, she was the Director of the Vision Group at the International Computer Science Institute, a Senior Fellow at the Berkeley Institute for Data Science, and on the faculty of Computer Science, Vision Science, and Cognitive and Brain Sciences at UC Berkeley. Dr. Yu is interested not only in understanding visual perception from multiple perspectives, but also in using computer vision and machine learning to automate and exceed human expertise in practical applications. http://www.eecs.umich.edu/~stellayu