One might dismiss generative image models as merely an effective way to fool investors and voters. To do so is an error; it looks as though they will open a fascinating intellectual agenda. Models know stuff they weren’t taught, and it’s relatively easy to recover it. They can be trained without supervision, which is useful. But there’s a lot of stuff they’re really stupid about, much of which probably can’t be fixed by adding data. There are two urgent things to do: figure out how to make these things behave, and figure out how to turn computer vision into a science of observation. You want to solve some vision problem? Fish the answer out of the innards of a smart enough generative model.
Much knowledge can be thought of as association: associating details to themes, answers to questions, words to meanings. In this talk we will examine how understanding the mathematics of neural association can lead to an understanding of the structure of stored knowledge in large generative models, and how this understanding can be used to directly edit large generative models to mitigate biases and undesired behavior, edit their beliefs, or change the rules that underpin their modeling of the world. We will apply these ideas to GANs, LLMs, and text-to-image diffusion models.
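To make the mathematics of neural association concrete, the sketch below implements a plain linear associative memory in NumPy and then overwrites one stored association with a rank-one update. The variable names and the exact update rule are illustrative assumptions, not the specific editing method covered in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, n = 64, 32, 200

# A bank of key/value pairs to be memorized (think: cue -> stored detail).
K = rng.normal(size=(d_k, n))    # one key per column
V = rng.normal(size=(d_v, n))    # one value per column

# Fit a linear associative memory W so that W @ K approximates V (least squares).
W = V @ np.linalg.pinv(K)

# Overwrite a single association with a rank-one update: pick a key k_star and a
# new value v_star, and nudge W so that W_edited @ k_star == v_star exactly,
# while touching the other stored associations as little as possible.
k_star, v_star = K[:, 0], rng.normal(size=d_v)
u = np.linalg.pinv(K @ K.T) @ k_star              # covariance-weighted direction
W_edited = W + np.outer(v_star - W @ k_star, u) / (k_star @ u)

print(np.allclose(W_edited @ k_star, v_star))     # the new association is stored
print(np.linalg.norm((W_edited - W) @ K[:, 1:]))  # how much the other responses move
```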
Humans are remarkably adept at visual recognition in the wild: we recognize many thousands of object categories, at every level of recognition (pixel, group, and image), and quickly learn new categories from very few labeled examples. AI algorithms, however, lag significantly behind and are limited to much smaller closed vocabularies of hundreds of object categories, especially for pixel-level labeling tasks. How can we scale AI systems to human-like open-vocabulary recognition? Recent image-text foundation models such as CLIP, trained on large Internet-scale data, present a promising path toward improving image-level zero-shot recognition. Going beyond image recognition, we present our pioneering work on the more challenging problem of pixel-level open-vocabulary recognition with large text-image contrastive and generative foundation models. We share what we have learned about the capabilities and challenges of leveraging large multi-modal foundation models for this task, and we close with a discussion of several avenues for future research in this area.
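For concreteness, here is a minimal sketch of the image-level zero-shot setting the talk starts from, using a public CLIP checkpoint via the Hugging Face transformers API; the checkpoint name, prompts, and image path are common placeholders, not the specific models or data used in this work.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Any CLIP checkpoint works; this one is a common public default.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# An open vocabulary is just a list of text prompts: no fixed label set is baked in.
labels = ["a photo of a dog", "a photo of a cat", "a photo of a giraffe"]
image = Image.open("example.jpg")          # any test image

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image  # image-text similarity scores
probs = logits.softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```

Pixel-level open-vocabulary recognition asks for this kind of text-conditioned decision at every pixel rather than once per image, which is where the challenges discussed above arise.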
Anton Obukhov
Daniel Winter
Samyadeep Basu
Lorenzo Olearo
Anisha Jain
Xindi Wu
James Burgess
Fantastic emergent representations and where to find them?
Various useful intermediate representations emerge in generative and discriminative models. First, I will demonstrate the existence of common intermediate representations ("Rosetta neurons") across a range of models with different architectures, different tasks, and different types of supervision (class-supervised, text-supervised, self-supervised). I will present how to mine a dictionary of Rosetta neurons across several popular vision models. These neurons facilitate model-to-model translation, enabling various inversion-based manipulations, including cross-class alignments, shifting, zooming, and more, without the need for specialized training.
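One way to read the mining step is as a search for pairs of units whose activation maps are strongly correlated when two models process the same images. The function below is a simplified, hypothetical version of that matching and omits details such as pairing generative with discriminative models or handling very different resolutions.

```python
import torch
import torch.nn.functional as F

def best_matching_units(acts_a, acts_b, top_k=10):
    """Pair units from two models by correlating their activation maps.

    acts_a: (N, C_a, H_a, W_a) activations from model A on N shared images
    acts_b: (N, C_b, H_b, W_b) activations from model B on the same images
    Returns the top_k (unit_a, unit_b, correlation) triples.
    """
    # Resize both activation stacks to a common spatial grid, then flatten each
    # unit's response over (images x pixels) into one long vector per unit.
    grid = (16, 16)
    a = F.interpolate(acts_a, size=grid, mode="bilinear", align_corners=False)
    b = F.interpolate(acts_b, size=grid, mode="bilinear", align_corners=False)
    a = a.permute(1, 0, 2, 3).flatten(1)   # (C_a, N*16*16)
    b = b.permute(1, 0, 2, 3).flatten(1)   # (C_b, N*16*16)

    # Pearson correlation between every unit of A and every unit of B.
    a = (a - a.mean(1, keepdim=True)) / (a.std(1, keepdim=True) + 1e-8)
    b = (b - b.mean(1, keepdim=True)) / (b.std(1, keepdim=True) + 1e-8)
    corr = a @ b.T / a.shape[1]            # (C_a, C_b)

    vals, idx = corr.flatten().topk(top_k)
    return [(int(i) // corr.shape[1], int(i) % corr.shape[1], float(v))
            for i, v in zip(idx, vals)]
```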
Second, I will show that in diffusion models, useful intermediate representations can emerge in the weight space itself. I will present a subspace of diffusion model weights ("weights2weights" space) computed from over 60,000 models, each fine-tuned to insert a different person’s visual identity. This space enables three immediate applications: sampling a new identity, editing an existing identity, and inverting the identity of a single image into model weights. The results indicate that a universal, interpretable latent space can emerge across different model types and model components (weights, intermediate states, and neurons) and unlock a variety of training-free applications.
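A hedged sketch of how such a weight subspace could be built: flatten each personalized model's fine-tuned weights into a vector, fit a PCA basis over the collection, and then sample or edit inside the resulting low-dimensional space. The sizes, directions, and random placeholder data below are illustrative only, not the actual weights2weights construction.

```python
import numpy as np
from sklearn.decomposition import PCA

# Each personalized model is summarized by a flattened vector of its fine-tuned
# weights (e.g. LoRA deltas). Random placeholders stand in for real checkpoints,
# and the sizes are kept small here (the actual space uses over 60,000 models).
n_models, dim = 2000, 4096
weight_vectors = np.random.randn(n_models, dim).astype(np.float32)

# Fit a low-dimensional linear subspace of model weights.
pca = PCA(n_components=128)
coords = pca.fit_transform(weight_vectors)        # each model -> a 128-dim code

# 1) Sample a new identity: draw a plausible code and decode it back to weights.
new_code = coords.mean(0) + coords.std(0) * np.random.randn(128)
new_weights = pca.inverse_transform(new_code[None])[0]

# 2) Edit an existing identity: move its code along a chosen direction.
edit_direction = np.zeros(128)
edit_direction[5] = 1.0                           # e.g. a learned attribute axis
edited_weights = pca.inverse_transform((coords[0] + 2.0 * edit_direction)[None])[0]

# 3) Inversion (omitted): optimize a code so the decoded weights reproduce one image.
```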
Language models excel at generating text, but what do they know about the visual world? Can they also generate visual data? I will argue that they can; with the help of a code interpreter, they can generate diverse and appealing images. I will share recent results on what kinds of visual structures language models represent, and I will show a use case where an LLM can be used to improve visual representation learning.
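One hypothetical way to realize "generating visual data with a code interpreter": ask the language model for plotting code that depicts a described scene, then execute that code to produce an image. The model name and prompt below are placeholders, and generated code should be run in a sandbox rather than with exec() in practice.

```python
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment

prompt = (
    "Write self-contained Python using matplotlib that draws a simple scene: "
    "a red house with a triangular roof next to a green tree, and saves it to "
    "scene.png. Return only code."
)
reply = client.chat.completions.create(
    model="gpt-4o-mini",                              # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
code = reply.choices[0].message.content.strip()
if code.startswith("```"):                            # crude fence stripping
    code = code.strip("`").removeprefix("python").strip()

# The model never emits pixels directly; the "image" lives in the program it writes.
exec(code)   # in practice, run inside a sandboxed interpreter instead of exec()
```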
Beyond the confines of flat screens, 3D generative models are crucial for creating immersive experiences in virtual reality, not only for human users but also for robotics. Virtual environments and real-world simulators, often composed of complex 3D/4D assets, benefit significantly from the accelerated creation that 3D generative AI enables. In this talk, we will introduce our latest research progress on 3D generative models for objects, avatars, scenes, and motions, including 1) large-scale 3D scene generation, 2) high-quality 3D diffusion for PBR assets, 3) high-fidelity 3D avatar generation, and 4) egocentric motion learning.