Diffusion Models in Computer Vision
Abstract
Denoising diffusion models are an emerging topic in computer vision, demonstrating impressive results in generative modeling. A diffusion model is a deep generative model based on two stages: a forward diffusion stage and a reverse diffusion stage. In the forward diffusion stage, the input data is gradually perturbed over several steps by adding Gaussian noise. In the reverse stage, a model is tasked with recovering the original input data by learning to gradually reverse the diffusion. Diffusion models are widely appreciated for the quality and diversity of the images they generate. In this talk I will present our recent work on how diffusion models can be employed for solving computer vision problems.
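The forward stage described above can be written in closed form: for a noise schedule beta_1..beta_T, the noised sample at step t is x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps with eps ~ N(0, I), where alpha_bar_t is the cumulative product of (1 - beta_s). The sketch below is a minimal illustration of this standard DDPM forward process, not an implementation of the works presented in the talk; the function name and the linear schedule are assumptions for illustration.

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng=None):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    rng = rng or np.random.default_rng(0)
    alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_s)
    eps = rng.standard_normal(x0.shape)  # Gaussian noise added at this step
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

# Hypothetical linear noise schedule over T = 1000 steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
x0 = np.ones(4)                      # toy "clean" input
xt, eps = forward_diffuse(x0, T - 1, betas)
# At t = T - 1, alpha_bar is near zero, so x_t is almost pure Gaussian noise;
# the reverse stage trains a network to predict eps from (x_t, t) and undo this.
```

In practice the reverse stage iterates from t = T down to 1, at each step using the network's noise estimate to take a small denoising step; the closed-form forward process is what makes training on randomly sampled timesteps efficient.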
First, I will present our work on person image synthesis. In this work, we show how denoising diffusion models can be applied to high-fidelity person image synthesis with strong sample diversity and enhanced mode coverage of the learned data distribution. Our method decomposes the complex transfer problem into a series of simpler forward-backward denoising steps. Second, I will discuss temporal action segmentation for understanding human behaviors in complex videos. I will present a framework based on the denoising diffusion model that iteratively produces action predictions starting from random noise, conditioned on the features of the input video. To effectively capture three key characteristics of human actions, namely the position prior, the boundary ambiguity, and the relational dependency, we propose a cohesive masking strategy for the conditioning features. Third, I will present our recent work that underscores the significance of incorporating symmetries into diffusion models, by enforcing equivariance to a general set of transformations within the DDPM reverse denoising learning process. Finally, I will end this talk by showing state-of-the-art results on employing diffusion models for cloth-changing person re-identification and for limited field-of-view cross-view geo-localization.