The banner image was generated with FLUX, a powerful diffusion model, using the prompt: "dynamics flows of fluid".
While diffusion models now power many computer vision systems, existing tutorials are often fragmented: either implementation-driven, or surveys of sub-areas without a unifying framework. This tutorial distills two core principles behind modern diffusion (the change-of-variable formula for the generative process and the conditioning trick for constructing tractable regression targets) and focuses on what is increasingly decisive for vision: real-time generation via flow map models, a diffusion-motivated family of generative models. We connect training objectives, sampling dynamics, and practical accelerations into a coherent view, and translate it into actionable guidance for latency-constrained tasks such as image synthesis, editing, and video generation.
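As a one-line illustration of the conditioning trick (a standard denoising score matching sketch in generic notation; the symbols $\alpha_t$, $\sigma_t$, and $s_\theta$ are illustrative choices, not the tutorial's own): the marginal score $\nabla_{x_t}\log p_t(x_t)$ is intractable, but conditioning on a clean sample $x_0$, with $x_t = \alpha_t x_0 + \sigma_t \epsilon$ and $\epsilon \sim \mathcal{N}(0, I)$, yields a tractable regression target,

$$\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,\epsilon}\big\| s_\theta(x_t, t) - \nabla_{x_t}\log p_t(x_t \mid x_0) \big\|^2, \qquad \nabla_{x_t}\log p_t(x_t \mid x_0) = -\frac{\epsilon}{\sigma_t},$$

whose minimizer matches the marginal score in expectation.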
The same change-of-variable principle also extends naturally to tokenized generative modeling. We further cover tokenized models, including discrete diffusion, as a bridge from continuous pixels to cross-modal/multi-modal generation. This perspective unifies text, vision, and other modalities through shared discrete representations, clarifying why discrete diffusion matters for computer vision and how it enables capabilities beyond standard continuous diffusion pipelines.
Meta Superintelligence Labs (MSL)
Adobe
Sony AI
Stanford University
🎹 Please feel free to contact Chieh-Hsin (Jesse) Lai via [chieh-hsin.lai@sony.com / chiehhsinlai@gmail.com] or Subham Sahoo via [ssahoo@cs.cornell.edu] with any questions or concerns.