Diffusion-based generative models, including score-based and flow-matching variants, have become ubiquitous across a wide range of domains, from images and videos to audio and time series. These models now surpass traditional approaches such as GANs and VAEs in sample quality, while also exhibiting greater training stability. Despite these advantages, a key limitation remains: the high computational cost of sampling. Generating a single high-fidelity sample typically requires a lengthy iterative process that progressively refines random noise through dozens or even hundreds of model evaluations.
A widely adopted strategy to address this is to minimize the number of inference steps, for example by leveraging improved numerical solvers or designing more expressive noise schedules, enabling high-quality generation with significantly fewer denoising iterations. Another increasingly popular alternative is distillation, where a student model learns to emulate the behavior of a larger or more accurate teacher model. Distillation approaches vary in supervision and setup: in online distillation, the student learns directly from the teacher's predictions during training, with the teacher providing trajectory information across time steps. In contrast, offline distillation relies on precomputed noise-image pairs generated by the teacher, so the student learns without further querying the teacher model.
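To make the offline setup concrete, here is a minimal PyTorch sketch, assuming a hypothetical `student` network and tensors `noise` and `images` holding teacher-precomputed pairs; the exact objective used by specific methods (including ours) may differ.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def distill_offline(student, noise, images, epochs=10, lr=1e-4, batch_size=128):
    """Train a student on precomputed (noise, image) pairs; the teacher is never loaded."""
    loader = DataLoader(TensorDataset(noise, images), batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for z, x in loader:
            x_hat = student(z)            # student's one-shot prediction from pure noise
            loss = F.mse_loss(x_hat, x)   # regress onto the teacher's precomputed output
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```

Because the dataset is fixed ahead of time, the same precomputed pairs can be reused to train different student architectures without ever touching the teacher again.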
Offline distillation presents significant benefits compared to online methods. By training on precomputed noise-image pairs, it bypasses the need for complex, multi-stage training procedures. It also minimizes the computational demands on the teacher, since the student's learning process does not require extensive neural function evaluations from the teacher. Furthermore, offline distillation provides a crucial opportunity to filter out potentially privacy-violating synthetic data generated by the teacher *before* the distillation process even begins. Finally, it removes the resource-intensive requirement of keeping the large teacher model in memory throughout the student's optimization. Together, these advantages yield a training paradigm that is more privacy-preserving, highly scalable, and inherently asynchronous, and that enables broader reuse of the distilled data across student model architectures and application scenarios.
The Structure of Generative Dynamics of Denoising Models
It is well established that denoising generative models—such as diffusion models and flow matching models—can be described by stochastic differential equations (SDEs) or ordinary differential equations (ODEs). During training, these models define a forward noising process via a known dynamical system and learn its corresponding reverse process to gradually denoise the data. Once training is complete, sampling new data points becomes possible by simulating the learned reverse dynamics.
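As a rough illustration of why sampling is iterative, the sketch below simulates a learned reverse ODE with plain Euler steps; `velocity` is a hypothetical network standing in for the trained model, and practical samplers (e.g., EDM's Heun solver) use more careful discretizations and noise schedules, but the cost of one network evaluation per step is the same.

```python
import torch

@torch.no_grad()
def sample_reverse_ode(velocity, shape, num_steps=100, device="cpu"):
    # Start from pure noise at t = 1 and integrate the learned reverse
    # dynamics dx/dt = velocity(x, t) down to t = 0 with Euler steps.
    x = torch.randn(shape, device=device)
    ts = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        x = x + (t_next - t) * velocity(x, t)  # one network evaluation per step
    return x
```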
From this viewpoint, these models effectively learn a dynamical system that transports data from one distribution to another. Motivated by this perspective, we initiate a novel investigation into the geometric structure of the learned dynamics. Specifically, we ask: does the initial noisy state uniquely determine the final data point? Our findings indicate that it does. As shown above, before training, noisy samples are scattered arbitrarily; after training, a coherent structure emerges, revealing the organization of the learned flow.
Extending the toy experiment setup, we now explore the high-dimensional FFHQ 64x64 dataset. Instead of dimensionality reduction, which can hide fine-grained patterns, we analyze the local neighborhood of a single latent sample drawn from a standard normal distribution. We create perturbed versions by adding noise, scaled by σ, to this sample and then normalize each perturbed version using its own mean and standard deviation. This setup tests whether meaningful semantic structure arises locally. We vary σ and visualize the results in the figure (left). For σ values below 0.3, the generated images maintain semantic similarity to the original, suggesting local smoothness and adherence to the data manifold. As σ increases, the semantics diverge, yet they retain global coherence even at the highest noise level tested (0.60). We performed the same experiment on our distilled model (right) and observed similar behavior. This confirms that our method, KDM, preserves the local semantic structure of the original EDM. Our findings demonstrate that both the original model and our distilled model organize their latent spaces into meaningful neighborhoods, even in complex, high-dimensional settings like FFHQ.
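For concreteness, here is a minimal sketch of the perturbation probe described above; `generator` in the usage comment is a placeholder for either the teacher or the distilled model, and the shapes and σ values are illustrative.

```python
import torch

@torch.no_grad()
def perturb_latent(z, sigmas, n_per_sigma=8):
    """Build locally perturbed copies of one latent, as in the FFHQ probe above.

    `z` is a single latent drawn from N(0, I); each copy is shifted by
    sigma-scaled Gaussian noise and then re-standardized with its own mean
    and standard deviation before being passed to the generator.
    """
    variants = {}
    for sigma in sigmas:
        noise = torch.randn(n_per_sigma, *z.shape)
        z_pert = z.unsqueeze(0) + sigma * noise
        # per-sample standardization
        mean = z_pert.mean(dim=tuple(range(1, z_pert.dim())), keepdim=True)
        std = z_pert.std(dim=tuple(range(1, z_pert.dim())), keepdim=True)
        variants[sigma] = (z_pert - mean) / std
    return variants

# Usage (with a hypothetical generator mapping latents to images):
# z = torch.randn(3, 64, 64)
# batches = perturb_latent(z, sigmas=[0.1, 0.3, 0.6])
# images = {s: generator(v) for s, v in batches.items()}
```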
Koopman theory proposes that a nonlinear dynamical system can be represented as a linear system when expressed in an appropriate embedded space. To move data into and out of this space, we learn (or fix) transformation functions, which may be linear or nonlinear, typically referred to as an encoder (E) and a decoder (D). Once in the embedded space, the system's evolution becomes linear and can be modeled with simple operations such as matrix multiplication.
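In loose notation (ours, for illustration), if F denotes one step of the nonlinear reverse dynamics, E the encoder, D the decoder, and K the finite Koopman matrix, the picture reads:

$$
E\big(F(x)\big) \;\approx\; K\,E(x)
\quad\Longrightarrow\quad
x_{t+1} \;\approx\; D\big(K\,E(x_t)\big),
\qquad
x_{t+n} \;\approx\; D\big(K^{n}\,E(x_t)\big).
$$

In other words, composing many nonlinear denoising steps reduces to taking powers of a single matrix in the embedded space.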
Previous research has shown that in certain types of dynamical systems, the Koopman operator preserves the structural properties of the original nonlinear dynamics. In our setup, we theoretically prove that this structure-preserving behavior also holds under mild conditions. The consistent latent structure we observe, along with insights from prior work and our own theoretical findings, motivates the use of Koopman theory in our approach.
But one might ask: is such a Koopman operator guaranteed to exist?
The answer is yes. Classical Koopman theory yields an infinite-dimensional operator, whereas our setup requires a finite Koopman matrix; in our work, we provide a theoretical proof that a finite approximation of the Koopman operator does indeed exist. This result offers a solid foundation for applying Koopman-based modeling in our setting and justifies its practical and theoretical use.
Diffusion-based generative models have established new performance benchmarks across various domains, yet their practical deployment is hindered by computationally expensive sampling procedures. Our approach offers a way around this: imitating the nonlinear system with a learned observable and a single matrix lets us drastically reduce the number of sampling steps, as we present in our paper and sketch below.
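Here is a minimal sketch of what such one-step generation could look like; `encoder`, `decoder`, and `embed_dim` are placeholders rather than the actual KDM architecture, and the training objective is omitted entirely.

```python
import torch
import torch.nn as nn

class KoopmanOneStepSampler(nn.Module):
    """Illustrative Koopman-style one-step generator (not the exact KDM architecture)."""
    def __init__(self, encoder, decoder, embed_dim):
        super().__init__()
        self.encoder = encoder                       # learned observable E
        self.decoder = decoder                       # learned observable D
        self.K = nn.Parameter(torch.eye(embed_dim))  # finite Koopman matrix

    @torch.no_grad()
    def sample(self, num_samples, noise_shape, device="cpu"):
        z = torch.randn(num_samples, *noise_shape, device=device)  # initial state (pure noise)
        e = self.encoder(z)         # lift the noise into the embedded space
        e_out = e @ self.K.T        # a single linear step replaces the iterative ODE solve
        return self.decoder(e_out)  # map back to data space
```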
Below we show a comparison between our model and the teacher model that we distilled. Can you tell who is the teacher and who is the student? The left example is the teacher and the right is the student. Moreover, KDM has clearly learned the dynamical structure of the original model: every pair of corresponding images starts from the same noise (initial state) and ends in almost the same final state.
Offline distillation offers clear advantages over online methods. Memory-wise, online approaches require both the teacher and the student models to be loaded simultaneously, increasing memory usage and training overhead. Offline methods avoid this by training the student alone on precomputed teacher outputs, reducing both compute and memory demands. Time-wise, offline distillation is also more teacher-efficient: training GET on 1M CIFAR-10 samples requires 35M teacher forward passes (NFEs), while online methods often require 76-200M NFEs due to multiple teacher evaluations per trajectory. Our method further improves over the state-of-the-art offline approach, GET. As shown in the table below, it matches GET in parameter count but achieves a 4x speedup per training iteration and over 8x faster sampling. Notably, our method maintains this efficiency even with added spectral regularization, incurring no extra cost. These results underscore the practical scalability of our approach.
@misc{berman2025onestepofflinedistillationdiffusionbased,
      title={One-Step Offline Distillation of Diffusion-based Models via Koopman Modeling},
      author={Nimrod Berman and Ilan Naiman and Moshe Eliasof and Hedi Zisling and Omri Azencot},
      year={2025},
      eprint={2505.13358},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.13358},
}