Datasets. We use CIFAR-10, CelebA, ImageNet-64, and LSUN-Church for unconditional generation and MS-COCO-256 for text-conditional generation.
Models. For unconditional generation, we vary the width and depth of the base architecture and train new models. For text-conditional generation, we use the official Stable Diffusion models v1-1 through v1-4.
Baselines. We use three baselines: (1) the best performance among all single-model choices paired with empirical sampler settings; (2) the best performance within the training set of the predictor; (3) the performance of randomly sampled model schedules. Baseline (1) shows the potential of optimizing the model schedule, while baselines (2) and (3) demonstrate the effectiveness of the proposed predictor-based search.
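To make the three baselines concrete, the following is a minimal sketch of how they could be computed from a pool of already-evaluated candidates. All names (`evaluate_fid`, `single_model_schedules`, `predictor_train_set`, `all_schedules`) are hypothetical placeholders, not part of our released implementation, and the treatment of baseline (3) as an average over random draws is an assumption.

```python
import random

def best_fid(schedules, evaluate_fid):
    """Return the lowest (best) FID among a set of candidate schedules."""
    return min(evaluate_fid(s) for s in schedules)

def baseline_fids(single_model_schedules, predictor_train_set,
                  all_schedules, evaluate_fid, num_random=50, seed=0):
    """Hypothetical computation of the three baselines described above.

    single_model_schedules: schedules that use a single model with an
        empirically chosen sampler setting (baseline 1).
    predictor_train_set: schedules used to train the predictor (baseline 2).
    all_schedules: the full search space, from which random schedules
        are drawn (baseline 3).
    evaluate_fid: callable mapping a schedule to its measured FID.
    """
    rng = random.Random(seed)
    baseline1 = best_fid(single_model_schedules, evaluate_fid)
    baseline2 = best_fid(predictor_train_set, evaluate_fid)
    random_picks = rng.sample(all_schedules, num_random)
    baseline3 = sum(evaluate_fid(s) for s in random_picks) / num_random
    return baseline1, baseline2, baseline3
```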
Efficiency evaluation. We report the FID of our searched schedules and all baselines under various budgets. Our results show that the searched schedules achieve sample quality comparable to the baselines while consuming much less time.
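As an illustration of this evaluation protocol, the sketch below measures the FID of one schedule while recording the wall-clock sampling time that is compared against the budget. The helpers `sample_with_schedule` and `real_loader` are hypothetical placeholders for the sampler and the reference-image loader, and the use of `torchmetrics` for FID is our assumption rather than the tooling used in the paper.

```python
import time
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

@torch.no_grad()
def fid_under_budget(schedule, sample_with_schedule, real_loader,
                     num_samples=50_000, batch_size=250, device="cuda"):
    """Generate samples with a model schedule, time the generation,
    and compute FID against the reference images."""
    fid = FrechetInceptionDistance(feature=2048, normalize=True).to(device)

    # Accumulate Inception statistics of the real images.
    for real_batch in real_loader:
        fid.update(real_batch.to(device), real=True)

    # Generate samples with the given schedule, timing only the sampler.
    start = time.time()
    generated = 0
    while generated < num_samples:
        fake_batch = sample_with_schedule(schedule, batch_size)  # floats in [0, 1]
        fid.update(fake_batch.to(device), real=False)
        generated += batch_size
    elapsed = time.time() - start

    return fid.compute().item(), elapsed
```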
Sample quality of searched schedules