Datasets. We use CIFAR-10, CelebA, ImageNet-64, and LSUN-Church for unconditional generation and MS-COCO-256 for text-conditional generation.
Models. For unconditional generation, we vary the width and depth of the base architecture and train new models. For text-conditional generation, we use the official Stable Diffusion models v1-1 through v1-4.
Baselines. We use three baselines: (1) the best performance among all single-model choices paired with empirical sampler settings; (2) the best performance within the training set of the predictor; (3) the performance of randomly sampled model schedules. Baseline (1) shows the potential of optimizing the model schedule, while baselines (2) and (3) demonstrate the effectiveness of the proposed predictor-based search.
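To make the three baselines concrete, the following is a minimal sketch of how they could be computed from a pool of already-evaluated candidates. All names (`evaluate_fid`, `single_model_schedules`, `predictor_train_set`, `all_schedules`) are hypothetical placeholders, not part of our released implementation, and the treatment of baseline (3) as an average over random draws is an assumption.

```python
import random

def best_fid(schedules, evaluate_fid):
    """Return the lowest (best) FID among a set of candidate schedules."""
    return min(evaluate_fid(s) for s in schedules)

def baseline_fids(single_model_schedules, predictor_train_set,
                  all_schedules, evaluate_fid, num_random=50, seed=0):
    """Hypothetical computation of the three baselines described above.

    single_model_schedules: schedules that use a single model with an
        empirically chosen sampler setting (baseline 1).
    predictor_train_set: schedules used to train the predictor (baseline 2).
    all_schedules: the full search space, from which random schedules
        are drawn (baseline 3).
    evaluate_fid: callable mapping a schedule to its measured FID.
    """
    rng = random.Random(seed)
    baseline1 = best_fid(single_model_schedules, evaluate_fid)
    baseline2 = best_fid(predictor_train_set, evaluate_fid)
    random_picks = rng.sample(all_schedules, num_random)
    baseline3 = sum(evaluate_fid(s) for s in random_picks) / num_random
    return baseline1, baseline2, baseline3
```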
Efficiency evaluation. We report the FID of our searched schedules and all baselines under various budgets. Our results show that the searched schedules achieve sample quality comparable to the baselines while consuming much less time.
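As an illustration of this evaluation protocol, the sketch below measures the FID of one schedule while recording the wall-clock sampling time that is compared against the budget. The helpers `sample_with_schedule` and `real_loader` are hypothetical placeholders for the sampler and the reference-image loader, and the use of `torchmetrics` for FID is our assumption rather than the tooling used in the paper.

```python
import time
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

@torch.no_grad()
def fid_under_budget(schedule, sample_with_schedule, real_loader,
                     num_samples=50_000, batch_size=250, device="cuda"):
    """Generate samples with a model schedule, time the generation,
    and compute FID against the reference images."""
    fid = FrechetInceptionDistance(feature=2048, normalize=True).to(device)

    # Accumulate Inception statistics of the real images.
    for real_batch in real_loader:
        fid.update(real_batch.to(device), real=True)

    # Generate samples with the given schedule, timing only the sampler.
    start = time.time()
    generated = 0
    while generated < num_samples:
        fake_batch = sample_with_schedule(schedule, batch_size)  # floats in [0, 1]
        fid.update(fake_batch.to(device), real=False)
        generated += batch_size
    elapsed = time.time() - start

    return fid.compute().item(), elapsed
```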
Sample quality of searched schedules