Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward
Zhiwei Jia, Yuesong Nan, Huixi Zhao, Gengdai Liu
Zoom Communications
CVPR 2025
Visual Comparison (see HD version here)
Applying existing RL methods (DDPO, Diffusion-DPO, etc.) to fine-tune step-distilled diffusion models (DMs) is challenging for high-resolution (1024×1024), ultra-fast (≤2-step) image generation. We pinpoint the underlying challenges and propose to fine-tune 2-step DMs with a learned differentiable surrogate reward in the latent space. Our method, LaSRO, leverages pre-trained latent DMs to estimate reward gradients effectively and tailors reward optimization to ≤2-step image generation with efficient off-policy exploration. LaSRO successfully improves ultra-fast image generation under different reward signals (including non-differentiable ones).
Additional Visual Comparison
Analysis of Challenges in RL Fine-tuning 2-step DMs
• Hard exploration for 2-step image DMs
  • Insufficient on-policy exploration due to the reduced number of sampling steps in generation
• Degenerate RL objectives for 2-step DMs
  • The likelihood function of the generated data is not well-defined for 2-step DMs
• Non-smooth mappings underlying 2-step DMs
  • Step-distilled DMs are highly non-smooth, leading to high variance in policy gradient estimation
(left) Given the same text prompt and initial noise, reducing the number of sampling steps significantly decreases the diversity of the generated images, resulting in inadequate on-policy exploration. (right) Empirical verification shows that fewer generation steps make step-distilled DMs less smooth (and thus harder to optimize via policy gradients), as each step covers a broader timestep range.
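To make the left-panel observation concrete, below is a minimal sketch (not the paper's protocol) of such a diversity check: the initial latent noise is held fixed while the stochastic (ancestral) sampler is re-run with different seeds at several step counts, and diversity is measured as mean pairwise LPIPS distance. The model id ("stabilityai/sdxl-turbo"), the prompt, and the LPIPS-based metric are placeholder choices of ours, not details from the paper.

# Hedged sketch: fix the initial noise, re-sample under the default ancestral scheduler
# (which injects noise at intermediate steps), and measure output diversity per step count.
import itertools
import numpy as np
import torch
import lpips                                      # pip install lpips
from diffusers import AutoPipelineForText2Image

device = "cuda"
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16).to(device)
lpips_fn = lpips.LPIPS(net="alex").to(device)

def pil_to_lpips_tensor(img):
    # PIL image -> (1, 3, H, W) float tensor in [-1, 1], the range LPIPS expects
    x = torch.from_numpy(np.array(img)).permute(2, 0, 1).float() / 127.5 - 1.0
    return x.unsqueeze(0).to(device)

# One fixed initial latent (1024x1024 image -> 128x128 latent), shared across all samples
fixed_latents = torch.randn(
    1, pipe.unet.config.in_channels, 128, 128,
    generator=torch.Generator(device).manual_seed(0),
    device=device, dtype=torch.float16)

@torch.no_grad()
def mean_pairwise_lpips(prompt, num_steps, n_samples=6):
    imgs = []
    for s in range(n_samples):
        out = pipe(prompt, num_inference_steps=num_steps, guidance_scale=0.0,
                   latents=fixed_latents,                                   # same initial noise
                   generator=torch.Generator(device).manual_seed(1000 + s)) # different sampler noise
        imgs.append(out.images[0])
    dists = [lpips_fn(pil_to_lpips_tensor(a), pil_to_lpips_tensor(b)).item()
             for a, b in itertools.combinations(imgs, 2)]
    return sum(dists) / len(dists)

for steps in (1, 2, 4, 8):
    print(f"{steps:>2} steps: mean pairwise LPIPS = "
          f"{mean_pairwise_lpips('a photo of a corgi', steps):.3f}")

With ≤2 steps there is little or no intermediate noise injection, so repeated samples nearly coincide; this is the collapse of on-policy exploration described above.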
LaSRO: Latent-Space Surrogate Reward Optimization
We propose LaSRO, which addresses these challenges by leveraging pre-trained latent DMs for latent-space reward modeling and by enabling efficient optimization of 2-step DMs with off-policy exploration. LaSRO's training proceeds in two stages (a toy code sketch of the second stage follows the figure caption below):
1. Pre-train the latent-space surrogate reward using a target reward signal (which may be non-differentiable).
2. Alternate between reward fine-tuning the 2-step DM via the surrogate and adapting the surrogate online.
Illustration of the 2nd training stage of LaSRO, which follows the style of actor-critic methods. LaSRO benefits from (1) robust and effective reward-gradient estimation via the surrogate and (2) a sample-efficient off-policy exploration strategy.
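Below is a toy, self-contained PyTorch sketch of this alternating loop under our own simplifications: tiny MLPs stand in for the step-distilled latent DM and for the latent-space surrogate reward, and a black-box scalar stands in for a (possibly non-differentiable) target reward. None of the module or function names come from the released code; the sketch only illustrates the structure of stage 2 (stage 1 pre-trains the surrogate with the same regression objective before this loop starts).

import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 16

class TwoStepGenerator(nn.Module):
    """Stand-in for a step-distilled latent DM: noise -> latent in two refinement steps."""
    def __init__(self):
        super().__init__()
        self.step1 = nn.Linear(LATENT_DIM, LATENT_DIM)
        self.step2 = nn.Linear(LATENT_DIM, LATENT_DIM)
    def forward(self, noise):
        return self.step2(torch.tanh(self.step1(noise)))      # differentiable 2-step rollout

class SurrogateReward(nn.Module):
    """Latent-space surrogate reward: latent -> scalar, differentiable by construction."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.SiLU(), nn.Linear(64, 1))
    def forward(self, z):
        return self.net(z).squeeze(-1)

def target_reward(z):
    """Black-box target reward (e.g. an external scorer); treated as non-differentiable."""
    with torch.no_grad():
        return -(z - 1.0).pow(2).mean(dim=-1)                 # toy signal: prefer latents near 1

gen, sur = TwoStepGenerator(), SurrogateReward()
opt_gen = torch.optim.AdamW(gen.parameters(), lr=1e-3)
opt_sur = torch.optim.AdamW(sur.parameters(), lr=1e-3)

for it in range(2000):
    noise = torch.randn(64, LATENT_DIM)

    # (a) reward fine-tune the generator through the differentiable surrogate
    z = gen(noise)
    loss_gen = -sur(z).mean()                                  # maximize surrogate reward
    opt_gen.zero_grad(); loss_gen.backward(); opt_gen.step()

    # (b) off-policy exploration + online adaptation of the surrogate
    with torch.no_grad():
        z_explore = z + 0.3 * torch.randn_like(z)              # perturb samples for broader coverage
        r = target_reward(z_explore)
    loss_sur = F.mse_loss(sur(z_explore), r)                   # keep surrogate aligned with target reward
    opt_sur.zero_grad(); loss_sur.backward(); opt_sur.step()

print("mean target reward:", target_reward(gen(torch.randn(256, LATENT_DIM))).mean().item())

The key point the sketch illustrates is that the generator never needs gradients from the target reward itself: it only backpropagates through the latent-space surrogate, while the surrogate is continually re-fitted on off-policy samples so it stays accurate where the generator currently explores.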
The associated CVPR 2025 paper can be cited with the following BibTeX:
@article{jia2024lasro,
title={Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward},
author={Zhiwei Jia and Yuesong Nan and Huixi Zhao and Gengdai Liu},
journal={arXiv preprint arXiv:2411.15247},
year={2024}
}