Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward
Zhiwei Jia, Yuesong Nan, Huixi Zhao, Gengdai Liu
Zoom Communications
CVPR 2025
Visual Comparison (see HD version here)
Applying existing RL methods (DDPO, Diffusion-DPO, etc.) to fine-tune step-distilled diffusion models (DMs) is challenging for high-resolution (1024×1024), ultra-fast (≤2-step) image generation. We pinpoint the underlying challenges and propose to fine-tune 2-step DMs with a learned differentiable surrogate reward in the latent space. Our method, LaSRO, leverages pre-trained latent DMs to estimate reward gradients effectively and tailors reward optimization to ≤2-step image generation with efficient off-policy exploration. LaSRO successfully improves ultra-fast image generation under different reward signals (including non-differentiable ones).
Additional Visual Comparison
Analysis of Challenges in RL Fine-tuning 2-step DMs
• Hard exploration for 2-step image DMs
  • Insufficient on-policy exploration due to the reduced number of sampling steps in generation
• Degenerate RL objectives for 2-step DMs
  • The likelihood function of the generated data is not well-defined for 2-step DMs
• Non-smooth mappings underlying 2-step DMs
  • Step-distilled DMs are highly non-smooth, leading to high variance in policy gradient estimation
(left) Given the same text prompt and initial noise, reducing the number of sampling steps significantly decreases the diversity of the generated images, resulting in inadequate on-policy exploration. (right) Empirical verification shows that fewer generation steps make step-distilled DMs less smooth (and thus harder to optimize via policy gradients), as each step covers a broader timestep range.
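To make the left-panel observation concrete, below is a minimal sketch (not the paper's protocol) of such a diversity check: the initial latent noise is held fixed while the stochastic (ancestral) sampler is re-run with different seeds at several step counts, and diversity is measured as mean pairwise LPIPS distance. The model id ("stabilityai/sdxl-turbo"), the prompt, and the LPIPS-based metric are placeholder choices of ours, not details from the paper.

# Hedged sketch: fix the initial noise, re-sample under the default ancestral scheduler
# (which injects noise at intermediate steps), and measure output diversity per step count.
import itertools
import numpy as np
import torch
import lpips                                      # pip install lpips
from diffusers import AutoPipelineForText2Image

device = "cuda"
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16).to(device)
lpips_fn = lpips.LPIPS(net="alex").to(device)

def pil_to_lpips_tensor(img):
    # PIL image -> (1, 3, H, W) float tensor in [-1, 1], the range LPIPS expects
    x = torch.from_numpy(np.array(img)).permute(2, 0, 1).float() / 127.5 - 1.0
    return x.unsqueeze(0).to(device)

# One fixed initial latent (1024x1024 image -> 128x128 latent), shared across all samples
fixed_latents = torch.randn(
    1, pipe.unet.config.in_channels, 128, 128,
    generator=torch.Generator(device).manual_seed(0),
    device=device, dtype=torch.float16)

@torch.no_grad()
def mean_pairwise_lpips(prompt, num_steps, n_samples=6):
    imgs = []
    for s in range(n_samples):
        out = pipe(prompt, num_inference_steps=num_steps, guidance_scale=0.0,
                   latents=fixed_latents,                                   # same initial noise
                   generator=torch.Generator(device).manual_seed(1000 + s)) # different sampler noise
        imgs.append(out.images[0])
    dists = [lpips_fn(pil_to_lpips_tensor(a), pil_to_lpips_tensor(b)).item()
             for a, b in itertools.combinations(imgs, 2)]
    return sum(dists) / len(dists)

for steps in (1, 2, 4, 8):
    print(f"{steps:>2} steps: mean pairwise LPIPS = "
          f"{mean_pairwise_lpips('a photo of a corgi', steps):.3f}")

With ≤2 steps there is little or no intermediate noise injection, so repeated samples nearly coincide; this is the collapse of on-policy exploration described above.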
LaSRO: Latent-Space Surrogate Reward Optimization
We propose LaSRO, which addresses these challenges by leveraging pre-trained latent DMs for latent-space reward modeling and by enabling efficient optimization of 2-step DMs with off-policy exploration. LaSRO's training proceeds in two stages (a toy code sketch of the second stage follows the figure caption below):
1. Pre-train the latent-space surrogate reward using a target reward signal (which may be non-differentiable).
2. Alternate between reward fine-tuning the 2-step DM via the surrogate and adapting the surrogate online.
Illustration of the 2nd training stage of LaSRO, which follows the style of actor-critic methods. LaSRO benefits from (1) robust and effective reward-gradient estimation via the surrogate and (2) a sample-efficient off-policy exploration strategy.
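Below is a toy, self-contained PyTorch sketch of this alternating loop under our own simplifications: tiny MLPs stand in for the step-distilled latent DM and for the latent-space surrogate reward, and a black-box scalar stands in for a (possibly non-differentiable) target reward. None of the module or function names come from the released code; the sketch only illustrates the structure of stage 2 (stage 1 pre-trains the surrogate with the same regression objective before this loop starts).

import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 16

class TwoStepGenerator(nn.Module):
    """Stand-in for a step-distilled latent DM: noise -> latent in two refinement steps."""
    def __init__(self):
        super().__init__()
        self.step1 = nn.Linear(LATENT_DIM, LATENT_DIM)
        self.step2 = nn.Linear(LATENT_DIM, LATENT_DIM)
    def forward(self, noise):
        return self.step2(torch.tanh(self.step1(noise)))      # differentiable 2-step rollout

class SurrogateReward(nn.Module):
    """Latent-space surrogate reward: latent -> scalar, differentiable by construction."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.SiLU(), nn.Linear(64, 1))
    def forward(self, z):
        return self.net(z).squeeze(-1)

def target_reward(z):
    """Black-box target reward (e.g. an external scorer); treated as non-differentiable."""
    with torch.no_grad():
        return -(z - 1.0).pow(2).mean(dim=-1)                 # toy signal: prefer latents near 1

gen, sur = TwoStepGenerator(), SurrogateReward()
opt_gen = torch.optim.AdamW(gen.parameters(), lr=1e-3)
opt_sur = torch.optim.AdamW(sur.parameters(), lr=1e-3)

for it in range(2000):
    noise = torch.randn(64, LATENT_DIM)

    # (a) reward fine-tune the generator through the differentiable surrogate
    z = gen(noise)
    loss_gen = -sur(z).mean()                                  # maximize surrogate reward
    opt_gen.zero_grad(); loss_gen.backward(); opt_gen.step()

    # (b) off-policy exploration + online adaptation of the surrogate
    with torch.no_grad():
        z_explore = z + 0.3 * torch.randn_like(z)              # perturb samples for broader coverage
        r = target_reward(z_explore)
    loss_sur = F.mse_loss(sur(z_explore), r)                   # keep surrogate aligned with target reward
    opt_sur.zero_grad(); loss_sur.backward(); opt_sur.step()

print("mean target reward:", target_reward(gen(torch.randn(256, LATENT_DIM))).mean().item())

The key point the sketch illustrates is that the generator never needs gradients from the target reward itself: it only backpropagates through the latent-space surrogate, while the surrogate is continually re-fitted on off-policy samples so it stays accurate where the generator currently explores.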
The associated CVPR 2025 paper can be cited with the following BibTeX:
@article{jia2024lasro,
title={Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward},
author={Zhiwei Jia and Yuesong Nan and Huixi Zhao and Gengdai Liu},
journal={arXiv preprint arXiv:2411.15247},
year={2024}
}