Paste, Inpaint and Harmonize via Denoising: 

Subject-Driven Image Editing with Pre-Trained Diffusion Model

Under review.

Abstract

Text-to-image generative models have attracted growing attention for flexible image editing via user-specified descriptions. However, text descriptions alone cannot fully convey the details of a subject, often compromising the subject's identity or requiring additional per-subject fine-tuning.

We introduce a new framework called Paste, Inpaint and Harmonize via Denoising (PhD), which leverages an exemplar image in addition to text descriptions to specify user intentions.

In the pasting step, an off-the-shelf segmentation model identifies the user-specified subject in the exemplar image; the segmented subject is then inserted into a background image, providing an initialization that captures both scene context and subject identity.
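For illustration, the paste step can be realized as a simple mask-guided cut-and-paste. The sketch below assumes a binary subject mask produced by an off-the-shelf segmentation model and a user-provided target box; the function and argument names are ours and do not reflect the released implementation.

import numpy as np
from PIL import Image

def paste_subject(exemplar, subject_mask, background, box):
    """Cut the masked subject out of the exemplar and paste it into the
    target box of the background, producing the pasted initialization."""
    # Tight bounding box around the subject mask (values in [0, 1], 1 = subject).
    ys, xs = np.nonzero(subject_mask > 0.5)
    crop_box = (xs.min(), ys.min(), xs.max() + 1, ys.max() + 1)
    subject = exemplar.crop(crop_box)
    mask = Image.fromarray((subject_mask * 255).astype(np.uint8)).crop(crop_box)

    # Resize subject and mask to the user-specified editing region.
    x0, y0, x1, y1 = box
    subject = subject.resize((x1 - x0, y1 - y0))
    mask = mask.resize((x1 - x0, y1 - y0))

    # Alpha-composite: background outside the mask, subject inside.
    pasted = background.copy()
    pasted.paste(subject, (x0, y0), mask)
    return pasted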

To guarantee the visual coherence of the edited image, we introduce an inpainting and harmonizing module that guides the pre-trained diffusion model to blend the inserted subject into the scene seamlessly. Because the pre-trained diffusion model is kept frozen, its strong image-synthesis and text-conditioning capabilities are preserved, enabling high-quality results and flexible editing with diverse text prompts.

In our experiments, we apply PhD to subject-driven image editing and further explore text-driven scene generation given a reference subject.

Both quantitative and qualitative comparisons with baseline methods demonstrate that our approach achieves state-of-the-art performance in both tasks.

PhD: Paste, Inpaint and Harmonize via Denoising


The Training Pipeline of PhD


Figure 3: The self-supervised training pipeline of our proposed PhD. In the Paste step, we automate region selection with a bounding box; the selected region is then processed by U2Net to segment the subject and remove the background, and strong augmentation yields the pasted image \hat{I}_p. In the Inpaint and Harmonize via Denoising step, we employ F_c (a UNet) as a feature adapter to extract low-level subject details such as geometry and texture. By adding the control feature c to the features of the frozen SD, we train F_c to reconstruct the input image I_p with an L2 loss. This efficient pipeline requires little training data while enabling comprehensive, high-quality subject-driven image editing.
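For concreteness, the PyTorch-style sketch below shows one training step as described in the caption: the SD UNet, VAE, and text encoder stay frozen, only the feature adapter F_c is optimized, and the objective is the standard L2 noise-prediction loss. The interfaces of frozen_sd and adapter_fc (including the control argument) are assumptions for illustration, not the exact released code.

import torch
import torch.nn.functional as F

def training_step(frozen_sd, adapter_fc, vae, text_encoder, scheduler,
                  I_p, I_p_hat, prompt_ids, optimizer):
    """One optimization step; I_p is the original image, I_p_hat the augmented
    pasted image. Only adapter_fc (F_c) has trainable parameters."""
    with torch.no_grad():                                   # VAE and text encoder are frozen
        latents = vae.encode(I_p).latent_dist.sample() * 0.18215
        text_emb = text_encoder(prompt_ids)[0]

    # Standard diffusion training: add noise at a random timestep.
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, t)

    # F_c encodes the pasted image into control features c, which are
    # added to the frozen SD UNet features inside frozen_sd (assumed interface).
    control = adapter_fc(I_p_hat, noisy_latents, t, text_emb)
    noise_pred = frozen_sd(noisy_latents, t, text_emb, control=control)

    loss = F.mse_loss(noise_pred, noise)                    # L2 loss
    loss.backward()                                         # gradients reach only F_c
    optimizer.step()
    optimizer.zero_grad()
    return loss.detach()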

Figure 4: Details of the Inpaint and Harmonize Module (IHM). IHM acts as a feature adapter that encodes \hat{I}_p into a feature map (the control feature c) with the same dimensions as the frozen SD features. By adding c to the frozen SD features, IHM modulates them and thereby alters the output image without updating the frozen SD.
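A minimal sketch of the residual feature injection described above: the pasted image is encoded into control features whose shapes match the frozen SD feature maps, and these are added on without touching the frozen weights. The stage layout and channel sizes here are illustrative assumptions; the actual adapter F_c is a UNet.

import torch
import torch.nn as nn
import torch.nn.functional as F

class InpaintHarmonizeModule(nn.Module):
    """Encode the pasted image into control features c and add them
    residually to the frozen SD UNet feature maps."""
    def __init__(self, channels=(320, 640, 1280)):           # assumed to match the SD UNet
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = 3
        for out_ch in channels:
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                nn.SiLU(),
                nn.Conv2d(out_ch, out_ch, 3, padding=1),
            ))
            in_ch = out_ch

    def forward(self, pasted_image, frozen_features):
        """frozen_features: feature maps taken from the frozen SD UNet,
        one per stage, ordered from high to low resolution."""
        x = pasted_image
        modulated = []
        for stage, f in zip(self.stages, frozen_features):
            x = stage(x)
            c = F.interpolate(x, size=f.shape[-2:])           # match spatial dimensions
            modulated.append(f + c)                           # residual addition; frozen weights untouched
        return modulated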

Subject-Driven Scene Image Editing


Figure 1: Qualitative results of Subject-Driven Scene Image Editing. Our PhD pipeline can generate complex scenes containing relatively large objects. The area outlined in red denotes the editing region.

Ablation Study of the Effect of Textual Prompts


Ablation study of the effect of textual prompts on Subject-Driven Scene Image Editing. Our PhD pipeline can generate complex scenes guided by text prompts. The area outlined in red denotes the editing region.

More Results


Subject-Driven Image Editing



Subject-Driven Image Generation

Quantitative Results

Subject-Driven Image Editing