Zhu Xu¹, Zhaowen Wang², Yuxin Peng¹, Yang Liu¹*
1Wangxuan Institute of Computer Technology, Peking University
2Adobe Research
ACM-MM 2025
*Corresponding Author
Compositional Customized Image Generation aims to customize multiple target concepts within generated content, and has gained attention for its wide range of applications. Despite great success, existing approaches mainly concentrate on preserving the appearance of the target entities while neglecting fine-grained interaction control among them. To equip models with such interaction control capability, we focus on the human-object interaction scenario and propose the task of Customized Human Object Interaction Image Generation (CHOI), which simultaneously requires identity preservation for the target human and object and control of the interaction semantics between them. We identify two primary challenges of CHOI: (1) the simultaneous demands of identity preservation and interaction control require the model to decompose the human and object into self-contained identity features and pose-oriented interaction features, yet current HOI image datasets fail to provide ideal samples for such feature-decomposed learning; (2) an inappropriate spatial configuration between the human and object can cause the loss of the desired interaction semantics, as it may give wrong hints about the human and object body parts crucial for expressing the interaction. To tackle these issues, we first collect and process a large-scale dataset in which each sample contains the same human-object pair in different interactive poses. Such data is tailored for CHOI training, from which the model can learn to decompose identity features and interaction features for the target human and object. To provide an appropriate spatial configuration for interaction semantic expression, we then design a two-stage model, Interact-Custom, which first explicitly models the spatial configuration by generating a foreground mask depicting the interaction behavior, and then, under the guidance of this mask, generates the target human and object interacting while preserving their identity features. Furthermore, if users provide a background image and the union location where the target human and object should appear, Interact-Custom offers the optional functionality to specify them, enabling high content controllability. Extensive experiments on our tailored metrics for the CHOI task demonstrate the effectiveness of our approach.
The overall pipeline of Interact-Custom. In the Interaction-Aware Mask Generation (IAMG) stage, the generation-based model takes the prompt Tinter as condition and generates a human-object mask Mfore that accurately conveys the target interaction semantics. Mfore is adopted as spatial-configuration guidance for image generation. In the Mask-Guided Image Generation (MGIG) stage, Ih and Io are used to extract ID features and fine-grained detail features Fhigh, which serve as disentangled, self-contained identity features. Mfore is then incorporated to guide the expression of the interaction semantics. Optionally, a background image Ibg and a location B specifying the union region of the human-object interaction can serve as additional inputs to control the background and placement.
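To make the two-stage data flow concrete, the following is a minimal Python sketch of how the IAMG and MGIG stages compose. The class and argument names (InteractCustomPipeline, mask_generator, image_generator, and their keyword arguments) are hypothetical placeholders for illustration; the paper specifies only the inputs and outputs of each stage, not an API.

```python
# Hypothetical sketch of the Interact-Custom two-stage pipeline.
# Names are illustrative placeholders, not the released implementation.

from PIL import Image


class InteractCustomPipeline:
    def __init__(self, mask_generator, image_generator):
        self.mask_generator = mask_generator    # IAMG stage model
        self.image_generator = image_generator  # MGIG stage model

    def __call__(
        self,
        I_h: Image.Image,                           # reference human image
        I_o: Image.Image,                           # reference object image
        T_inter: str,                               # interaction prompt
        I_bg: Image.Image | None = None,            # optional background
        B: tuple[int, int, int, int] | None = None  # optional union region
    ) -> Image.Image:
        # Stage 1 (IAMG): generate the foreground mask M_fore from the
        # interaction prompt; it encodes the human-object spatial
        # configuration that expresses the target interaction semantics.
        M_fore = self.mask_generator(prompt=T_inter, region=B)

        # Stage 2 (MGIG): extract ID features and fine-grained detail
        # features F_high from I_h and I_o as disentangled, self-contained
        # identity features, then generate the interacting pair under the
        # guidance of M_fore (and, optionally, the background and region).
        return self.image_generator(
            human_ref=I_h,
            object_ref=I_o,
            mask=M_fore,
            background=I_bg,
            region=B,
        )
```

Keeping the mask generator and the mask-guided image generator as separate components mirrors the paper's design: the spatial configuration is committed to explicitly (as Mfore) before any identity-preserving synthesis happens.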
Results
Performance on subject customization
Performance on interaction semantic control
Visualizations
Qualitative comparison of different approaches
Qualitative results of our approach with different conditions