Semantic-Aware Human Object Interaction Image Generation
Zhu Xu¹, Qingchao Chen², Yuxin Peng¹, Yang Liu¹*
¹ Wangxuan Institute of Computer Technology, Peking University
² National Institute of Health Data Science, Peking University
Recent text-to-image generative models have demonstrated remarkable abilities in generating realistic images. Despite this success, these models struggle to generate high-fidelity images from prompts oriented toward human-object interaction (HOI). The difficulty in HOI generation arises from two aspects. First, the complexity and diversity of human poses make plausible human generation challenging. Second, untrustworthy generation of interaction boundary regions may lead to deficient HOI semantics. To tackle these problems, we propose SA-HOI, a Semantic-Aware HOI generation framework. It uses human pose quality and interaction boundary region information as guidance for the denoising process, thereby encouraging refinement in these regions to produce more reasonable HOI images. Building on this guidance, we establish an iterative inversion and image refinement pipeline to continually enhance generation quality. Further, we introduce a comprehensive benchmark for HOI generation, which comprises a dataset involving diverse and fine-grained HOI categories, along with multiple custom-tailored evaluation metrics for HOI generation. Experiments demonstrate that our method significantly improves generation quality under both HOI-specific and conventional image evaluation metrics.
SA-HOI contains two designs: Pose and Interaction Boundary Guidance (PIBG) and Iterative Inversion and Refinement (IIR). In PIBG, a low-quality pose mask and an interaction boundary mask guide the Gaussian blurring of the image latent feature at each denoising step, with the aim of refining the low-pose-quality parts and the interaction boundary regions of the original image. IIR gradually enhances generation quality using an inversion model N together with PIBG. Our benchmark includes a dataset of realistic images covering human-object, human-animal, and human-human interactions, as well as comprehensive HOI evaluation metrics reflecting the authenticity, plausibility, and fidelity of generated HOI images.
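As a rough illustration of the two ideas above, the following toy sketch restricts a Gaussian blur to the union of two masks on a small 2D grid and then alternates two placeholder steps in a loop. It is not the authors' implementation: the function names (`masked_blur`, `iterative_refine`), the use of a plain 2D list in place of a latent tensor, the union-of-masks blending rule, and the `invert` and `refine` callables are all assumptions made for illustration only.

```python
import math


def gaussian_kernel_1d(sigma, radius):
    """Return a normalized 1D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_2d(grid, sigma=1.0, radius=2):
    """Separable Gaussian blur of a 2D list of floats, clamping at the edges."""
    kernel = gaussian_kernel_1d(sigma, radius)
    h, w = len(grid), len(grid[0])

    def clamp(i, n):
        return max(0, min(n - 1, i))

    # Horizontal pass.
    rows = [[sum(kernel[k + radius] * grid[y][clamp(x + k, w)]
                 for k in range(-radius, radius + 1))
             for x in range(w)] for y in range(h)]
    # Vertical pass.
    return [[sum(kernel[k + radius] * rows[clamp(y + k, h)][x]
                 for k in range(-radius, radius + 1))
             for x in range(w)] for y in range(h)]


def masked_blur(latent, pose_mask, boundary_mask, sigma=1.0, radius=2):
    """Blur only where either mask is set (assumed rule); keep the rest as-is."""
    blurred = blur_2d(latent, sigma, radius)
    out = []
    for y in range(len(latent)):
        row = []
        for x in range(len(latent[0])):
            m = max(pose_mask[y][x], boundary_mask[y][x])  # union of the masks
            row.append(m * blurred[y][x] + (1.0 - m) * latent[y][x])
        out.append(row)
    return out


def iterative_refine(image, invert, refine, rounds=3):
    """Alternate a hypothetical inversion step and a hypothetical refinement step."""
    for _ in range(rounds):
        noisy = invert(image)   # stand-in for the inversion model N
        image = refine(noisy)   # stand-in for mask-guided denoising
    return image
```

In this sketch, cells outside both masks pass through unchanged, so only the flagged regions are perturbed. How the blurred latent actually enters the guidance term of each denoising step is not specified in the text above and is left out here.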
@inproceedings{xusemantic,
title={Semantic-Aware Human Object Interaction Image Generation},
author={Xu, Zhu and Chen, Qingchao and Peng, Yuxin and Liu, Yang},
booktitle={Forty-first International Conference on Machine Learning},
year={2024}
}