InstructBooth: Instruction-following Personalized Text-to-image Generation
Daewon Chae, Nokyung Park, Jinkyu Kim*, Kimin Lee*
Korea University, KAIST
*Co-corresponding authors
[Paper] [Code (soon)]
TL;DR: We improve the image-text alignment of personalized text-to-image models by adding a subsequent RL fine-tuning stage.
InstructBooth can generate new images of unseen Phryge, the Paris 2024 Olympic mascot plushie, participating in various sports
Abstract
Personalizing text-to-image models using a limited set of images of a specific object has been explored for subject-specific image generation. However, existing methods often struggle to align generated images with text prompts because they overfit to the limited training images. In this work, we introduce InstructBooth, a novel method designed to enhance image-text alignment in personalized text-to-image models without sacrificing personalization ability. Our approach first personalizes a text-to-image model with a small number of subject-specific images using a unique identifier. After personalization, we fine-tune the personalized model using reinforcement learning to maximize a reward that quantifies image-text alignment. Additionally, we propose complementary techniques to increase the synergy between these two processes. Our method demonstrates superior image-text alignment compared to existing baselines while maintaining high personalization ability. In human evaluations, InstructBooth outperforms these baselines when all relevant factors are considered.
Overview
Our method consists of two main steps: (left) personalization with a few images of the subject, where a pre-trained text-to-image model is fine-tuned with a unique identifier, and (right) RL fine-tuning for improving image-text alignment, where we further fine-tune the personalized model to maximize a reward that quantifies image-text alignment.
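The reward-maximization idea in the second step can be illustrated with a toy policy-gradient loop. This is a minimal sketch under strong assumptions, not the InstructBooth implementation: the "model" is a softmax policy over four discrete outcomes standing in for generated images, the reward values are made-up stand-ins for an image-text alignment score, and the REINFORCE-style update with a batch-mean baseline is a generic choice that the text above does not specify. All names (`rl_finetune`, `alignment_rewards`, `personalized_logits`) are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits, rewards):
    """Exact expected reward of the softmax policy."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def rl_finetune(logits, rewards, steps=2000, lr=0.1, batch=16, seed=0):
    """REINFORCE-style fine-tuning with a batch-mean baseline (toy version)."""
    rng = random.Random(seed)
    logits = list(logits)
    n = len(logits)
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a batch of outcomes from the current policy.
        samples = rng.choices(range(n), weights=probs, k=batch)
        baseline = sum(rewards[s] for s in samples) / batch
        grad = [0.0] * n
        for s in samples:
            advantage = rewards[s] - baseline
            for j in range(n):
                # d log p(s) / d logit_j = 1[j == s] - p_j for a softmax policy
                grad[j] += advantage * ((1.0 if j == s else 0.0) - probs[j])
        # Gradient ascent on the estimated expected reward.
        logits = [l + lr * g / batch for l, g in zip(logits, grad)]
    return logits


# Hypothetical starting point: a "personalized" policy that is biased toward
# outcome 0, mimicking overfitting to the few training images.
personalized_logits = [3.0, 0.0, 0.0, 0.0]
# Made-up alignment scores for each outcome (assumption, not real data).
alignment_rewards = [0.2, 0.9, 0.5, 0.1]

before = expected_reward(personalized_logits, alignment_rewards)
tuned_logits = rl_finetune(personalized_logits, alignment_rewards)
after = expected_reward(tuned_logits, alignment_rewards)
print(f"expected reward before: {before:.3f}, after: {after:.3f}")
```

In a real pipeline the policy would be the personalized text-to-image model and the reward would come from an image-text alignment scorer; the toy only shows how starting from the personalized model and ascending the reward shifts probability toward better-aligned outputs.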
Comparison with Baselines
We provide generated samples from InstructBooth alongside those from the baselines. Note that [*] denotes a unique identifier.
Images generated by InstructBooth are preferred at least twice as often as those from the baselines when all relevant factors are considered.
Images on Unseen Prompts
We provide samples generated from unseen prompts, which were not used during the RL fine-tuning of InstructBooth. Note that [*] denotes a unique identifier.
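As a small illustration of the notation, the sketch below fills the [*] placeholder with an identifier token and checks that a set of evaluation prompts is disjoint from the prompts used for RL fine-tuning. The prompt strings, the identifier token "sks", and the helper name `fill_identifier` are all hypothetical; the actual prompts and identifier used by InstructBooth are not given here.

```python
PLACEHOLDER = "[*]"


def fill_identifier(template: str, identifier: str) -> str:
    """Replace the [*] placeholder in a prompt template with an identifier token."""
    if PLACEHOLDER not in template:
        raise ValueError(f"prompt has no {PLACEHOLDER} placeholder: {template!r}")
    return template.replace(PLACEHOLDER, identifier)


# Hypothetical prompt sets (assumptions for illustration only).
rl_training_prompts = [
    "a [*] plushie riding a bike",
    "a [*] plushie playing tennis",
]
unseen_prompts = [
    "a [*] plushie surfing a wave",
]

# Unseen prompts must not overlap with the prompts used during RL fine-tuning.
overlap = set(rl_training_prompts) & set(unseen_prompts)
if overlap:
    raise ValueError(f"prompts are not held out: {sorted(overlap)}")

filled_unseen = [fill_identifier(p, "sks") for p in unseen_prompts]
print(filled_unseen)
```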
Additional Results
We provide additional examples generated from unseen prompts, compared with those from existing methods. Note that [*] denotes a unique identifier.