InstructBooth: Instruction-following Personalized Text-to-image Generation

Daewon Chae, Nokyung Park, Jinkyu Kim*, Kimin Lee*

Korea University, KAIST

*Co-corresponding authors

[Paper] [Code (soon)]

TL;DR: We improve the image-text alignment of personalized text-to-image models by adding a subsequent RL fine-tuning stage.

InstructBooth can generate new images of an unseen subject, Phryge (the Paris 2024 Olympic mascot plushie), participating in various sports.

Abstract

Personalizing text-to-image models using a limited set of images of a specific object has been explored for subject-specific image generation. However, existing methods often struggle to follow text prompts because they overfit to the limited training images. In this work, we introduce InstructBooth, a novel method designed to enhance image-text alignment in personalized text-to-image models without sacrificing personalization ability. Our approach first personalizes a text-to-image model with a small number of subject-specific images using a unique identifier. After personalization, we fine-tune the personalized model using reinforcement learning to maximize a reward that quantifies image-text alignment. Additionally, we propose complementary techniques to increase the synergy between these two processes. Our method demonstrates superior image-text alignment compared to existing baselines while maintaining high personalization ability. In human evaluations, InstructBooth outperforms these baselines when all relevant factors are considered.

Overview

Our method consists of two main steps: (left) personalization with a few images of the subject, where a pre-trained text-to-image model is fine-tuned with a unique identifier, and (right) RL fine-tuning for improving image-text alignment, where we further fine-tune the personalized model to maximize a reward that quantifies image-text alignment.
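To make the second step concrete, here is a minimal toy sketch of reward-maximizing fine-tuning. It is not the InstructBooth implementation: this page does not specify the RL algorithm, so a plain REINFORCE update is assumed purely for illustration. A three-way categorical distribution stands in for the personalized model, and the reward values, `personalized_logits`, and all function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits, rewards):
    """Average reward under the current output distribution."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def reinforce_finetune(logits, rewards, steps=2000, lr=0.1, seed=0):
    """Toy REINFORCE loop: sample an output, score it, and raise the
    log-probability of outputs that score above the current average."""
    rng = random.Random(seed)
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(logits)), weights=probs)[0]
        # The expected reward serves as a baseline to reduce variance.
        baseline = sum(p * r for p, r in zip(probs, rewards))
        advantage = rewards[action] - baseline
        for k in range(len(logits)):
            # d/d logit_k of log softmax(logits)[action]
            grad = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad
    return logits


# Stand-in for step 1: a "personalized" model that strongly prefers
# output 0, mimicking overfitting to the few training images (assumption).
personalized_logits = [2.0, 0.0, 0.0]
# Hypothetical image-text alignment scores for the three candidate outputs.
alignment_rewards = [0.1, 0.9, 0.5]

tuned_logits = reinforce_finetune(personalized_logits, alignment_rewards)
print("before:", expected_reward(personalized_logits, alignment_rewards))
print("after: ", expected_reward(tuned_logits, alignment_rewards))
```

In this toy setting the update shifts probability mass toward outputs with higher alignment reward. In a real system the categorical policy would be replaced by the personalized text-to-image model and the fixed scores by a learned image-text alignment reward.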

Comparison with Baselines

We provide generated samples alongside those from baseline methods. Note that [*] denotes a unique identifier.

Images generated by InstructBooth are preferred at least twice as often as those from the baselines when all relevant factors are considered.

Images on Unseen Prompts

We provide samples generated from unseen prompts, which were not used during RL fine-tuning of InstructBooth. Note that [*] denotes a unique identifier.

Additional Results

We provide additional examples generated from unseen prompts, compared against other existing methods. Note that [*] denotes a unique identifier.