FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation

Xuehai He 1, Jian Zheng 2, Jacob Zhiyuan Fang 2, Robinson Piramuthu 2, Mohit Bansal 3, Vicente Ordonez 4, Gunnar A Sigurdsson 2, Nanyun Peng 5, Xin Eric Wang1

1UC Santa Cruz, 2Amazon, 3UNC Chapel Hill, 4Rice University, 5University of California, Los Angeles

Abstract

Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs from other modalities, such as edge maps. Nevertheless, current controllable T2I methods commonly face challenges related to efficiency and faithfulness, especially when conditioning on multiple inputs from either the same or diverse modalities. In this paper, we propose a novel Flexible and Efficient method, FlexEControl, for controllable T2I generation. At the core of FlexEControl is a unique weight decomposition strategy, which allows for streamlined integration of various input types. This approach not only enhances the faithfulness of the generated image to the controls, but also significantly reduces the computational overhead typically associated with multimodal conditioning. Our approach achieves a 41% reduction in trainable parameters and a 30% reduction in memory usage compared with Uni-ControlNet. Moreover, it doubles data efficiency and can flexibly generate images under the guidance of multiple input conditions from various modalities.

 Results

Qualitative performance of FlexEControl when conditioning on diverse compositions of multiple modalities. Each row in the figure corresponds to a distinct combination of conditions and text prompt: (first row) two canny edge maps with the prompt "A motorcycle in the forest"; (second row) two depth maps for "A car"; (third row) two sketch maps depicting "A vase with a green apple"; (fourth row) two canny edge maps for "Stormtrooper's lecture at the football field"; (fifth row) two segmentation maps for "A deer in the forests"; (sixth row) two MLSD edge maps for "A sofa in a desert"; and (seventh row) one segmentation map and one edge map for "A bird". These examples illustrate the ability of FlexEControl to effectively leverage multiple multimodal conditions, generating images that are both visually compelling and faithfully aligned with the given text prompts and input conditions.

Approach

Overview of FlexEControl: a decomposed matrix (shown in green) is shared across different input conditions, significantly improving the model's efficiency. During training, we integrate two specialized loss functions to enable flexible control and to handle conflicting conditions. In the example depicted here, the number of newly introduced parameters is condensed to 4 + 6n, where n denotes the number of decomposed matrix pairs.
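To make the parameter count above concrete, below is a minimal PyTorch sketch of one way a single shared decomposed matrix could adapt the weights of several condition branches. It assumes a Kronecker-style factorization (one small shared matrix A combined with n pairs of condition-specific vectors); the names SharedDecomposedLinear, shared_A, u, and v, as well as all shapes, are illustrative assumptions and are not taken from the released implementation.

# Hypothetical sketch of a shared weight decomposition across condition
# branches: one small matrix A (the shared "green matrix") is reused by
# every branch, while each of the n decomposed pairs adds only two short
# vectors. Shapes and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedDecomposedLinear(nn.Module):
    def __init__(self, in_features, out_features, shared_A, n_pairs=4):
        super().__init__()
        # Frozen weight of the pretrained layer being adapted.
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        # Small matrix shared across all condition branches.
        self.shared_A = shared_A  # shape (a1, a2)
        a1, a2 = shared_A.shape
        assert out_features % a1 == 0 and in_features % a2 == 0
        # n condition-specific vector pairs; u starts at zero so the
        # update is initially the identity mapping (no change).
        self.u = nn.Parameter(torch.zeros(n_pairs, out_features // a1))
        self.v = nn.Parameter(0.01 * torch.randn(n_pairs, in_features // a2))

    def delta_weight(self):
        # sum_i  A ⊗ (u_i v_i^T): one shared factor, n low-rank factors.
        delta = torch.zeros_like(self.base.weight)
        for u_i, v_i in zip(self.u, self.v):
            delta = delta + torch.kron(self.shared_A, torch.outer(u_i, v_i))
        return delta

    def forward(self, x):
        return F.linear(x, self.base.weight + self.delta_weight())

# Two branches (e.g. edge and depth) reuse the same shared matrix.
shared_A = nn.Parameter(0.01 * torch.randn(2, 2))
edge_branch = SharedDecomposedLinear(6, 6, shared_A, n_pairs=4)
depth_branch = SharedDecomposedLinear(6, 6, shared_A, n_pairs=4)
out = edge_branch(torch.randn(1, 6))

Under these assumed shapes, each adapted layer trains the 4 entries of A plus 6 entries per vector pair (3 for u_i and 3 for v_i), i.e. 4 + 6n parameters, which matches the example figure; the new matrices are added to the frozen pretrained weights rather than replacing them.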

Questions?

Contact Xuehai He at xhe89@ucsc.edu for more information about the project.

Citation

@article{he2024flexecontrol,
  title={FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation},
  author={He, Xuehai and Zheng, Jian and Fang, Jacob Zhiyuan and Piramuthu, Robinson and Bansal, Mohit and Ordonez, Vicente and Sigurdsson, Gunnar A and Peng, Nanyun and Wang, Xin Eric},
  journal={arXiv preprint arXiv:2405.04834},
  year={2024}
}