This project was completed as part of the course 16-824 (Visual Learning and Recognition) at Carnegie Mellon University (Fall 2023).
This project explores text-to-comic generation with machine learning, combining Stable Diffusion models and large language models. The primary goal is a pipeline that transforms text prompts into visually expressive comic illustrations, addressing key challenges such as maintaining a consistent art style across panels and enabling deliberate style changes within a comic.
The design is a two-stage pipeline: story generation with a large language model (LLM), followed by image generation with a fine-tuned Stable Diffusion model. The LLM, LLaMA-2-7b-chat, generates a creative and coherent narrative from a user-provided prompt; the narrative is then used to formulate per-panel prompts for the diffusion model. The diffusion model, fine-tuned with Dreambooth, generates comic-style images guided by these prompts, and ControlNet handles style transfer to keep the art style uniform across panels.
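For illustration, here is a minimal sketch of the two-stage pipeline, assuming the Hugging Face `transformers` and `diffusers` libraries. The checkpoint path `./dreambooth-comic-style`, the prompt wording, and the line-per-panel split are hypothetical placeholders, not the project's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from diffusers import StableDiffusionPipeline

# Stage 1: generate a short story with LLaMA-2-7b-chat.
llm_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(llm_id)
llm = AutoModelForCausalLM.from_pretrained(
    llm_id, torch_dtype=torch.float16, device_map="auto"
)
request = "Write a four-panel comic story about a robot learning to paint."
inputs = tokenizer(request, return_tensors="pt").to(llm.device)
output_ids = llm.generate(**inputs, max_new_tokens=256)
story = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Stage 2: render each panel with a Dreambooth-fine-tuned Stable Diffusion
# checkpoint (hypothetical local path).
pipe = StableDiffusionPipeline.from_pretrained(
    "./dreambooth-comic-style", torch_dtype=torch.float16
).to("cuda")
panel_prompts = [line for line in story.splitlines() if line.strip()]
panels = [pipe(f"comic panel, {p}").images[0] for p in panel_prompts]
```

In the full pipeline, ControlNet conditioning is layered onto the second stage to enforce a uniform style; a sketch of that step appears with the experiments below.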
This pipeline combines the strengths of language models and diffusion models, and it successfully generates comic panels with a consistent art style, offering a versatile tool for creative storytelling. While the current version shows promising results, future work could focus on refining the models, adding quantitative evaluations, and automating the combination of text and panels to improve usability. These results open avenues for further exploration at the intersection of machine learning and comic creation.
Several experiments evaluated the pipeline. Story generation improved after fine-tuning the LLM, which produced more engaging narratives. For the diffusion model, experiments swept parameters such as guidance scale and number of inference steps to optimize style transfer; a guidance scale of 7.5 with 60 inference steps gave the best results for both style and content preservation.
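A hedged sketch of how those settings could be applied with the `diffusers` ControlNet pipeline follows. The Canny ControlNet weights are a common public choice, and the edge-map path and fine-tuned checkpoint path are illustrative assumptions, not the project's exact setup.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "./dreambooth-comic-style",  # hypothetical fine-tuned checkpoint
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

control_image = load_image("./panel_edges.png")  # precomputed Canny edge map
result = pipe(
    "a comic-style panel of a robot painting at sunrise",
    image=control_image,
    guidance_scale=7.5,       # stronger adherence to the text prompt
    num_inference_steps=60,   # more denoising steps, slower but cleaner
).images[0]
result.save("styled_panel.png")
```

Higher guidance scales push the sampler toward the prompt at the cost of sample diversity, which is consistent with the moderate 7.5 setting pairing well with the longer 60-step schedule.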