Sketch and Text Guided Image Generation Using GANs
In computer vision, generating realistic images from hand-drawn sketches is a compelling task, and it becomes more powerful when guided by text descriptions. This project proposes a GAN-based model that takes both a sketch and a text prompt as input to synthesize detailed images. Traditional models such as Pix2Pix rely solely on sketches and miss the semantic cues carried by text. To address this, we integrate text embeddings (obtained from a pre-trained BERT model and pooled into a fixed-size vector) into the generator. We also incorporate a perceptual loss based on VGG19 to preserve high-level visual features. This approach produces semantically rich and visually accurate images, suitable for applications such as digital art and facial reconstruction.
The dataset used for this project consists of:
30,000 sketches generated using the Canny edge detection method,
30,000 corresponding text prompts describing visual features such as hairstyle, facial expression, or accessories,
30,000 original face images serving as ground truth references.
Each sketch-text-image triplet forms a complete training sample. The diversity in sketches and textual descriptions ensures that the model learns to associate visual structures with semantic context, which enhances the quality of the generated images during inference.
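As a concrete illustration of how the sketches can be derived from the ground-truth photos, the snippet below applies OpenCV's Canny detector; the file paths, thresholds, and inversion step are illustrative assumptions rather than the project's exact preprocessing settings.

```python
# Minimal sketch-extraction example using OpenCV's Canny edge detector.
# Paths and thresholds are assumptions for illustration only.
import cv2

def image_to_sketch(image_path, low_thresh=100, high_thresh=200):
    """Convert a face photo into a grayscale edge-map 'sketch'."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)  # load as single-channel
    img = cv2.resize(img, (256, 256))                   # match the model input size
    edges = cv2.Canny(img, low_thresh, high_thresh)     # binary edge map
    return 255 - edges                                  # invert: dark strokes on white

sketch = image_to_sketch("faces/000001.jpg")            # hypothetical file name
cv2.imwrite("sketches/000001.png", sketch)
```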
To effectively translate sketches into realistic images while incorporating semantic information from textual descriptions, we propose an enhanced Text-Conditioned Pix2Pix GAN architecture. This model builds upon the traditional Pix2Pix framework by integrating text features and applying perceptual losses for improved realism.
The generator is based on a U-Net encoder-decoder structure, modified to include text features (a simplified code sketch follows the list below):
Input size: 256×256×1 (grayscale sketch)
Convolutional encoder compresses the image into a latent representation.
Skip connections pass fine-grained details to the decoder.
Text prompts are embedded using a pre-trained BERT model.
The BERT output is passed through a linear projection to match the latent dimensionality of the sketch features.
Sketch and text embeddings are concatenated at the bottleneck.
Decoder upsamples this joint representation into a full-resolution image.
Output: 256×256×3 RGB image
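A simplified PyTorch sketch of this text-conditioned U-Net is shown below. The layer widths, the 768-dimensional BERT vector, the 512-dimensional text projection, and the five-level depth are illustrative assumptions; the actual model may use a deeper Pix2Pix-style encoder.

```python
# Simplified text-conditioned U-Net generator (depth and channel widths are
# illustrative assumptions, not the exact project configuration).
import torch
import torch.nn as nn

def down(cin, cout):  # encoder block: halves spatial resolution
    return nn.Sequential(nn.Conv2d(cin, cout, 4, 2, 1),
                         nn.BatchNorm2d(cout), nn.LeakyReLU(0.2))

def up(cin, cout):    # decoder block: doubles spatial resolution
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, 2, 1),
                         nn.BatchNorm2d(cout), nn.ReLU())

class TextConditionedUNet(nn.Module):
    def __init__(self, text_dim=768, text_proj=512):
        super().__init__()
        # encoder: 256 -> 128 -> 64 -> 32 -> 16 -> 8 spatial resolution
        self.e1, self.e2 = down(1, 64), down(64, 128)
        self.e3, self.e4, self.e5 = down(128, 256), down(256, 512), down(512, 512)
        # project the pooled BERT embedding to match the bottleneck
        self.text_fc = nn.Linear(text_dim, text_proj)
        # decoder; input channels include text and skip-connection features
        self.d1 = up(512 + text_proj, 512)
        self.d2, self.d3, self.d4 = up(1024, 256), up(512, 128), up(256, 64)
        self.out = nn.Sequential(nn.ConvTranspose2d(128, 3, 4, 2, 1), nn.Tanh())

    def forward(self, sketch, text_emb):
        s1 = self.e1(sketch)        # (B,  64, 128, 128)
        s2 = self.e2(s1)            # (B, 128,  64,  64)
        s3 = self.e3(s2)            # (B, 256,  32,  32)
        s4 = self.e4(s3)            # (B, 512,  16,  16)
        b  = self.e5(s4)            # bottleneck: (B, 512, 8, 8)
        # broadcast the projected text vector over the bottleneck and concatenate
        t = self.text_fc(text_emb)[:, :, None, None].expand(-1, -1, b.size(2), b.size(3))
        x = self.d1(torch.cat([b, t], dim=1))
        x = self.d2(torch.cat([x, s4], dim=1))
        x = self.d3(torch.cat([x, s3], dim=1))
        x = self.d4(torch.cat([x, s2], dim=1))
        return self.out(torch.cat([x, s1], dim=1))   # (B, 3, 256, 256)

# generator = TextConditionedUNet()
# fake = generator(torch.randn(4, 1, 256, 256), torch.randn(4, 768))
```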
A PatchGAN discriminator is used to focus on local realism:
Input: Real/generated image concatenated with the input sketch
Output: 30×30 patch-level real/fake probabilities
This localized structure helps detect high-frequency artifacts.
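A minimal PyTorch sketch of such a discriminator is given below. The channel widths follow the standard 70×70 PatchGAN layout from Pix2Pix, which is an assumption about the project's exact configuration, but it does yield the 30×30 patch grid described above for a 256×256 input.

```python
# Minimal PatchGAN discriminator sketch; filter counts follow the standard
# Pix2Pix layout and are assumptions, not the project's exact settings.
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=1 + 3):            # sketch (1) + RGB image (3)
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, 2, 1), nn.LeakyReLU(0.2),                      # 256 -> 128
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),   # 128 -> 64
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),  # 64 -> 32
            nn.Conv2d(256, 512, 4, 1, 1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2),  # 32 -> 31
            nn.Conv2d(512, 1, 4, 1, 1),                                            # 31 -> 30 patch logits
        )

    def forward(self, sketch, image):
        # score each local patch of the (sketch, image) pair as real/fake
        return self.net(torch.cat([sketch, image], dim=1))   # (B, 1, 30, 30)

# discriminator = PatchDiscriminator()
```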
To guide the generator in producing realistic and semantically aligned images, the following losses are combined (a minimal code sketch follows the list):
Adversarial Loss (GAN Loss): Encourages realism by fooling the discriminator.
L1 Loss: Penalizes pixel-wise differences between the generated and real image.
Perceptual Loss (VGG19): Compares high-level features from pre-trained VGG19 layers.
Text Consistency Loss (optional): Encourages semantic alignment between generated image and text (computed using cosine similarity of CLIP or BERT features).
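The sketch below combines the adversarial, L1, and VGG19 perceptual terms. The loss weights, the choice of VGG19 layer cut-off (up to relu3_4), and the omission of ImageNet input normalization are simplifying assumptions, and the optional text-consistency term is left out for brevity.

```python
# Combined generator objective: adversarial + L1 + VGG19 perceptual loss.
# Weights and the VGG layer cut-off are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights

bce = nn.BCEWithLogitsLoss()   # adversarial loss on PatchGAN logits
l1  = nn.L1Loss()

# frozen VGG19 feature extractor (layers up to relu3_4); proper ImageNet
# normalization of the inputs is omitted here for brevity
vgg_feats = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features[:18].eval()
for p in vgg_feats.parameters():
    p.requires_grad = False

def generator_loss(disc_fake_logits, fake_img, real_img,
                   lambda_l1=100.0, lambda_perc=10.0):
    adv  = bce(disc_fake_logits, torch.ones_like(disc_fake_logits))  # fool the discriminator
    pix  = l1(fake_img, real_img)                                    # pixel-wise L1
    perc = l1(vgg_feats(fake_img), vgg_feats(real_img))              # high-level feature match
    return adv + lambda_l1 * pix + lambda_perc * perc
```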
Optimizer: Adam (β₁=0.5, β₂=0.999)
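Using the generator, discriminator, and loss sketches above, one training iteration could look like the following; the learning rate of 2e-4 is an assumption (the common Pix2Pix default), and `sketch`, `text_emb`, and `real_img` stand for a batch drawn from a hypothetical data loader.

```python
# Optimizers with the betas stated above; the learning rate is an assumption.
g_opt = torch.optim.Adam(generator.parameters(),     lr=2e-4, betas=(0.5, 0.999))
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

# One simplified training step: update the discriminator, then the generator.
fake = generator(sketch, text_emb)

d_opt.zero_grad()
d_real = discriminator(sketch, real_img)
d_fake = discriminator(sketch, fake.detach())       # detach so D's step ignores G
d_loss = 0.5 * (bce(d_real, torch.ones_like(d_real)) +
                bce(d_fake, torch.zeros_like(d_fake)))
d_loss.backward()
d_opt.step()

g_opt.zero_grad()
g_loss = generator_loss(discriminator(sketch, fake), fake, real_img)
g_loss.backward()
g_opt.step()
```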
By comparing these three approaches, the project highlights the trade-offs between control, complexity, and output quality. Pix2Pix offers simplicity and fast training, the GAN + VGG19 + BERT model provides semantic accuracy with multi-modal input, and StyleGAN2-ADA delivers unmatched realism. This comprehensive evaluation showcases the potential of combining generative models with auxiliary networks for creative and intelligent sketch-to-image applications.