Sketch and Text Guided Image Generation Using GANs
In computer vision, generating realistic images from hand-drawn sketches is a compelling task, and it becomes more powerful when guided by text descriptions. This project proposes a GAN-based model that takes both a sketch and a text prompt as input to synthesize detailed images. Traditional models such as Pix2Pix rely solely on sketches and miss the semantic cues carried by text. To address this, we integrate text embeddings (obtained from a pre-trained BERT model and pooled into a fixed-size vector) into the generator. We also incorporate a perceptual loss based on VGG19 to preserve high-level visual features. This approach produces semantically rich and visually accurate images, suitable for applications such as digital art and facial reconstruction.
The dataset used for this project consists of:
30,000 sketches generated using the Canny edge detection method,
30,000 corresponding text prompts describing visual features such as hairstyle, facial expression, or accessories,
30,000 original face images serving as ground truth references.
Each sketch-text-image triplet forms a complete training sample. The diversity in sketches and textual descriptions ensures that the model learns to associate visual structures with semantic context, which enhances the quality of the generated images during inference.
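As a concrete illustration of how the sketches can be derived from the ground-truth photos, the snippet below applies OpenCV's Canny detector; the file paths, thresholds, and inversion step are illustrative assumptions rather than the project's exact preprocessing settings.

```python
# Minimal sketch-extraction example using OpenCV's Canny edge detector.
# Paths and thresholds are assumptions for illustration only.
import cv2

def image_to_sketch(image_path, low_thresh=100, high_thresh=200):
    """Convert a face photo into a grayscale edge-map 'sketch'."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)  # load as single-channel
    img = cv2.resize(img, (256, 256))                   # match the model input size
    edges = cv2.Canny(img, low_thresh, high_thresh)     # binary edge map
    return 255 - edges                                  # invert: dark strokes on white

sketch = image_to_sketch("faces/000001.jpg")            # hypothetical file name
cv2.imwrite("sketches/000001.png", sketch)
```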
To effectively translate sketches into realistic images while incorporating semantic information from textual descriptions, we propose an enhanced Text-Conditioned Pix2Pix GAN architecture. This model builds upon the traditional Pix2Pix framework by integrating text features and applying perceptual losses for improved realism.
The generator is based on a U-Net encoder-decoder structure, modified to include text features (a simplified code sketch follows the list below):
Input size: 256×256×1 (grayscale sketch)
Convolutional encoder compresses the image into a latent representation.
Skip connections pass fine-grained details to the decoder.
Text prompts are embedded using a pre-trained BERT model.
The BERT output is passed through a linear projection to match the latent dimensionality of the sketch features.
Sketch and text embeddings are concatenated at the bottleneck.
Decoder upsamples this joint representation into a full-resolution image.
Output: 256×256×3 RGB image
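A simplified PyTorch sketch of this text-conditioned U-Net is shown below. The layer widths, the 768-dimensional BERT vector, the 512-dimensional text projection, and the five-level depth are illustrative assumptions; the actual model may use a deeper Pix2Pix-style encoder.

```python
# Simplified text-conditioned U-Net generator (depth and channel widths are
# illustrative assumptions, not the exact project configuration).
import torch
import torch.nn as nn

def down(cin, cout):  # encoder block: halves spatial resolution
    return nn.Sequential(nn.Conv2d(cin, cout, 4, 2, 1),
                         nn.BatchNorm2d(cout), nn.LeakyReLU(0.2))

def up(cin, cout):    # decoder block: doubles spatial resolution
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, 2, 1),
                         nn.BatchNorm2d(cout), nn.ReLU())

class TextConditionedUNet(nn.Module):
    def __init__(self, text_dim=768, text_proj=512):
        super().__init__()
        # encoder: 256 -> 128 -> 64 -> 32 -> 16 -> 8 spatial resolution
        self.e1, self.e2 = down(1, 64), down(64, 128)
        self.e3, self.e4, self.e5 = down(128, 256), down(256, 512), down(512, 512)
        # project the pooled BERT embedding to match the bottleneck
        self.text_fc = nn.Linear(text_dim, text_proj)
        # decoder; input channels include text and skip-connection features
        self.d1 = up(512 + text_proj, 512)
        self.d2, self.d3, self.d4 = up(1024, 256), up(512, 128), up(256, 64)
        self.out = nn.Sequential(nn.ConvTranspose2d(128, 3, 4, 2, 1), nn.Tanh())

    def forward(self, sketch, text_emb):
        s1 = self.e1(sketch)        # (B,  64, 128, 128)
        s2 = self.e2(s1)            # (B, 128,  64,  64)
        s3 = self.e3(s2)            # (B, 256,  32,  32)
        s4 = self.e4(s3)            # (B, 512,  16,  16)
        b  = self.e5(s4)            # bottleneck: (B, 512, 8, 8)
        # broadcast the projected text vector over the bottleneck and concatenate
        t = self.text_fc(text_emb)[:, :, None, None].expand(-1, -1, b.size(2), b.size(3))
        x = self.d1(torch.cat([b, t], dim=1))
        x = self.d2(torch.cat([x, s4], dim=1))
        x = self.d3(torch.cat([x, s3], dim=1))
        x = self.d4(torch.cat([x, s2], dim=1))
        return self.out(torch.cat([x, s1], dim=1))   # (B, 3, 256, 256)

# generator = TextConditionedUNet()
# fake = generator(torch.randn(4, 1, 256, 256), torch.randn(4, 768))
```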
A PatchGAN discriminator is used to focus on local realism:
Input: Real/generated image concatenated with the input sketch
Output: 30×30 patch-level real/fake probabilities
This localized structure helps detect high-frequency artifacts.
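A minimal PyTorch sketch of such a discriminator is given below. The channel widths follow the standard 70×70 PatchGAN layout from Pix2Pix, which is an assumption about the project's exact configuration, but it does yield the 30×30 patch grid described above for a 256×256 input.

```python
# Minimal PatchGAN discriminator sketch; filter counts follow the standard
# Pix2Pix layout and are assumptions, not the project's exact settings.
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=1 + 3):            # sketch (1) + RGB image (3)
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, 2, 1), nn.LeakyReLU(0.2),                      # 256 -> 128
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),   # 128 -> 64
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),  # 64 -> 32
            nn.Conv2d(256, 512, 4, 1, 1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2),  # 32 -> 31
            nn.Conv2d(512, 1, 4, 1, 1),                                            # 31 -> 30 patch logits
        )

    def forward(self, sketch, image):
        # score each local patch of the (sketch, image) pair as real/fake
        return self.net(torch.cat([sketch, image], dim=1))   # (B, 1, 30, 30)

# discriminator = PatchDiscriminator()
```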
To guide the generator in producing realistic and semantically aligned images, the following losses are combined (a minimal code sketch follows the list):
Adversarial Loss (GAN Loss): Encourages realism by fooling the discriminator.
L1 Loss: Penalizes pixel-wise differences between the generated and real image.
Perceptual Loss (VGG19): Compares high-level features from pre-trained VGG19 layers.
Text Consistency Loss (optional): Encourages semantic alignment between generated image and text (computed using cosine similarity of CLIP or BERT features).
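The sketch below combines the adversarial, L1, and VGG19 perceptual terms. The loss weights, the choice of VGG19 layer cut-off (up to relu3_4), and the omission of ImageNet input normalization are simplifying assumptions, and the optional text-consistency term is left out for brevity.

```python
# Combined generator objective: adversarial + L1 + VGG19 perceptual loss.
# Weights and the VGG layer cut-off are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights

bce = nn.BCEWithLogitsLoss()   # adversarial loss on PatchGAN logits
l1  = nn.L1Loss()

# frozen VGG19 feature extractor (layers up to relu3_4); proper ImageNet
# normalization of the inputs is omitted here for brevity
vgg_feats = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features[:18].eval()
for p in vgg_feats.parameters():
    p.requires_grad = False

def generator_loss(disc_fake_logits, fake_img, real_img,
                   lambda_l1=100.0, lambda_perc=10.0):
    adv  = bce(disc_fake_logits, torch.ones_like(disc_fake_logits))  # fool the discriminator
    pix  = l1(fake_img, real_img)                                    # pixel-wise L1
    perc = l1(vgg_feats(fake_img), vgg_feats(real_img))              # high-level feature match
    return adv + lambda_l1 * pix + lambda_perc * perc
```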
Optimizer: Adam (β₁=0.5, β₂=0.999)
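Using the generator, discriminator, and loss sketches above, one training iteration could look like the following; the learning rate of 2e-4 is an assumption (the common Pix2Pix default), and `sketch`, `text_emb`, and `real_img` stand for a batch drawn from a hypothetical data loader.

```python
# Optimizers with the betas stated above; the learning rate is an assumption.
g_opt = torch.optim.Adam(generator.parameters(),     lr=2e-4, betas=(0.5, 0.999))
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

# One simplified training step: update the discriminator, then the generator.
fake = generator(sketch, text_emb)

d_opt.zero_grad()
d_real = discriminator(sketch, real_img)
d_fake = discriminator(sketch, fake.detach())       # detach so D's step ignores G
d_loss = 0.5 * (bce(d_real, torch.ones_like(d_real)) +
                bce(d_fake, torch.zeros_like(d_fake)))
d_loss.backward()
d_opt.step()

g_opt.zero_grad()
g_loss = generator_loss(discriminator(sketch, fake), fake, real_img)
g_loss.backward()
g_opt.step()
```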
By comparing these three approaches, the project highlights the trade-offs between control, complexity, and output quality. Pix2Pix offers simplicity and fast training, the GAN + VGG19 + BERT model provides semantic accuracy with multi-modal input, and StyleGAN2-ADA delivers unmatched realism. This comprehensive evaluation showcases the potential of combining generative models with auxiliary networks for creative and intelligent sketch-to-image applications.