Text-to-Image generative model components and process steps
Step 1: Prompt and Text Encoder (CLIP/T5)
Text Encoder converts a text prompt into a conditioning vector (embedding) that the diffusion model can understand.
There are usually two embeddings:
Positive embedding: what you want the image to contain
Negative embedding: what you want the image to avoid (or an empty prompt)
[Text Prompt] -> [Text Encoder] -> [positive embedding]
[Negative Prompt] -> [Text Encoder] -> [negative embedding]
The embedding captures semantic meaning, not grammar. Similar prompts have similar embeddings.
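As a concrete illustration, here is a minimal sketch of prompt encoding with Hugging Face transformers, assuming the CLIP text encoder used by Stable Diffusion v1 (the model name, helper function, and shapes are illustrative, not prescribed by this note):

from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def encode(prompt):
    # Tokenize to a fixed length of 77 tokens, then run the text encoder.
    tokens = tokenizer(prompt, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt")
    # Per-token embeddings, shape (1, 77, 768): the conditioning for the denoiser.
    return text_encoder(tokens.input_ids).last_hidden_state

positive_embedding = encode("a watercolor painting of a fox")
negative_embedding = encode("")  # empty prompt as the negative/unconditional branch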
Step 2: VAE, initial image, and latent space
VAE (Variational Autoencoder) acts as a compressor.
It transforms high‑resolution pixel arrays into a latent vector (a compressed feature space).
The latent vector can be viewed as parameters (means and variances) of multiple Gaussian distributions over the raw data, loosely analogous to how a Fourier transform summarizes a signal with a compact set of coefficients.
The VAE has two parts:
Encoder: compresses raw pixel data into a latent vector (a statistical representation)
Decoder: reconstructs high‑resolution pixels from the latent vector
This significantly reduces the data size, from raw pixels down to a concise summary. Some information is lost, but computation becomes much faster and more efficient.
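A minimal sketch of the encoder side, using the AutoencoderKL class from the diffusers library (the model name and shapes are assumptions for a 512x512 Stable Diffusion v1 setup):

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

pixels = torch.randn(1, 3, 512, 512)  # stand-in for a real image scaled to [-1, 1]
with torch.no_grad():
    # The encoder outputs a distribution; sampling it yields the latent vector.
    latents = vae.encode(pixels).latent_dist.sample()
latents = latents * vae.config.scaling_factor  # shape (1, 4, 64, 64): 48x fewer values than pixels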
Initially, a tensor of pure Gaussian (white) noise is generated.
For text-to-image generation, this noise is typically sampled directly in latent space, and it becomes the starting point for the diffusion process below. (The VAE encoder is only needed when starting from an existing image, as in image-to-image.)
Step 3: Diffusion Process (T Steps of Denoising)
The diffusion process gradually denoises the initial white noise into a target image (within latent space).
Noise is removed step by step, guided by the text prompt(s).
The process uses a timestep t conceptually, but unlike text-to-video there is no real time dimension in text-to-image; t simply indexes the denoising steps.
Key components:
[Scheduler]
The noise level is different at each timestep t.
The Scheduler defines how noise decreases over time (e.g. linear, cosine, Karras).
It provides signal and noise scaling factors for each step.
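For example, a diffusers scheduler exposes both the timestep list and the quantities behind these scaling factors (a sketch; attribute names follow the diffusers library):

from diffusers import DDIMScheduler

scheduler = DDIMScheduler(num_train_timesteps=1000, beta_schedule="scaled_linear")
scheduler.set_timesteps(30)      # run inference with 30 denoising steps
t = scheduler.timesteps[0]       # first (noisiest) timestep in the list

alpha_bar = scheduler.alphas_cumprod[t]
signal_scale = alpha_bar.sqrt()          # how much clean signal remains at t
noise_scale = (1.0 - alpha_bar).sqrt()   # how much noise remains at t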
[Denoiser (U-Net)]
Intuition: (clean image latent) + (known noise latent) -> (noisy image latent)
A denoiser predicts the (known noise latent) from the (noisy image latent), conditioned on a prompt (positive or negative).
Conceptually: (noisy image latent) - (known noise latent) -> (cleaner image latent)
Mathematically, this is not a direct subtraction; the noise prediction must be rescaled using signal and noise factors from the Scheduler.
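A single denoiser call, sketched with the diffusers UNet2DConditionModel (the model path is an assumption; noisy_latent and t are placeholders for the current noisy latent and timestep, and positive_embedding comes from the Step 1 sketch):

import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")

with torch.no_grad():
    # Predict the noise contained in the noisy latent at timestep t,
    # conditioned on the prompt embedding.
    noise_pred = unet(noisy_latent, t,
                      encoder_hidden_states=positive_embedding).sample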
[CFG (Classifier‑Free Guidance)]
The denoiser produces two noise latents:
One conditioned on the positive prompt
One conditioned on the negative prompt
CFG blends the two noise latents with w*latent_positive + (1-w)*latent_negative, where w can be > 1.
This can be rewritten as: latent_negative + w*(latent_positive - latent_negative)
In the vector space, (latent_positive - latent_negative) is a vector pointing from negative latent to positive latent.
Interpretation:
The vector (latent_positive − latent_negative) points from “what you don’t want” toward “what you want”.
w controls how strongly the denoising is pushed in that direction.
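The blend itself is one line of code; a sketch (variable names are illustrative, and w around 7-8 is a common default):

def cfg_blend(noise_positive, noise_negative, w=7.5):
    # Equivalent to w * noise_positive + (1 - w) * noise_negative.
    # With w > 1, this extrapolates past the conditional prediction,
    # pushing the denoising further toward the positive prompt.
    return noise_negative + w * (noise_positive - noise_negative)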
[Sampler]
A mathematical step that conceptually subtracts a noise latent from a noisy image latent, to produce a cleaner image latent.
The noise latent is the blended noise latent from CFG:
(cleaner image latent) = (noisy image latent) - (blended noise latent)
The sampler performs this calculation, rescaling the (blended noise latent) using the noise and signal scaling factors from the Scheduler.
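In diffusers, this update is a single scheduler call that applies the rescaling internally (a sketch reusing the scheduler from the earlier example; blended_noise_pred and current_latent are placeholders):

# One sampler update: remove the (rescaled) blended noise from the current latent.
step_output = scheduler.step(blended_noise_pred, t, current_latent)
cleaner_latent = step_output.prev_sample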
The diffusion loop:
current image latent = initial white‑noise latent
For each step t in the scheduler’s timestep list:
positive prompt + current image latent -> [denoiser] -> conditional noise prediction (latent_positive)
negative prompt + current image latent -> [denoiser] -> unconditional noise prediction (latent_negative)
conditional + unconditional noise predictions -> [CFG] -> blended noise prediction (blended noise latent)
current image latent + blended noise latent + scaling factors from [Scheduler] -> [Sampler] -> cleaner image latent
This loop predicts the noise implied by the prompts, removes the appropriately scaled blended noise, and produces a progressively cleaner image latent at each step.
A final clean image latent is produced at the end of the loop.
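Putting the pieces together, here is a minimal sketch of the whole loop (it assumes the unet, scheduler, positive_embedding, and negative_embedding objects from the earlier sketches; real pipelines batch the two denoiser calls for speed):

import torch

w = 7.5  # CFG guidance scale
scheduler.set_timesteps(30)
latents = torch.randn(1, 4, 64, 64) * scheduler.init_noise_sigma  # white-noise start

for t in scheduler.timesteps:
    latent_in = scheduler.scale_model_input(latents, t)  # per-sampler input scaling
    with torch.no_grad():
        noise_pos = unet(latent_in, t, encoder_hidden_states=positive_embedding).sample
        noise_neg = unet(latent_in, t, encoder_hidden_states=negative_embedding).sample
    blended = noise_neg + w * (noise_pos - noise_neg)            # CFG
    latents = scheduler.step(blended, t, latents).prev_sample    # Sampler + Scheduler

# latents now holds the final clean image latent.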
Finally the VAE Decoder converts the clean image latent back into a full‑resolution image:
final clean image latent -> [VAE Decoder] -> final image
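The decoding step, sketched with the same vae as above:

with torch.no_grad():
    image = vae.decode(latents / vae.config.scaling_factor).sample
image = (image / 2 + 0.5).clamp(0, 1)  # map from [-1, 1] back to the [0, 1] pixel range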
The Scheduler and Sampler work together, and different combinations produce different styles of denoising.
Common samplers:
DDIM (deterministic, relatively fast, good for edits/consistency).
Euler / Euler a (ancestral) (simple, strong results, adds controlled randomness).
DPM‑Solver / DPM++ families (higher‑order ODE methods; very good quality‑speed tradeoff).
Why it matters:
Number of steps T: Fewer steps = faster, but can be less detailed.
Different samplers: different look/stability (some are sharper, some are smoother).
Different schedulers: e.g. Karras vs linear can improve detail or stability for the same step count.
What does the denoiser really represent?
The denoiser is best thought of as "A conditional noise estimator across different noise levels".
It answers a very specific question: “Given how noisy this latent is right now, and given this text prompt, what noise should be removed?”
During training, the denoiser sees pairs of:
(noisy latent at noise level t) -> (known noise that was added)
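In code, one training step looks roughly like this epsilon-prediction sketch (clean_latents and text_embedding are stand-ins for a real training batch; unet and scheduler are as in the earlier sketches):

import torch
import torch.nn.functional as F

noise = torch.randn_like(clean_latents)                      # the "known noise"
t = torch.randint(0, scheduler.config.num_train_timesteps,
                  (clean_latents.shape[0],))
noisy_latents = scheduler.add_noise(clean_latents, noise, t) # forward diffusion

noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_embedding).sample
loss = F.mse_loss(noise_pred, noise)  # learn to recover the known noise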
Why is the timestep t given to the denoiser?
The same noisy latent value means different things at different noise levels.
Example:
At high noise -> most structure must be inferred
At low noise -> only small corrections are needed
Why CFG?
Without CFG:
Prompt following is often weak
Outputs drift toward generic samples
CFG exists to:
Strengthen the influence of the prompts without retraining the model or adding a classifier
CFG works because the same denoiser is trained to handle:
conditional input (positive prompt provided)
unconditional input (an empty prompt, randomly substituted during training; at inference, the negative prompt takes this slot)
This allows the model to learn both:
“what good images look like in general”
“how the prompt changes that”
CFG extracts the difference between those behaviors.