(1) Why ControlNet exists
Most text-to-image (and text-to-video) diffusion models are trained on large, generic datasets. They can follow text prompts, but text is a weak way to control exact structure.
Typical “structure” problems text alone can’t reliably lock down:
exact pose (hands/arms/legs placement)
exact composition (where objects sit in the frame)
exact outlines (matching a sketch/edges)
depth or segmentation constraints
frame-to-frame structure consistency in video
ControlNet is a way to add explicit structure guidance (pose/edges/depth/seg/line art/etc.) on top of an existing diffusion model, without retraining the whole base model.
(2) What ControlNet is
ControlNet is conceptually an extra conditioning branch attached to the diffusion model’s denoiser (commonly the U-Net). It takes a “control map” (pose skeleton, edge map, depth map, etc.) and produces guidance signals that are injected into the base U-Net’s internal feature flow.
Key mental model:
Text prompt = “what”
Control map = “where / layout / skeleton”
Base model still does the actual denoising; ControlNet steers it to respect structure. E.g. a pose map fixes the body skeleton, a Canny edge map fixes object outlines.
ControlNet is not "just storing edges". It learns how to convert a simple control signal into the right kind of push at different stages (early to late) of the denoising process.
(3) What goes in and what comes out (input/output)
ControlNet is not usually used standalone; it works alongside the base U-Net denoiser.
Inputs (per denoising step):
z_t: the current noisy latent being denoised
t: diffusion timestep (the scheduler's noise-level index, NOT video time)
text_condition: text embeddings (used by base U-Net; sometimes also used by ControlNet depending on implementation)
control_map: a structured condition (edges/pose/depth/seg/line art/etc.)
Output:
not an image
not a noise prediction directly (in most designs)
it outputs a set of feature “residuals” (delta features) that get added into matching U-Net blocks.
So the final noise prediction still comes from the base U-Net, but it is influenced/guided by the injected residuals.
i.e. it hands the U-Net denoiser a skeleton to follow, so the output takes on the intended structure.
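A minimal sketch of a single ControlNet call, using the Hugging Face diffusers API as one concrete realization (checkpoint name and tensor shapes are illustrative):

    import torch
    from diffusers import ControlNetModel

    # Example: a pose ControlNet trained for the SD 1.5 family.
    controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose")

    z_t = torch.randn(1, 4, 64, 64)            # current noisy latent (512x512 image -> 64x64 latent)
    t = torch.tensor([500])                    # diffusion timestep (noise-level index)
    text_condition = torch.randn(1, 77, 768)   # text embeddings from the CLIP text encoder
    control_map = torch.randn(1, 3, 512, 512)  # preprocessed control image (e.g., rendered pose skeleton)

    # Output is NOT an image and NOT a noise prediction: it is one
    # delta-feature tensor per U-Net down block, plus one for the mid block.
    down_residuals, mid_residual = controlnet(
        z_t, t,
        encoder_hidden_states=text_condition,
        controlnet_cond=control_map,
        return_dict=False,
    )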
(4) Where ControlNet is “plumbed in” (flow chart)
Base diffusion denoiser (simplified):
base U-Net predicts noise from (z_t, t, text_condition)
With ControlNet:
ControlNet produces residual feature maps that are added into multiple U-Net blocks.
Plain flow diagram:
Inputs:
text prompt -> text encoder -> text_condition
control image -> preprocessor -> control_map
initial noise -> latent z_T
Denoising loop (each step):
z_t, t, text_condition ---------------> base U-Net ---> noise_prediction
                                            ^
                                            |
                           injected residuals (delta features)
                                            |
z_t, t, control_map ---> ControlNet --------+ (adds into U-Net blocks)
Sampler uses noise_prediction to update z_t -> z_(t-1)
Important detail: ControlNet injection is usually done at multiple “depths” (early/mid/late blocks), not just one place. That’s how it influences both coarse layout and fine alignment.
The preprocessor could be, e.g., a Canny edge detector that converts the raw control image into an edge map.
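In the same diffusers terms (assuming unet, scheduler, and the tensors from the sketch above are in scope; this is an illustrative loop, not a full pipeline), the flow looks roughly like:

    scheduler.set_timesteps(30)    # e.g., 30 denoising steps
    z = torch.randn(1, 4, 64, 64)  # z_T: initial noise latent
    for t in scheduler.timesteps:
        # ControlNet branch: (z_t, t, control_map) -> per-block residuals.
        down_res, mid_res = controlnet(
            z, t,
            encoder_hidden_states=text_condition,
            controlnet_cond=control_map,
            return_dict=False,
        )
        # Base U-Net still makes the noise prediction; the residuals are
        # added into its matching down/mid blocks on the way through.
        noise_pred = unet(
            z, t,
            encoder_hidden_states=text_condition,
            down_block_additional_residuals=down_res,
            mid_block_additional_residual=mid_res,
        ).sample
        # Sampler step: z_t -> z_(t-1)
        z = scheduler.step(noise_pred, t, z).prev_sample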
(5) Why ControlNet uses “original block sizes” (why not smaller?)
Skeletons/edges carry simple, "small" information, so why is ControlNet so big that it mirrors the U-Net?
Because ControlNet must output tensors that can be added into the U-Net’s internal tensors.
Inside a U-Net, each block processes feature tensors like:
shape roughly: (channels, height, width)
ControlNet outputs delta features that are injected by addition:
base_features = base_features + strength * delta_features
To add them, delta_features must have the SAME shape (same channels and spatial size) as base_features at that block.
So ControlNet often mirrors the U-Net’s multi-scale structure to produce:
a delta feature map for each relevant block
at the correct spatial resolution and channel count
Even if the control input is “1-channel edges”, the guidance required is not 1-channel.
The base U-Net uses rich, high-dimensional features; ControlNet must generate guidance in that same feature space.
Note:
There are lightweight variants / adapters, but standard ControlNet keeps shape-compatible capacity because it’s the most direct way to inject multi-scale guidance.
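A toy illustration of the shape constraint (channel counts and resolutions are made up, in the spirit of an SD 1.5-style U-Net):

    import torch

    # Hypothetical multi-scale U-Net features for a 64x64 latent.
    base_features = [
        torch.randn(1, 320, 64, 64),   # early block: high resolution, fewer channels
        torch.randn(1, 640, 32, 32),
        torch.randn(1, 1280, 16, 16),  # deep block: low resolution, many channels
    ]

    # ControlNet must emit one delta per block with IDENTICAL shapes,
    # even though the control input was only a 1-channel edge map.
    delta_features = [torch.randn_like(f) for f in base_features]

    strength = 1.0
    guided = [b + strength * d for b, d in zip(base_features, delta_features)]
    # Any shape mismatch here would make the addition fail outright.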
(6) Why timestep t is included (and what “time” means here)
Here, "time" means "how noisy the latent is right now" (noise level / denoising step index).
Why t matters:
Early steps (high noise): there is almost no structure yet
-> control should mostly guide global layout / pose / composition
Late steps (low noise): structure exists; only refinement is needed
-> control should mostly guide precise alignment (edges, contours, small corrections)
So ControlNet needs (z_t AND t) to decide:
how strongly to enforce control at this noise level
what kind of corrections are appropriate at this stage
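For intuition, t typically enters the network as a sinusoidal embedding (the standard diffusion recipe; the dimension here is illustrative), so every block can behave differently at high vs low noise:

    import math
    import torch

    def timestep_embedding(t: torch.Tensor, dim: int = 320) -> torch.Tensor:
        # Each timestep maps to a unique vector; downstream layers use it
        # to modulate their features depending on the noise level.
        half = dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
        args = t.float()[:, None] * freqs[None, :]
        return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

    # Same control map, different t -> different embeddings -> different residuals.
    early = timestep_embedding(torch.tensor([900]))  # high noise: coarse layout guidance
    late = timestep_embedding(torch.tensor([50]))    # low noise: fine alignment guidance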
(7) Does ControlNet “change model behavior”? Can it be turned on/off?
Yes, ControlNet changes behavior, but differently from LoRA.
LoRA changes behavior by modifying computations via learned weight deltas.
ControlNet changes behavior by adding extra conditioning signals (delta features) into the forward pass.
Turning on/off:
If control strength = 0 (or ControlNet disabled), injected residuals are effectively zero -> model behaves like the base model.
Increasing control strength makes the output adhere more strictly to the control_map.
Many pipelines also allow a “schedule”:
stronger control early steps, weaker later steps (or the opposite), depending on the control type.
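Sketch of such a schedule (diffusers exposes a conditioning_scale argument on the ControlNet call; the linear fade here is just one example shape):

    num_steps = len(scheduler.timesteps)
    for i, t in enumerate(scheduler.timesteps):
        progress = i / max(num_steps - 1, 1)  # 0.0 at the first (noisiest) step
        strength = 1.0 - progress             # example: strong early, fading late
        down_res, mid_res = controlnet(
            z, t,
            encoder_hidden_states=text_condition,
            controlnet_cond=control_map,
            conditioning_scale=strength,       # 0.0 == ControlNet effectively off
            return_dict=False,
        )
        # ... rest of the step as in the loop from section (4)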
(8) How ControlNet is trained
Training data typically includes:
an image
its text caption (or prompt)
a control_map derived from that image (pose, edges, depth, segmentation, etc.)
Training approach:
base model is usually frozen (or mostly frozen)
ControlNet parameters are trained so that, given (control_map, z_t, t), it outputs residuals that steer the base U-Net to produce noise predictions consistent with the original image.
One important design trick:
ControlNet often starts with “zero contribution” so it doesn’t break the base model initially.
This is commonly done via layers initialized so residual output starts near zero, then training learns to “turn on” the control influence.
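The classic form of this trick is the "zero convolution": a 1x1 conv whose weights and bias start at exactly zero, so the injected residual is zero at the start of training. A minimal sketch:

    import torch.nn as nn

    def zero_conv(channels: int) -> nn.Conv2d:
        # Outputs all-zero residuals at initialization, so the frozen base
        # model is untouched until training "turns on" the control influence.
        conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(conv.weight)
        nn.init.zeros_(conv.bias)
        return conv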
Result:
the base model keeps its general capability
ControlNet learns “how to enforce structure”
(9) Skeleton “movement” and joint constraints
For video:
“movement” comes from you providing a sequence of control maps over frames:
frame1_pose, frame2_pose, ... frameN_pose
ControlNet conditions each frame (or each frame’s denoising process) on that frame’s pose map
So motion is driven by input control maps per frame.
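Sketch of the per-frame idea (render_pose and generate_frame are hypothetical helpers; real video pipelines batch frames and add temporal layers, but the conditioning logic is per-frame):

    # One pose map per frame; "movement" exists only because these differ.
    pose_maps = [render_pose(skel) for skel in skeleton_sequence]

    frames = []
    for pose_map in pose_maps:
        # Each frame's denoising is conditioned on THAT frame's pose map.
        frames.append(generate_frame(prompt, control_map=pose_map))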
Can ControlNet enforce biomechanical constraints (like “hand can only move in certain angles”)?
Not explicitly like a physics engine or kinematic solver.
However, it can follow the skeleton/pose map you give it.
It may produce plausible anatomy because the base model learned strong priors from real data.
If you need strict joint-angle constraints:
enforce them BEFORE ControlNet (pose generation/cleanup step)
e.g., use a rig/IK system, clamp angles, smooth trajectories, then feed the corrected pose maps into ControlNet
So the constraint is usually handled upstream; ControlNet is the follower, not the rule-set.
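A sketch of such an upstream fix in 2D (pure geometry, nothing ControlNet-specific; the clamp range is illustrative):

    import numpy as np

    def clamp_elbow(shoulder, elbow, wrist, min_deg=10.0, max_deg=175.0):
        # Clamp the elbow's interior angle into a plausible range by
        # rotating the wrist around the elbow when it is out of range.
        upper = shoulder - elbow
        fore = wrist - elbow
        cos_a = np.dot(upper, fore) / (np.linalg.norm(upper) * np.linalg.norm(fore))
        angle = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))
        target = np.clip(angle, min_deg, max_deg)
        if not np.isclose(angle, target):
            # z-component of the 2D cross product gives the rotation direction
            sign = np.sign(upper[0] * fore[1] - upper[1] * fore[0])
            theta = np.radians(target - angle) * sign
            rot = np.array([[np.cos(theta), -np.sin(theta)],
                            [np.sin(theta),  np.cos(theta)]])
            wrist = elbow + rot @ fore
        return wrist  # feed the corrected skeleton into the pose-map renderer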
(10) Compatibility (similar to LoRA)
ControlNet is also architecture-specific:
A ControlNet trained for SD1.5 is not directly compatible with SDXL (layer shapes and block structure differ).
It works best with the intended base family/checkpoint.
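In diffusers terms, the pairing looks like this (well-known example checkpoint names for the SD 1.5 family; swapping in an SDXL base here would fail because the block shapes differ):

    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    # Both halves come from the same architecture family (SD 1.5).
    controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet
    )
    # SDXL needs its own ControlNet checkpoints and pipeline class
    # (StableDiffusionXLControlNetPipeline).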
Short recap
ControlNet adds a parallel branch that turns a control_map (pose/edges/depth/seg) into multi-scale delta feature maps. These deltas must match the U-Net’s internal tensor shapes because they’re injected by addition into U-Net blocks. Timestep t is included because it represents noise level; the “right” control influence differs early vs late denoising. Skeleton movement is not diffusion time; it comes from per-frame control maps in video. Joint limits aren’t explicitly enforced; that’s usually an upstream kinematics/rigging job.