LoRA (Low-Rank Adaptation) for generative AI models
For most existing text-to-image or text-to-video generative models, the base model is trained on very large, generic datasets. This gives broad capability, but if you want specialized effects (for example sharper portrait rendering, smoother motion, or a very specific style), fully fine-tuning the entire large model is expensive.
A more efficient approach is LoRA (Low-Rank Adaptation).
(1) What LoRA is (and what it is not)
LoRA is not a standalone “new model”
It is a small set of extra weights that are attached to a specific base model architecture (and usually a specific model family/checkpoint), e.g., SDXL vs SD1.5.
Its purpose is to change the behavior of the base model in a controlled, reversible way.
Because LoRA is trained against a particular architecture (and often a particular base checkpoint), it generally works best on that same model family.
(2) The core mechanism: a “side path” added to a layer
Consider a linear layer in the base model:
y = Wx
LoRA adds a parallel “side path” that produces a delta output which is added back to the original output:
y = Wx + αBAx
Here:
x is the layer input activation
y is the layer output activation
W is the frozen pretrained weight matrix of the original big model
A and B are the LoRA adapter weights
α (alpha) is a scaling factor that controls LoRA strength
A simple diagram:
Base path:       x --> W ---------------------------> Wx --+
                                                            +--> y = Wx + αBAx
LoRA side path:  x --> A (down to r) --> B (up) --> αBAx --+
Important: LoRA usually does not overwrite W during inference.
It adds an extra contribution (and can be turned off by setting α = 0).
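To make the side path concrete, here is a minimal PyTorch sketch of a LoRA-adapted linear layer (the class name LoRALinear, the default r, and the init choices are illustrative, not taken from any particular library):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a low-rank side path: y = Wx + alpha * B(A(x))."""
    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: float = 1.0):
        super().__init__()
        self.base = base_linear
        self.base.weight.requires_grad_(False)        # freeze the pretrained W
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.alpha = alpha
        # A: d_in -> r (down-projection), B: r -> d_out (up-projection)
        self.A = nn.Linear(base_linear.in_features, r, bias=False)
        self.B = nn.Linear(r, base_linear.out_features, bias=False)
        nn.init.normal_(self.A.weight, std=0.01)      # small random start for A
        nn.init.zeros_(self.B.weight)                 # B = 0, so BA = 0 and behavior is unchanged at first

    def forward(self, x):
        # frozen base path plus the scaled low-rank delta
        return self.base(x) + self.alpha * self.B(self.A(x))

Because B starts at zero, the adapter is a no-op before training, and setting alpha to 0 at inference disables it, matching the point above.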
(3) Why the “down to r then back up” exists
LoRA constrains the update to be low-rank:
A maps the layer input from its original dimension d_in down to a small r-dimensional space
B maps from that r-dimensional space back up to the output dimension d_out
So:
A has shape (r, d_in)
B has shape (d_out, r)
This forces the model’s change to that layer to lie in at most r independent directions.
Structurally, this resembles an autoencoder-style bottleneck, but conceptually it’s different:
Autoencoder bottleneck: compresses data to represent the data
LoRA bottleneck: compresses the update so the model can only change in a few key directions.
This rank-r bottleneck often captures the most significant update directions needed for the target dataset (e.g., a portrait aesthetic), without relearning the entire base model.
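A quick way to see this constraint is to build A and B with the shapes above and check the rank of their product; the dimensions below are arbitrary example values:

import torch

d_in, d_out, r = 768, 768, 8              # example dimensions, not from any real model
A = torch.randn(r, d_in)                  # down-projection, shape (r, d_in)
B = torch.randn(d_out, r)                 # up-projection, shape (d_out, r)

delta_W = B @ A                           # the implied weight update, shape (d_out, d_in)
print(delta_W.shape)                      # torch.Size([768, 768])
print(torch.linalg.matrix_rank(delta_W))  # at most r, i.e. 8 here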
(4) Where do A, B, and α come from?
A and B are new parameters introduced by LoRA (they don’t exist in the base model).
During LoRA training, the base model weights W are frozen, and only A and B are trainable.
α (alpha) is a strength control:
often stored as training metadata and/or adjusted at inference (a “LoRA weight” slider)
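A small sketch of how α behaves at inference time (same illustrative shapes as above): the low-rank delta can stay as a separate side path or be merged into the weight, and α = 0 recovers the base model exactly.

import torch

d_in, d_out, r = 768, 768, 8
W = torch.randn(d_out, d_in)              # frozen base weight
A = torch.randn(r, d_in)
B = torch.randn(d_out, r)
alpha = 0.8                               # the "LoRA weight" slider

x = torch.randn(d_in)
y_side_path = W @ x + alpha * (B @ (A @ x))   # LoRA kept as a separate branch
W_merged = W + alpha * (B @ A)                # or baked into the weights
y_merged = W_merged @ x

print(torch.allclose(y_side_path, y_merged, atol=1e-4))   # True: the two forms agree
print(torch.equal(W + 0.0 * (B @ A), W))                  # alpha = 0 leaves W untouched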
(5) How LoRA is trained (high level)
To train a LoRA for a desired effect:
Choose a base model (e.g., SDXL).
“Plumb” LoRA adapters into selected layers of the base model.
Freeze the base model weights W.
Train only the LoRA weights A and B on a dataset that represents the target effect/style.
So LoRA learns values of A and B such that, across training examples, adding αBAx helps the model produce outputs that match that dataset.
In short: LoRA learns a compact behavior patch on top of the frozen base model.
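A hedged sketch of that setup, reusing the illustrative LoRALinear wrapper from section (2); the toy model, dimensions, and loss are stand-ins, not a real diffusion training loop:

import torch
import torch.nn as nn

# toy stand-in for the base model (in practice: the denoiser / text encoder)
base_model = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 768))
for p in base_model.parameters():
    p.requires_grad_(False)                       # freeze all base weights W

# "plumb" adapters into selected layers (LoRALinear from the earlier sketch)
base_model[0] = LoRALinear(base_model[0], r=8, alpha=1.0)
base_model[2] = LoRALinear(base_model[2], r=8, alpha=1.0)

# only the A/B matrices remain trainable
lora_params = [p for p in base_model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(lora_params, lr=1e-4)

for x, target in [(torch.randn(4, 768), torch.randn(4, 768))]:  # stand-in for the target dataset
    loss = nn.functional.mse_loss(base_model(x), target)        # stand-in for the diffusion loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()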
(6) Choosing r (rank): what happens if r is big?
Smaller r: fewer parameters, stronger constraint, often better regularization; may miss fine details if too small.
Larger r: more capacity to learn complex updates; larger file and compute; higher risk of overfitting or overpowering the base model.
So the tradeoff is capacity vs. compactness/regularization, not a question of the model “generalizing too well.”
(7) Why LoRA files are small (but not “just two matrices”)
A common misconception is “LoRA is only two matrices A and B.”
That’s true per adapted layer, but a practical LoRA attaches to many layers, so it stores many A/B pairs, one per adapted layer.
Even so, it’s usually much smaller than the full model because each pair is low-rank. That’s why LoRAs are often tens to hundreds of MB compared to GBs for base models.
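A rough back-of-the-envelope illustration of why that is; the layer count, dimensions, and rank below are made up, not measurements of any particular checkpoint:

# hypothetical: 300 adapted layers, each roughly 1024 -> 1024, rank 16, stored in fp16
num_layers, d_in, d_out, r = 300, 1024, 1024, 16
bytes_per_param = 2                                   # fp16

lora_params = num_layers * r * (d_in + d_out)         # one (r, d_in) A and one (d_out, r) B per layer
full_params = num_layers * d_in * d_out               # what full fine-tuning of those layers would touch

print(f"LoRA:  {lora_params / 1e6:.1f} M params  (~{lora_params * bytes_per_param / 1e6:.0f} MB)")
print(f"Full:  {full_params / 1e6:.1f} M params  (~{full_params * bytes_per_param / 1e6:.0f} MB)")

With these made-up numbers the LoRA is about 20 MB while the same layers fully fine-tuned would be hundreds of MB, which is the scale difference described above.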
(8) Where LoRA is typically applied in diffusion systems
LoRA can be attached to different layer types, depending on what you want:
U-Net / denoiser attention layers (very common): strong effect on style/identity/detail
Text encoder layers (optional): can improve trigger reliability / embedding behavior
Sometimes conv/MLP layers (depends on training setup)
But whichever layers you plumb it into, you must train with those adapters present, and the resulting LoRA is generally compatible only with the intended base model family/architecture.
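In practice you rarely plumb the adapters in by hand; libraries handle the attachment. A hedged usage sketch with Hugging Face diffusers (the model ID and LoRA file path are placeholders, and the exact API surface can vary across diffusers versions):

import torch
from diffusers import StableDiffusionXLPipeline

# SDXL base model; an SDXL-trained LoRA must be paired with an SDXL-family checkpoint
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# attaches the stored A/B pairs to the matching U-Net (and optionally text encoder) layers
pipe.load_lora_weights("path/to/portrait_style_lora.safetensors")

image = pipe("studio portrait, soft window light").images[0]
image.save("portrait.png")

The α strength is usually exposed as a separate knob (the “LoRA weight” slider in most UIs); in diffusers it can be adjusted when fusing the LoRA or via adapter scales, depending on the version.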
Final recap
LoRA is a small, trainable side branch added to selected layers of a frozen base model. For a layer with base output Wx, LoRA adds a low-rank residual αBAx. It’s trained on a target dataset by freezing the base model and updating only A and B. Different LoRAs correspond to different learned “behavior patches,” such as a portrait look or a particular style.