HOW GENERATIVE AI REALLY WORKS - A PRIMER

When we talk about "AI" here, we're talking about generative programs that use something called a "neural network" to produce new, algorithmically generated outputs. Image generators aren't just one neural network, but actually three (and sometimes more!) all working together in tandem to identify and extrapolate patterns between random noise and words.
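If you like to peek under the hood, here's a minimal sketch of that "three networks working together" idea using Hugging Face's diffusers library. The checkpoint name is just an assumption for the example; any Stable Diffusion checkpoint exposes the same pieces.

```python
# A minimal sketch: loading a Stable Diffusion pipeline with the "diffusers"
# library and peeking at the separate neural networks bundled inside it.
from diffusers import StableDiffusionPipeline

# The checkpoint name below is an assumption; substitute any SD checkpoint.
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

print(type(pipe.text_encoder).__name__)  # the CLIP text encoder
print(type(pipe.unet).__name__)          # the U-Net "denoiser"
print(type(pipe.vae).__name__)           # the variational autoencoder (VAE)
```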

Image generators work by analyzing random noise sampled from a Gaussian (normal) distribution - visualized, it looks like old TV static, if you've ever seen that. The analysis is guided by a text prompt: a text encoder translates the prompt into numbers the rest of the system can work with, and the generator's attention mechanism then picks out parts of the noise that look like the concepts in the prompt, eerily similar to the way humans will identify patterns in clouds. That text encoder is our first neural network, usually just referred to as CLIP, although that name is more correctly attributed to a specific model developed by OpenAI; other image generators use open re-implementations of the same idea, such as OpenCLIP, which Stability AI uses in its later Stable Diffusion models.
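To make "translates the prompt into numbers" a little more concrete, here's a small sketch of just the text-encoding half, using the CLIP implementation in Hugging Face's transformers library. The checkpoint name is just one publicly available CLIP model, and the exact shapes vary between checkpoints.

```python
# A small sketch: turning a prompt into the embeddings ("numbers") that will
# later guide the denoiser. Uses the CLIP text encoder from "transformers";
# the checkpoint name is just one publicly available CLIP model.
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

tokens = tokenizer("kermit the frog painted by van gogh",
                   padding="max_length", truncation=True, return_tensors="pt")
embeddings = text_encoder(**tokens).last_hidden_state

print(embeddings.shape)  # one vector per token, e.g. (1, 77, 512) for this checkpoint
```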

Once concepts from the prompt have been "identified" in the noise, the generator does something called "denoising", where it imposes some order onto the noise so that it more strongly resembles the concept that was identified; it's like if we saw a shape that looked like a star in the clouds, and then we were able to reshape the cloud to look even more like a star, but only just a little bit. This is our second neural network, usually just referred to as "U-Net" for the shape of the network, although you can also think of it as the "denoiser", because it is trained to take random noise and push it into a more ordered state. This is the part of the image generator that "draws", although that analogy is a bit stretched.
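Here's a purely illustrative, toy-sized sketch of one denoising step in PyTorch. The "U-Net" here is a stand-in function, not a trained model; the point is just the shape of the operation: predict the noise, then remove only a little of it.

```python
# Toy sketch of a single denoising step. The real U-Net is a trained network
# that predicts the noise hiding the prompted concept; this stand-in isn't.
import torch

def fake_unet(latents, text_embeddings):
    # Stand-in "noise prediction" so the example runs without a real model.
    return latents - torch.tanh(latents)

latents = torch.randn(1, 4, 64, 64)        # latent noise (Stable-Diffusion-ish shape)
text_embeddings = torch.randn(1, 77, 768)  # stand-in for the encoded prompt

predicted_noise = fake_unet(latents, text_embeddings)
latents = latents - 0.05 * predicted_noise  # nudge toward order, but only a little
```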

After one pass of this identification-then-denoising process, it starts over again, except this time, in place of the random noise, the generator is served its own modification of the noise, the one with the patterns the denoiser just reinforced. The attention mechanism looks again, the denoiser reinforces the patterns it finds... again, and again, usually between 20 and 50 times; each time around is referred to as a "step".
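Put together, the loop looks roughly like this. This sketch borrows a real scheduler from the diffusers library to handle the "remove a little noise each step" bookkeeping, but keeps the stand-in U-Net from above so it runs without downloading a multi-gigabyte model.

```python
# A hedged sketch of the identify-and-denoise loop: 30 "steps", within the
# usual 20-50 range, with a stand-in U-Net and a real DDIM scheduler.
import torch
from diffusers import DDIMScheduler

def fake_unet(latents, t, text_embeddings):
    return 0.1 * latents  # stand-in "noise prediction"; the real one is learned

scheduler = DDIMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(30)

latents = torch.randn(1, 4, 64, 64)        # start from pure latent noise
text_embeddings = torch.randn(1, 77, 768)  # stand-in for the encoded prompt

for t in scheduler.timesteps:              # one trip around the loop per "step"
    noise_pred = fake_unet(latents, t, text_embeddings)
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```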

Once the last step is done, you have your completed image... except, there's a problem. There's no image yet!

A raster is a grid where each square of the grid contains information. An image file is usually a raster, and the squares in the grid are what we call "pixels"; the information contained in each pixel is a color value. You are probably familiar with this!
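For concreteness, here's what a (very tiny) raster looks like as data:

```python
# A tiny 2x2 raster: each square of the grid (pixel) holds an (R, G, B) color value.
import numpy as np

raster = np.array([
    [[255,   0,   0], [  0, 255,   0]],  # top row: a red pixel, a green pixel
    [[  0,   0, 255], [255, 255, 255]],  # bottom row: a blue pixel, a white pixel
], dtype=np.uint8)

print(raster.shape)  # (2, 2, 3): 2 rows, 2 columns, 3 color channels per pixel
```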

So far, our picture is NOT a raster. Instead, it exists as complicated high-dimensional math. In this state, it is said to exist in "latent space": the image exists as "latents", and the noise being worked on is called "latent noise". The attention mechanism and the denoiser are really just identifying patterns in numbers and then pushing those numbers around to more consistently represent the pattern they're extrapolating; the patterns being recognized and extrapolated are patterns in this latent math.
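To put rough numbers on it: in Stable Diffusion 1.x (used here only as a familiar example; other models use different sizes), the latents are a much smaller block of numbers than the finished pixel grid they eventually become.

```python
# Illustration of latents vs. a finished raster, using Stable Diffusion 1.x
# sizes as an assumption: a 4x64x64 latent block becomes a 3x512x512 image.
import torch

latents = torch.randn(1, 4, 64, 64)     # "latent noise": just high-dimensional numbers
raster  = torch.zeros(1, 3, 512, 512)   # the pixel grid the VAE will eventually produce

print(latents.numel(), "latent values vs.", raster.numel(), "pixel color values")
```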

To get a raster from this, we need one more neural network, one that has been trained to re-interpret latents into a raster. It's time to meet our last network, the Variational Autoencoder, or "VAE"! The VAE can actually do this in reverse, too; it can take a raster and encode it into high-dimensional math for CLIP and the U-Net to do their thing. If you've ever used an AI to put an anime "filter" onto a picture, the VAE is how your picture was passed off into latent space to get worked on, and then the VAE is how your picture got turned back into a raster for you to look at.
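Here's a hedged sketch of both directions using the AutoencoderKL class from diffusers. The checkpoint name is one commonly published Stable Diffusion VAE (an assumption), and the "photo" is just random numbers standing in for a real picture.

```python
# A sketch of the VAE's two jobs: decoding latents into a raster, and
# encoding a raster back into latents. The checkpoint name is an assumption.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

# Decode: latents -> raster. The last step of normal image generation.
latents = torch.randn(1, 4, 64, 64)
image = vae.decode(latents).sample             # a (1, 3, 512, 512) pixel grid

# Encode: raster -> latents. How an existing photo gets pulled into latent
# space for image-to-image ("anime filter") style workflows.
photo = torch.rand(1, 3, 512, 512) * 2 - 1     # stand-in image scaled to [-1, 1]
photo_latents = vae.encode(photo).latent_dist.sample()
```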

Some folks still think: Obviously that complicated math from the latent space is actually just pieces of images! STOLEN pieces!

That would be incorrect.

It's true that to gain an understanding of these patterns and do their thing, these neural networks need to be pre-trained on existing images and captions - pre-trained as in, it happens long before I ask it to generate an image. But the conceptual understanding of patterns that the attention mechanism and denoiser have isn't pieces of images, or even re-expressed pieces of images; it's a mathematical representation of the concepts the text describes, understood on several hierarchical levels. Thanks to CLIP, image generators understand the complex semantic interactions between words and visual information, and because the U-Net is working from random noise, it's not gonna reproduce the same patterns it was trained on back when it was a baby denoiser denoising existing images. If it were just using pieces of existing images, image generators wouldn't be able to recombine concepts in novel ways that don't exist in the training data. Van Gogh never painted Kermit the Frog, but I can prompt for Kermit in the style of Van Gogh, because our generator has an understanding of the very concept of Van-Gogh-ness!

When we talk about stealing images, usually we're talking about republishing an image without permission, like on a different website or on unauthorized prints. Downloading images from the public internet isn't stealing; in fact, your browser needs to download them into its cache for you to see them at all. If your images are available on the public internet for anyone to look at, you don't have a reasonable expectation that the "robots" aren't gonna look at them, too, and as long as the robots aren't republishing them, there's no case to be made that it's stealing.