AI Image Generation: Stable Diffusion Overview

Beginner 10 min read

What you'll learn

✓ What diffusion models actually do
✓ Why latent diffusion is so efficient
✓ The components: VAE, U-Net, text encoder
✓ How sampling and guidance shape the image
✓ Practical tips for prompts and fine-tuning

Prerequisites

• Basic deep learning familiarity

Stable Diffusion was the moment image generation went from research toy to weekend project. It runs on consumer GPUs, the weights are open, and the ecosystem of fine-tunes and tools is enormous. This post explains how it works and how to use it well.

What and Why

Stable Diffusion is a latent diffusion model. Given a text prompt, it produces an image by starting from random noise and iteratively denoising it. The clever part is that all the denoising happens in a compressed latent space, not at the pixel level. That makes it fast and memory-friendly.

Why care? Because it puts powerful generation in your hands without an API bill, lets you fine-tune on your own data, and runs on a single mid-range GPU. The same architecture underlies most modern open image models.

Mental Model

A diffusion model learns to reverse a noising process. During training, real images are progressively corrupted with Gaussian noise across many steps. A neural network learns to predict, at each step, what noise was added. At inference, you start from pure noise and run the network backwards, peeling off noise step by step, conditioned on a text prompt.
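The noising half of that process has a closed form, which is part of what makes training practical: any step can be reached in one jump instead of looping. Below is a minimal NumPy sketch, assuming a DDPM-style linear beta schedule. The schedule values and the names `alpha_bar` and `add_noise` are illustrative choices, not Stable Diffusion's actual training code.

```python
import numpy as np

# Assumed DDPM-style linear schedule; the exact values are illustrative.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # fraction of signal variance left at each step


def add_noise(x0, t, rng):
    """Jump straight to step t: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps  # eps is what the denoiser is trained to predict


rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 64, 64))  # stand-in for a clean sample
x_t, eps = add_noise(x0, t=500, rng=rng)
```

Training then amounts to picking random `t`, noising a sample this way, and minimizing the squared error between the network's output and `eps`.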

Stable Diffusion adds two key tricks. A variational autoencoder (VAE) compresses a 512x512 RGB image into a 64x64 latent grid with 4 channels, so the diffusion process runs on 64x fewer spatial positions (about 48x fewer values once channels are counted). A frozen text encoder, originally CLIP, turns the prompt into embeddings that condition the denoiser at every step.
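The size of that saving is easy to check by hand. A quick arithmetic sketch, assuming a 3-channel RGB input and the 4-channel 64x64 latent shown in the pipeline diagram (`tensor_size` is a helper name made up for this example):

```python
def tensor_size(height, width, channels):
    """Number of values in a height x width x channels tensor."""
    return height * width * channels


pixels = tensor_size(512, 512, 3)  # RGB image
latent = tensor_size(64, 64, 4)    # 4-channel latent, 8x smaller per side

spatial_ratio = (512 * 512) // (64 * 64)  # fewer spatial positions
value_ratio = pixels // latent            # fewer values overall

print(spatial_ratio, value_ratio)  # prints: 64 48
```

Every U-Net call works on the smaller tensor, so the saving applies to each denoising step, not just once.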

Hands-on Example

A minimal generation run with the Hugging Face Diffusers library:

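from diffusers import StableDiffusionPipeline
import torch

# The "-base" checkpoint is the 512x512 variant of SD 2.1;
# "stabilityai/stable-diffusion-2-1" generates 768x768 by default.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base",
    torch_dtype=torch.float16,  # half precision roughly halves weight memory
).to("cuda")

image = pipe(
    prompt="a watercolor painting of a quiet harbor at dawn",
    num_inference_steps=30,  # number of denoising steps
    guidance_scale=7.5,      # strength of classifier-free guidance
).images[0]

image.save("harbor.png")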

Under the hood, the pipeline samples noise, encodes the prompt, and runs the U-Net for 30 denoising steps. Each step predicts noise, the scheduler updates the latent, and at the end the VAE decoder turns the final latent into pixels.

text prompt
   |
   v
[text encoder (CLIP)]
   |
   v
 text embeddings
                        random latent (64x64x4)
                              |
                              v
 +----------------- denoising loop (e.g. 30 steps) -----------------+
 |                                                                  |
 |   latent + text emb -> [U-Net] -> predicted noise -> scheduler   |
 |   update latent <-------------------------------------------+    |
 +------------------------------------------------------------------+
                              |
                              v
                       final latent
                              |
                              v
                      [VAE decoder]
                              |
                              v
                       512x512 image

Stable Diffusion sampling pipeline
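The loop in the diagram can be sketched without any model at all. The example below uses a deterministic DDIM-style update as the scheduler, which is one of several samplers and not necessarily the one a given pipeline ships with. The linear schedule, the function names, and the "oracle" denoiser that already knows the true noise are all assumptions made so the sketch runs standalone.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)


def ddim_step(x_t, eps_pred, t, t_prev):
    """One deterministic DDIM update from step t to t_prev (t_prev < 0 means fully clean)."""
    ab_t = alpha_bar[t]
    ab_prev = alpha_bar[t_prev] if t_prev >= 0 else 1.0
    # Estimate the clean sample implied by the predicted noise...
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    # ...then re-noise that estimate to the lower noise level.
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_pred


def sample(denoiser, x_T, num_steps=30):
    """Run the denoising loop over num_steps evenly spaced timesteps."""
    timesteps = np.linspace(T - 1, 0, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(timesteps):
        t_prev = timesteps[i + 1] if i + 1 < num_steps else -1
        eps_pred = denoiser(x, t)  # the U-Net call in the real pipeline
        x = ddim_step(x, eps_pred, t, t_prev)
    return x


# Sanity check with an oracle denoiser that returns the true noise.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 64, 64))
eps = rng.standard_normal((4, 64, 64))
x_T = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1.0 - alpha_bar[-1]) * eps
recovered = sample(lambda x, t: eps, x_T)
```

With a perfect noise prediction the loop recovers `x0` up to floating-point error. A real U-Net is only approximately right, which is why the number of steps and the choice of scheduler matter.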

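The guidance_scale parameter controls classifier-free guidance. At every step the model makes two noise predictions, one conditioned on the prompt and one unconditional, and the sampler extrapolates from the unconditional prediction toward the conditional one by this factor. Higher values pull harder toward the prompt; too high and images look oversaturated and artifacted.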
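In code, the guidance step is a single line of arithmetic on two noise predictions. A minimal sketch with random arrays standing in for the two U-Net outputs (`apply_guidance` is an illustrative name, not a library function):

```python
import numpy as np


def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Classifier-free guidance: start at the unconditional prediction and
    move guidance_scale times the distance toward the conditional one."""
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)


rng = np.random.default_rng(0)
noise_uncond = rng.standard_normal((4, 64, 64))  # stand-in: prediction for the empty prompt
noise_cond = rng.standard_normal((4, 64, 64))    # stand-in: prediction for the text prompt

guided = apply_guidance(noise_uncond, noise_cond, guidance_scale=7.5)
```

A scale of 1 returns the conditional prediction unchanged. Anything above 1 overshoots it, which is where both the extra prompt fidelity and the oversaturation at high values come from.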

Trade-offs

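More sampling steps cost more time, with diminishing returns in quality. Modern schedulers like DPM-Solver++ deliver good quality in 20 to 30 steps; older schedulers such as DDIM or PNDM are usually run at around 50, which is also the Diffusers default for num_inference_steps. Pick a scheduler before tuning steps.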
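In Diffusers, changing the scheduler is a small edit on an already loaded pipeline. The sketch below wraps the usual pattern in a helper. The helper name `use_dpm_solver` is made up for this post, and the import sits inside the function only so the snippet can be defined without Diffusers installed.

```python
def use_dpm_solver(pipe):
    """Replace the pipeline's scheduler with DPM-Solver++, reusing its noise-schedule config."""
    # Imported lazily; calling this function requires the diffusers package.
    from diffusers import DPMSolverMultistepScheduler

    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    return pipe
```

After `pipe = use_dpm_solver(pipe)`, the same `pipe(...)` call as before should give usable results at 20 to 30 steps.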

Higher guidance scales give prompt fidelity at the cost of natural look. A scale of 7 to 9 is a sweet spot for SD 1.5 and 2.1. SDXL prefers lower values around 5 to 7.

Bigger models (SDXL, SD3) give better composition and text rendering but eat more VRAM and run slower. A 24 GB GPU is comfortable; an 8 GB GPU forces tiled VAE decoding and offloading.
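For the small-GPU case, Diffusers exposes these memory savers as methods on the pipeline and its VAE. A hedged sketch follows: `apply_low_vram_settings` is a helper name invented here, and the right combination of options depends on your model and Diffusers version.

```python
def apply_low_vram_settings(pipe):
    """Trade speed for memory. Call this instead of pipe.to("cuda")."""
    # Keeps only the sub-model currently in use on the GPU (requires accelerate).
    pipe.enable_model_cpu_offload()
    # Decodes the final latent in tiles to cap peak VRAM during VAE decoding.
    pipe.vae.enable_tiling()
    return pipe
```

Both options cost some speed, so it is worth measuring whether you need them before turning them on by default.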

Fine-tuning options range from full fine-tunes (expensive) to LoRA adapters (typically tens to a few hundred megabytes) to textual inversion (a few kilobytes). LoRAs are the sweet spot for most users.
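LoRA files are small because the adapter stores two thin matrices per adapted layer instead of a full weight update. A NumPy sketch of the idea for a single layer, where the 320x768 shape, rank 8, and the name `lora_delta` are illustrative assumptions:

```python
import numpy as np


def lora_delta(d_out, d_in, rank, alpha, rng):
    """Build a low-rank update (alpha / rank) * B @ A for a d_out x d_in weight."""
    A = rng.standard_normal((rank, d_in)) * 0.01  # small random init
    B = np.zeros((d_out, rank))                   # zero init: the update starts as a no-op
    delta = (alpha / rank) * (B @ A)
    return delta, A.size + B.size


rng = np.random.default_rng(0)
delta, lora_params = lora_delta(d_out=320, d_in=768, rank=8, alpha=8, rng=rng)
full_params = 320 * 768

print(lora_params, full_params)  # prints: 8704 245760
```

The frozen base weight is used as `W + delta` at inference. Only `A` and `B` are trained and saved, which here is about 28x fewer values than the full matrix.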

Practical Tips

Write prompts as comma-separated concepts, not full sentences. The CLIP text encoder was trained on alt text, not prose. “Old library, dust motes, golden hour, oil painting” usually beats “An old library where dust motes float in the golden hour light, painted in oils.”

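Use negative prompts to suppress what you do not want: “blurry, extra fingers, watermark, low quality”. The negative prompt takes the place of the empty unconditional prompt in guidance, so the sampler is steered away from it at every step, with the same guidance scale that pulls it toward the positive prompt.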

Fix the seed when iterating on a prompt. Random seeds make it impossible to tell whether your prompt change helped or you just got luckier.
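Both tips map onto arguments of the pipeline call. A minimal sketch, assuming a pipeline loaded as in the hands-on example. The helper name `generate` and its defaults are made up here, and `torch` is imported inside the function only so the snippet can be defined without PyTorch installed.

```python
def generate(pipe, prompt, negative_prompt, seed, device="cuda"):
    """Generate one image with a fixed seed so prompt changes are comparable."""
    import torch  # imported lazily; calling this function requires PyTorch

    # A seeded generator makes the starting noise reproducible.
    generator = torch.Generator(device=device).manual_seed(seed)
    result = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        generator=generator,
        num_inference_steps=30,
        guidance_scale=7.5,
    )
    return result.images[0]
```

Calling `generate` twice with the same seed and different prompts isolates the effect of the prompt change. Results can still differ slightly across GPUs or library versions.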

For consistent characters or styles, train a LoRA on 10 to 30 images rather than wrestling with prompts. A trained adapter reproduces the subject far more reliably than prompt wording alone, and you only pay the training cost once.

Run inference in half precision (fp16 or bf16). The quality cost is negligible and weight memory drops by about half compared with fp32. For tight VRAM, look at int8 or fp8 quantization of the U-Net.
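The memory effect is simple arithmetic on bytes per parameter. A rough sketch, assuming a U-Net of about 860 million parameters (an approximate figure for SD 1.x, used here only for illustration) and counting weights only:

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in GB (10^9 bytes); activations are extra."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


unet_params = 860_000_000  # approximate; treat as an assumption
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(unet_params, dtype):.2f} GB")
```

Under these assumptions the weights take about 3.44 GB in fp32, 1.72 GB in fp16 or bf16, and 0.86 GB in int8. Activations, the VAE, and the text encoder add to that at run time.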

Wrap-up

Stable Diffusion’s combination of a compressed latent space, a U-Net denoiser, and text conditioning is now the template for most open image models. Once you have the mental model, the knobs make sense: steps trade quality for speed, guidance trades fidelity for naturalness, and fine-tuning trades effort for control. Spend an afternoon generating, an evening reading schedulers, and a weekend training a LoRA, and you will know enough to ship.