How diffusion models generate images

Diffusion models generate data by learning to reverse a process that gradually destroys it. Rather than mapping noise to an image in one leap, which is a hard mapping to learn stably, they break generation into many small denoising steps; the model only ever learns to remove a little noise, and sampling applies that one skill hundreds of times (Ho et al. 2020). That decomposition into gentle steps is why diffusion is both stable to train and capable of high-quality samples.

Keep two processes separate. The forward process is fixed and only adds noise, so it needs no learning. The reverse process is learned and removes noise one step at a time. The whole model is a single network trained to predict the noise in a corrupted input.

The forward process

Define a variance schedule $\beta_1,\dots,\beta_T$ and let $\alpha_t = 1-\beta_t$, $\bar\alpha_t = \prod_{s\le t}\alpha_s$. Each step shrinks the previous sample slightly and adds Gaussian noise, $q(x_t\mid x_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\,x_{t-1},\,\beta_t I)$. This is convenient because Gaussian steps compose in closed form, so any noise level is reachable from $x_0$ in one shot:

$$q(x_t \mid x_0) = \mathcal{N}\!\big(\sqrt{\bar\alpha_t}\,x_0,\;(1-\bar\alpha_t)\,I\big), \qquad x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\;\epsilon, \quad \epsilon \sim \mathcal{N}(0,I).$$

Without that closed form you would have to simulate the entire chain to make one training example; with it, you sample a random timestep and corrupt the image in a single operation.
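
The composition claim can be checked numerically. The sketch below assumes the linear schedule used later in this article and a hypothetical helper, `chain_coefficients`, that applies the single-step rule one step at a time while tracking how much signal survives and how much noise variance has accumulated; both quantities should agree with the closed form.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule, as in the code later in this article
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def chain_coefficients(t):
    """Apply steps 0..t one at a time, tracking the signal scale and the noise variance."""
    signal, noise_var = 1.0, 0.0
    for s in range(t + 1):
        signal *= np.sqrt(alphas[s])                  # the mean is scaled by sqrt(1 - beta_s)
        noise_var = alphas[s] * noise_var + betas[s]  # old noise shrinks, fresh noise adds beta_s
    return signal, noise_var

t = 300
signal, noise_var = chain_coefficients(t)
print(np.isclose(signal, np.sqrt(alpha_bar[t])))     # step-by-step signal scale vs sqrt(alpha_bar_t)
print(np.isclose(noise_var, 1.0 - alpha_bar[t]))     # step-by-step noise variance vs 1 - alpha_bar_t
```

The recursion on `noise_var` is the whole argument in one line: if the variance before a step is $1-\bar\alpha_{t-1}$, then after it the variance is $\alpha_t(1-\bar\alpha_{t-1}) + \beta_t = 1-\bar\alpha_t$.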

The reverse process and objective

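For small $\beta_t$ the reverse step is also approximately Gaussian, so the network only has to predict its parameters (in DDPM, the mean, with the variance fixed by the schedule). The key simplification is to predict the noise $\epsilon$ rather than the clean image, which, once the per-timestep weights are dropped, reduces the variational bound to a plain mean-squared error,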

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\,t,\,\epsilon}\big\|\,\epsilon - \epsilon_\theta(x_t, t)\,\big\|^2.$$

Why predict noise: the target then has roughly unit variance at every timestep, so the loss is well-scaled across the whole schedule, whereas predicting $x_0$ directly is easy at low noise and nearly impossible at high noise. This view also connects to score matching: predicting the noise is, up to a timestep-dependent scale, estimating the gradient of the log-density $\nabla_x \log q(x_t)$, which the continuous stochastic-differential-equation formulation makes precise and unifies with the discrete chain (Song et al. 2021). And because the objective is a regression with no discriminator, training is far more stable than GAN training.
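
To make the "up to a scale" statement concrete: the relation is $\nabla_x \log q(x_t) = -\,\mathbb{E}[\epsilon \mid x_t] / \sqrt{1-\bar\alpha_t}$. The sketch below checks it on a toy case where everything is known in closed form, one-dimensional data with $x_0 \sim \mathcal{N}(0, \sigma^2)$. The function names and the Gaussian toy data are assumptions made for illustration, not part of any library.

```python
import numpy as np

def optimal_eps(x_t, alpha_bar_t, data_var):
    """E[eps | x_t] when x_0 ~ N(0, data_var): the best possible noise prediction."""
    marginal_var = alpha_bar_t * data_var + (1.0 - alpha_bar_t)   # variance of q(x_t)
    return np.sqrt(1.0 - alpha_bar_t) * x_t / marginal_var

def score_from_eps(eps, alpha_bar_t):
    """Rescale a noise prediction into an estimate of grad_x log q(x_t)."""
    return -eps / np.sqrt(1.0 - alpha_bar_t)

def true_score(x_t, alpha_bar_t, data_var):
    """Exact score of the zero-mean Gaussian marginal q(x_t)."""
    marginal_var = alpha_bar_t * data_var + (1.0 - alpha_bar_t)
    return -x_t / marginal_var

x_t = np.linspace(-3.0, 3.0, 7)
alpha_bar_t, data_var = 0.5, 4.0
estimated = score_from_eps(optimal_eps(x_t, alpha_bar_t, data_var), alpha_bar_t)
print(np.allclose(estimated, true_score(x_t, alpha_bar_t, data_var)))
```

A trained network only approximates `optimal_eps`, so in practice the rescaled prediction is an approximate score rather than an exact one.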

Sampling: DDPM, DDIM, and fast samplers

Generation starts from pure noise and applies the denoiser repeatedly. The original DDPM sampler takes many small stochastic steps; it is faithful but slow, often hundreds to a thousand network evaluations. DDIM (Song et al. 2021) reuses the same trained network with a deterministic update that can be read as discretizing a probability-flow ODE, which lets you skip timesteps and generate in twenty to fifty steps, trading a little sample quality for a large speedup. The EDM framework (Karras et al. 2022) cleans up the design space and improves both quality and sampling speed, and consistency models (Song et al. 2023) learn a direct map from noise to data for one- or few-step generation. The motivation for all this effort is simple: each step is a full forward pass through a large network, so halving the steps nearly halves the cost.
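
A minimal sketch of the deterministic DDIM update follows, assuming the same linear schedule as the rest of this article and a noise-predicting `model(x, t)`. The helper names `ddim_step` and `ddim_sample` and the evenly spaced timestep grid are illustrative choices, not a reference implementation. Each step converts the noise prediction into an implied clean image and then re-noises it to the next, lower noise level using the same predicted noise. The demo uses a hypothetical oracle that knows the true $x_0$, so the sampler should recover it exactly.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def ddim_step(model, x, t, t_prev):
    """One deterministic DDIM update from timestep t to t_prev (t_prev < 0 means fully clean)."""
    eps = model(x, t)
    ab_t = alpha_bar[t]
    ab_prev = alpha_bar[t_prev] if t_prev >= 0 else 1.0
    x0_pred = (x - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)          # implied clean image
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps   # re-noise to level t_prev

def ddim_sample(model, x_T, num_steps=50):
    """Denoise x_T while visiting only num_steps evenly spaced timesteps."""
    steps = np.linspace(T - 1, 0, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(steps):
        t_prev = steps[i + 1] if i + 1 < num_steps else -1
        x = ddim_step(model, x, t, t_prev)
    return x

# Oracle check: a stand-in "model" that knows x0 and therefore returns the exact noise.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
eps_true = rng.standard_normal(8)
oracle = lambda x, t: (x - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1.0 - alpha_bar[t])
x_T = np.sqrt(alpha_bar[T - 1]) * x0 + np.sqrt(1.0 - alpha_bar[T - 1]) * eps_true
recovered = ddim_sample(oracle, x_T, num_steps=20)
print(np.allclose(recovered, x0))
```

With a real network the prediction is imperfect, so larger jumps accumulate more error; that is the quality cost of using fewer steps.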

Conditioning and guidance

Unconditional diffusion offers no control over what is generated. To add it, a text prompt is typically injected into the denoiser through cross-attention. To make the model adhere to the prompt more strongly, classifier guidance steered samples with a separate classifier's gradient (Dhariwal & Nichol 2021), but the standard now is classifier-free guidance (Ho & Salimans 2022): train a single model with the prompt randomly dropped, so it learns both conditional and unconditional predictions, then at sampling time extrapolate,

$$\hat\epsilon = \epsilon_\theta(x_t, t, \varnothing) + s\,\big(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\big),$$

where the guidance scale $s > 1$ amplifies the prompt's effect at some cost to diversity ($s = 1$ recovers the plain conditional prediction). A separate, easily forgotten knob is the noise schedule itself: a cosine schedule destroys information more gradually than the original linear one, so more timesteps fall at useful noise levels, and it improves samples (Nichol & Dhariwal 2021).
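
The guidance formula is a one-line extrapolation, sketched below. The three-argument `model(x, t, cond)` signature, the use of `None` for the empty prompt $\varnothing$, and `toy_model` itself are assumptions made so the example runs without a trained network.

```python
import numpy as np

def guided_eps(model, x, t, cond, scale):
    """Classifier-free guidance: start at the unconditional prediction and move toward the conditional one."""
    eps_uncond = model(x, t, None)    # None stands in for the empty prompt (assumption of this sketch)
    eps_cond = model(x, t, cond)
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Hypothetical stand-in denoiser: the prompt simply shifts the prediction.
def toy_model(x, t, cond):
    return 0.1 * x if cond is None else 0.1 * x + cond

x = np.ones(4)
cond = np.full(4, 0.5)
print(guided_eps(toy_model, x, 10, cond, scale=1.0))   # same as the conditional prediction
print(guided_eps(toy_model, x, 10, cond, scale=3.0))   # three times as far from the unconditional one
```

Written this way, guidance needs two denoiser evaluations per sampling step, which is worth keeping in mind when counting cost.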

Latent diffusion and architectures

Running diffusion on raw pixels is expensive because every step touches a large array. Latent diffusion (Rombach et al. 2022), the basis of Stable Diffusion, first compresses images into a smaller latent with a pretrained autoencoder, often with an eightfold reduction per spatial dimension, runs the whole process there, and decodes only at the end. The denoiser backbone was historically a U-Net but is increasingly a transformer: DiT (Peebles & Xie 2023) shows diffusion transformers scale cleanly, and Stable Diffusion 3's MMDiT (Esser et al. 2024) pairs a transformer with a rectified-flow objective, a straighter-path reformulation of diffusion that needs fewer sampling steps. Text-to-image systems like Imagen (Saharia et al. 2022) showed that a strong text encoder matters as much as the diffusion model.

Step by step

Fix a noise schedule and precompute the cumulative products $\bar\alpha_t$.
Forward: jump any image to noise level $t$ in one closed-form step.
Train the network to predict the added noise with an MSE loss.
Reverse: each step subtracts a scaled copy of the predicted noise, then adds a little fresh noise back (none at the final step).
Sample by denoising from $t=T$ down to $0$, optionally with guidance.

import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def q_sample(x0, t, noise):                    # forward: jump to noise level t
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * noise

def training_loss(model, x0):
    t = np.random.randint(T)
    noise = np.random.randn(*x0.shape)
    xt = q_sample(x0, t, noise)
    return ((model(xt, t) - noise) ** 2).mean()    # predict the noise

def p_sample_step(model, x, t):                # reverse: one denoising step
    eps = model(x, t)
    mean = (x - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    return mean + (np.sqrt(betas[t]) * np.random.randn(*x.shape) if t > 0 else 0)

def sample(model, shape):
    x = np.random.randn(*shape)                # start from pure noise
    for t in reversed(range(T)):
        x = p_sample_step(model, x, t)
    return x
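
One way to sanity-check a sampler like the one above without training a network is to use toy data for which the ideal noise prediction is known. The sketch below assumes scalar data $x_0 \sim \mathcal{N}(0, 4)$ and a hypothetical `oracle_model` returning $\mathbb{E}[\epsilon \mid x_t]$ in closed form. It repeats the schedule and sampler so that it runs on its own, with an explicit random generator added for reproducibility. If the reverse step is right, the samples should come out with variance close to 4.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
DATA_VAR = 4.0                                    # toy data: x_0 ~ N(0, 4) (assumption)

def oracle_model(x, t):
    """Ideal noise prediction E[eps | x_t] for the Gaussian toy data."""
    marginal_var = alpha_bar[t] * DATA_VAR + (1.0 - alpha_bar[t])
    return np.sqrt(1.0 - alpha_bar[t]) * x / marginal_var

def p_sample_step(model, x, t, rng):
    eps = model(x, t)
    mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    noise = rng.standard_normal(x.shape) if t > 0 else 0.0    # no noise on the last step
    return mean + np.sqrt(betas[t]) * noise

def sample(model, shape, rng):
    x = rng.standard_normal(shape)                # start from pure noise
    for t in reversed(range(T)):
        x = p_sample_step(model, x, t, rng)
    return x

rng = np.random.default_rng(0)
samples = sample(oracle_model, (20000,), rng)
print(samples.mean(), samples.var())              # mean near 0, variance near DATA_VAR
```

The match is approximate rather than exact, because the sampler uses $\beta_t$ as the reverse variance and a finite number of samples.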

Complexity (time and space)

Cost is dominated by sampling: one image costs (number of steps) times (one forward pass), so DDPM at 1000 steps is roughly 20 to 50 times more expensive than DDIM at 50 to 20 steps, and consistency models push toward a single step. Training cost scales with network and dataset size like any deep model, but building each training example is cheap because the target, the added noise, comes for free. Latent diffusion cuts both training and sampling cost by working on arrays about eightfold smaller per side, roughly a 64-fold reduction in spatial positions per step (the exact reduction in elements also depends on the latent channel count).
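
The arithmetic can be made explicit with a deliberately crude proxy: sampler steps times array elements touched per step. The 512x512 RGB image, the 64x64 latent, and in particular the 4 latent channels are assumptions chosen for illustration. Real cost also depends on the architecture, for example attention layers that scale worse than linearly in the number of positions.

```python
def relative_sampling_cost(steps, height, width, channels):
    """Crude proxy: number of sampler steps times array elements touched per step."""
    return steps * height * width * channels

pixel_ddpm = relative_sampling_cost(1000, 512, 512, 3)   # 1000 steps on a 512x512 RGB image
pixel_ddim = relative_sampling_cost(50, 512, 512, 3)     # same image, 50 steps
latent_ddim = relative_sampling_cost(50, 64, 64, 4)      # 8x smaller per side, 4 latent channels (assumed)

print(pixel_ddpm / pixel_ddim)     # 20.0, the step-count ratio
print(pixel_ddim / latent_ddim)    # 48.0, the element-count ratio
```

Spatial positions drop by $8 \times 8 = 64$, but because the assumed latent has 4 channels against 3 for RGB, the element count drops by 48 in this example.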

Worked example

The forward closed form scales the signal by $\sqrt{\bar\alpha_t}$ and the noise by $\sqrt{1-\bar\alpha_t}$, so the two squared scales always sum to one. As $t$ grows the signal fades to almost nothing, which is the near-pure-noise state sampling starts from:

import numpy as np
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1 - betas)

for t in [0, 50, 200, 600, 999]:
    print(f"t={t:4d}  signal={np.sqrt(alpha_bar[t]):.3f}  noise={np.sqrt(1-alpha_bar[t]):.3f}")

# t=   0  signal=1.000  noise=0.010
# t=  50  signal=0.985  noise=0.173
# t= 200  signal=0.810  noise=0.586
# t= 600  signal=0.160  noise=0.987
# t= 999  signal=0.006  noise=1.000
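
The table above is for the linear schedule. For comparison, the sketch below prints the same signal scale under a cosine schedule in the style of Nichol & Dhariwal (2021), which defines $\bar\alpha_t$ directly from a squared cosine. The offset `s=0.008` and the clip `max_beta=0.999` are the commonly quoted defaults and should be treated as assumptions here, as should the helper names.

```python
import numpy as np

T = 1000

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bar(T, s=0.008, max_beta=0.999):
    """Cosine schedule: define alpha_bar via a squared cosine, then clip the implied betas."""
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1.0 + s) * np.pi / 2.0) ** 2
    betas = np.clip(1.0 - f[1:] / f[:-1], 0.0, max_beta)    # beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}
    return np.cumprod(1.0 - betas)

linear, cosine = linear_alpha_bar(T), cosine_alpha_bar(T)
for t in [0, 200, 400, 600, 800, 999]:
    print(f"t={t:4d}  linear signal={np.sqrt(linear[t]):.3f}  cosine signal={np.sqrt(cosine[t]):.3f}")
```

Under the linear schedule the signal is nearly gone by the middle of the chain, while the cosine schedule keeps a usable amount of signal for longer, which is the sense in which it spends more steps at informative noise levels.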

Follow-up questions

Why predict the noise instead of the image? The noise target has roughly unit variance at every timestep, so the loss is well-scaled across the schedule, unlike predicting $x_0$, which is trivial at low noise and hopeless at high noise.
How does the closed-form forward step help training? It lets you corrupt an image to any timestep in one operation, so you never simulate the full chain to make a training example.
What is the link to score matching? Predicting the noise is, up to a scale, estimating $\nabla_x \log q(x_t)$, which the SDE formulation makes exact.
DDPM vs DDIM, and why fewer steps? DDPM is a stochastic chain; DDIM's deterministic sampler can be read as a discretization of the probability-flow ODE, which can be integrated with larger steps. Fewer steps matter because each one is a costly forward pass.
What does classifier-free guidance trade off? Extrapolating between conditioned and unconditioned predictions ($s > 1$) strengthens prompt adherence at the cost of sample diversity.
Why latent diffusion? Diffusing in a compressed latent instead of pixels cuts per-step cost by a large factor, making high-resolution generation affordable.

References

Ho et al., Denoising Diffusion Probabilistic Models (DDPM, 2020).
Song et al., Denoising Diffusion Implicit Models (DDIM, 2021).
Song et al., Score-Based Generative Modeling through SDEs (2021).
Nichol & Dhariwal, Improved Denoising Diffusion Probabilistic Models (2021).
Dhariwal & Nichol, Diffusion Models Beat GANs on Image Synthesis (classifier guidance, 2021).
Ho & Salimans, Classifier-Free Diffusion Guidance (2022).
Rombach et al., High-Resolution Image Synthesis with Latent Diffusion Models (2022).
Saharia et al., Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen, 2022).
Karras et al., Elucidating the Design Space of Diffusion-Based Generative Models (EDM, 2022).
Peebles & Xie, Scalable Diffusion Models with Transformers (DiT, 2023).
Song et al., Consistency Models (2023).
Esser et al., Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (Stable Diffusion 3, 2024).