This post aims to describe the general theory behind Diffusion models and their most basic applications in Image Generation. This post forms the foundation for a later series on Diffusion as it applies to Audio and Music generation.

- Theory
- Forward Process
- Reverse Process
- Training
- Sampling
- Optimizations in Practice
- Covariance Matrix
- Noise Prediction
- Noise Scheduling
- Architecture
- Further Discussion
- Further Reading

# Theory

Broadly speaking, the goal of generative models is to learn the distribution of the training data set. New data is generated by sampling the distribution learned from the underlying dataset.

Distributions are probabilistic in nature, and are learned either by maximizing the likelihood that the model assigns to the training data, or by minimizing a measure of divergence between the model's distribution and the data distribution (often referred to as error).

Diffusion is a modeling technique used for generation, a method inspired by statistical mechanics. It involves converting a base distribution to the target distribution in an iterative fashion. Output from a Markov Chain is used to approximate the learned distribution.

**What is a Markov Chain?**

A Markov Chain is a mathematical model used to represent a sequence of events or states where the probability of each event depends only on the state of the previous event. Here's a concise explanation of the key concepts:

- States: A Markov Chain consists of a set of possible states that a system can be in.
- Transitions: The chain moves from one state to another over time.
- Markov Property: The crucial characteristic is that the probability of moving to any particular state depends only on the current state, not on the sequence of states that preceded it.
- Transition Probabilities: These are the probabilities of moving from one state to another, usually represented in a matrix.
- Memoryless: The process is "memoryless" - the future state depends only on the present state, not the past states.

A simple example to illustrate the concept of a Markov Chain is weather prediction:

- States: Sunny, Rainy, Cloudy
- If it's Sunny today, there might be a 70% chance it's Sunny tomorrow, 20% chance it's Cloudy, and 10% chance it's Rainy.
- These probabilities only depend on today's weather, not on what the weather was like in previous days.
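The weather example can be sketched in a few lines of Python. Only the Sunny row of probabilities comes from the example above; the Cloudy and Rainy rows, along with the names `TRANSITIONS` and `simulate_weather`, are made up here purely for illustration.

```python
import random

STATES = ["Sunny", "Cloudy", "Rainy"]

# P(tomorrow | today), listed in the same order as STATES.
# Only the Sunny row comes from the example above; the Cloudy and Rainy
# rows are made-up numbers so that the chain is complete.
TRANSITIONS = {
    "Sunny": [0.7, 0.2, 0.1],
    "Cloudy": [0.3, 0.4, 0.3],  # assumed
    "Rainy": [0.2, 0.4, 0.4],   # assumed
}


def simulate_weather(start, n_days, seed=0):
    """Sample n_days of weather; each draw depends only on the current state."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_days):
        today = path[-1]
        tomorrow = rng.choices(STATES, weights=TRANSITIONS[today])[0]
        path.append(tomorrow)
    return path


print(simulate_weather("Sunny", 7))
```

Note that `simulate_weather` never looks further back than `path[-1]`, which is the Markov property in code form.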

There are two main parts to a diffusion model: the *forward process* and the *reverse process*.

# Forward Process

- Take a starting image $X_0$ and add a small amount of Gaussian noise.
- Repeat the process for $T$ steps, destroying the features of $X_0$.
- As $T\to\infty$, $X_T$ becomes purely random noise, losing all features.
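A minimal sketch of the forward process might look like the following. It assumes the step form used by Ho et al., where each step scales the image by $\sqrt{1-\beta_t}$ and adds Gaussian noise with variance $\beta_t$. The symbol $\beta_t$, the linear schedule values, and the 8×8 random array standing in for an image are all assumptions made for the sake of the example.

```python
import numpy as np


def make_linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances; these defaults are an assumed linear schedule."""
    return np.linspace(beta_start, beta_end, num_steps)


def forward_step(x_prev, beta_t, rng):
    """One noising step: x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(8, 8))  # stand-in for an image X_0 scaled to [-1, 1]

for beta_t in make_linear_betas():
    x = forward_step(x, beta_t, rng)

# After T steps the pixel statistics should look like a standard Gaussian.
print("mean:", x.mean(), "std:", x.std())
```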

# Reverse Process

- The goal of diffusion is to learn a denoising process that undoes the forward process, step by step
- Starting from random input, the model thus appears to create new data

But the true distribution of the training data isn't known in closed form, so the exact reverse step can't be computed directly; instead we approximate it. This can be thought of as an iterative application of Bayes Rule.

**What is the Bayes Rule, and how does it apply to diffusion models?**

Bayes Rule, also known as Bayes' Theorem, is a fundamental principle in probability theory. It describes the probability of an event based on prior knowledge of conditions that might be related to the event.

The formula is:

$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$

Where:

- P(A|B) is the posterior probability of A given B
- P(B|A) is the likelihood of B given A
- P(A) is the prior probability of A
- P(B) is the marginal probability of B

In the context of diffusion models, Bayes Rule plays a crucial role in the denoising process. Here's how it applies:

a) Prior:

- This represents our initial belief about the distribution of clean images.
- In diffusion models, this is typically a learned distribution.

b) Likelihood:

- This represents the probability of observing a noisy image given a clean image.
- In diffusion models, this is defined by the forward noising process.

c) Posterior:

- This is what we're trying to estimate: the probability of a clean image given a noisy observation.
- This is used in the denoising process to gradually reconstruct the image.

d) Denoising as Bayesian Inference:

- The denoising process in diffusion models can be viewed as a series of Bayesian inference steps.
- At each step, we're using Bayes Rule to estimate the most likely less-noisy image given the current noisy image.

e) Reverse Process as Bayesian Inference:

- The reverse process in diffusion models, which generates images from noise, can be seen as iterative application of Bayes Rule.
- We start with a very noisy image (almost pure noise) and gradually apply Bayes Rule to estimate less noisy versions.
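To make this concrete, here is Bayes Rule written out for a single reverse step. The notation is an assumption not used elsewhere in this post: $q$ denotes the forward (noising) process, and we condition on the clean image $x_0$, which is what makes every term on the right a known Gaussian.

$$
q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1})\; q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}
$$

Here $q(x_t \mid x_{t-1})$ plays the role of the likelihood (one forward noising step), $q(x_{t-1} \mid x_0)$ plays the role of the prior, and $q(x_t \mid x_0)$ is the marginal. The Markov property is what lets us write $q(x_t \mid x_{t-1})$ instead of $q(x_t \mid x_{t-1}, x_0)$.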

When each forward step adds only a small amount of Gaussian noise, the corresponding reverse step can also be treated as Gaussian, with its own mean and variance. The mathematical justification is beyond the scope of this post.

Put simply, Diffusion is the process of trying to learn this mean and variance through a neural network. A loss function can be derived for Diffusion using the variational lower bound, which splits into one term per timestep. With a fixed noise schedule, $L_T$ has nothing to learn and is almost always close to 0, so $\sum_{t=0}^{T-1} L_t \approx L_{\text{total}}$.
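For reference, the per-timestep decomposition given by Ho et al. can be written as below. The notation is an assumption not introduced elsewhere in this post: $q$ is the forward process, $p_\theta$ is the learned reverse process, and $p(x_T)$ is the fixed Gaussian we start sampling from.

$$
L_{\text{total}} = \mathbb{E}_q\Big[\underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}} \underbrace{-\,\log p_\theta(x_0 \mid x_1)}_{L_0}\Big]
$$

The first term compares the fully noised image to pure Gaussian noise and contains no $\theta$, which is why it can be dropped from training.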

Instead of predicting the mean *μ* directly, Ho et al. propose predicting the noise ε for computational expediency. For more information, get nerd sniped by the callout below.

**Mu and Epsilon (in the context of training Diffusion Models)**

In the context of diffusion models, μ and ε refer to different aspects of the noise prediction process:

- ε (epsilon):
  - ε represents the noise that was added to the image at a particular timestep.
  - The model is typically trained to predict this noise.
  - By learning to predict the noise, the model can later reverse the process during generation.
- μ (mu):
  - μ represents the mean of the distribution from which the next (less noisy) image is sampled.
  - It can be derived from the predicted noise (ε) and the current noisy image.

The relationship between predicting ε and μ:

- Direct noise prediction:
  - The model is trained to predict ε directly.
  - This prediction tells us what noise was likely added to get from the original image to the current noisy version.
- Deriving μ from ε:
  - Once we have the predicted ε, we can use it to calculate μ.
  - μ essentially represents our best guess at what the less noisy image should look like.
- Mathematical relationship:
  - $\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$
  - Where $x_t$ is the noisy image at timestep $t$, $\alpha_t$ is a scheduling parameter, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ is its cumulative product, and $\epsilon_\theta$ is the predicted noise.
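As a rough sketch, here is how μ could be computed from a predicted ε using the parameterization from Ho et al., where $\beta_t = 1 - \alpha_t$ and $\bar{\alpha}_t$ is the cumulative product of the $\alpha$ values. The function name and the convention that `t` is a zero-based index into `betas` are assumptions for this example.

```python
import numpy as np


def posterior_mean_from_eps(x_t, eps_pred, t, betas):
    """mu = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_t).

    `t` is a zero-based index into `betas` (an assumption of this sketch).
    """
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # running product of alpha up to each step
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    return (x_t - coef * eps_pred) / np.sqrt(alphas[t])
```

If `eps_pred` happens to be exactly the noise that was added, this reproduces the true mean of the less noisy image given $x_t$ and $x_0$, which is a handy sanity check when wiring up a sampler.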

The advantage of predicting ε instead of μ directly:

- It's often easier for the model to learn to predict the noise rather than the denoised image directly.
- Predicting ε allows for more stable training and better results in practice.

During the generation process, the model uses these predictions to gradually denoise random noise into a coherent image.

## Training

Applying our existing knowledge of the architecture, an algorithm for training Diffusion models might look something like this:

- Take a clean image from the original dataset
- Pick a random timestep $t$ and add the corresponding amount of Gaussian noise to the image
- Predict the noise that was added to the original image
- Minimize the distance between the predicted noise and the actual noise
- Repeat until the loss converges.
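The loss computation for one batch can be sketched as follows, assuming the simplified noise-prediction objective from Ho et al.: sample a random timestep per image, jump straight to that noise level, and take the mean squared error on ε. The `zero_model` function is a hypothetical stand-in for a real network, and the gradient update that would actually minimize the loss is left out.

```python
import numpy as np


def training_loss(model, x0_batch, alpha_bars, rng):
    """Noise-prediction loss (MSE on epsilon) for one batch of clean images."""
    batch = x0_batch.shape[0]
    t = rng.integers(0, len(alpha_bars), size=batch)   # one random timestep per image
    eps = rng.standard_normal(x0_batch.shape)          # the noise the model must recover
    # Reshape so each image's alpha_bar broadcasts over its pixels.
    a = alpha_bars[t].reshape(batch, *([1] * (x0_batch.ndim - 1)))
    x_t = np.sqrt(a) * x0_batch + np.sqrt(1.0 - a) * eps  # noised image at step t
    eps_pred = model(x_t, t)
    return np.mean((eps_pred - eps) ** 2)


def zero_model(x_t, t):
    """Hypothetical stand-in for a network: always predicts zero noise."""
    return np.zeros_like(x_t)


rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)            # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)
x0_batch = rng.uniform(-1.0, 1.0, size=(16, 8, 8))  # stand-in for a batch of images

loss = training_loss(zero_model, x0_batch, alpha_bars, rng)
print("loss for a model that predicts no noise:", loss)
```

In a real training loop, `model` would be a neural network and this loss would be fed to an optimizer on every batch.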

Now, how do we use the trained model in order to synthesize new images?

## Sampling

- Start from an image of pure Gaussian noise.
- Predict the noise present in the current image.
- Subtract a scaled portion of the predicted noise to get a slightly less noisy image.
- Repeat for every timestep, from $T$ down to 1.
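Here is a minimal sketch of that loop, assuming the sampling rule from Ho et al. with the per-step variance fixed to $\beta_t$. To keep it runnable without a trained network, it uses a hypothetical `oracle_model` that knows the "dataset" consists of a single constant image, so we can check that the loop really does walk from noise back to that image.

```python
import numpy as np


def sample(model, shape, betas, rng):
    """Run the reverse process from pure noise down to t = 0."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps_pred = model(x, t)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_pred) / np.sqrt(alphas[t])
        # Add fresh noise at every step except the last one.
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise
    return x


betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)
target = np.full((4, 4), 0.5)          # pretend the dataset is this one image


def oracle_model(x_t, t):
    """Hypothetical perfect noise predictor for a dataset containing only `target`."""
    return (x_t - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])


out = sample(oracle_model, target.shape, betas, np.random.default_rng(0))
print("max distance from target:", np.abs(out - target).max())
```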

Now we’ve covered the basic theory of Diffusion!

# Optimizations in Practice

## Covariance Matrix

Authors of the original paper used a fixed covariance matrix. Intuitively, this makes sense: covariance contributes much less to the reverse process than the mean. Later researchers pointed out that including a modified covariance term improves likelihood estimates while preserving quality of the image generations.
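One such modification, as described by Nichol and Dhariwal, has the network output a value $v$ that interpolates (in log space) between two fixed choices of variance: $\beta_t$ and the smaller $\tilde{\beta}_t$. The sketch below shows only that interpolation. The function name, the zero-based `t`, and the restriction to `t >= 1` (where $\tilde{\beta}_t$ is nonzero) are assumptions of this example.

```python
import numpy as np


def interpolated_variance(v, t, betas):
    """exp(v * log(beta_t) + (1 - v) * log(beta_tilde_t)) for a zero-based t >= 1."""
    if t < 1:
        raise ValueError("this sketch only handles t >= 1, where beta_tilde is nonzero")
    alpha_bars = np.cumprod(1.0 - betas)
    # beta_tilde is the variance of the true reverse step when x_0 is known.
    beta_tilde = (1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * betas[t]
    return np.exp(v * np.log(betas[t]) + (1.0 - v) * np.log(beta_tilde))
```

Setting `v = 1` recovers the fixed variance $\beta_t$, and `v = 0` gives $\tilde{\beta}_t$; a learned `v` lets the model pick something in between for each step.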

Researchers and practitioners often use log-likelihood as one of several metrics to gauge model improvement. It's typically considered alongside other metrics like FID (Fréchet Inception Distance) for a more comprehensive evaluation.

**Impacts of improved Log Likelihood**

Improving the log-likelihood during the training of a diffusion model can have significant impacts on the model's performance and capabilities. Let's break this down:

- The log-likelihood is a measure of how well the model fits the training data.
- Higher log-likelihood indicates that the model assigns higher probability to the observed data.
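As a toy illustration of what "assigns higher probability to the observed data" means, the sketch below scores the same made-up data under two one-dimensional Gaussian "models". Nothing here is specific to diffusion; the data values and function name are invented for the example.

```python
import math


def avg_gaussian_log_likelihood(data, mean, std):
    """Average log-density of the data points under N(mean, std^2)."""
    total = 0.0
    for x in data:
        total += -0.5 * math.log(2 * math.pi * std**2) - (x - mean) ** 2 / (2 * std**2)
    return total / len(data)


data = [4.8, 5.1, 5.0, 4.9, 5.2]  # made-up observations

good_fit = avg_gaussian_log_likelihood(data, mean=5.0, std=0.2)
poor_fit = avg_gaussian_log_likelihood(data, mean=0.0, std=0.2)

print("good fit:", good_fit)
print("poor fit:", poor_fit)
```

The model centered on the data gets the higher (less negative) score, which is the sense in which it "fits better".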

Impacts of Improved Log-Likelihood include better data representation, more efficient sampling, and better compression, but do not always represent a perceptual improvement in output quality. More interesting are the benefits to downstream tasks:

- An improved log-likelihood suggests that the model has learned a more accurate probability distribution of the data.
- In some cases, improved log-likelihood can lead to faster sampling or the ability to use fewer denoising steps while maintaining quality.
- For tasks that use the diffusion model as a component (e.g., image inpainting, super-resolution), a better log-likelihood often translates to improved performance.

Of course, every engineering and business decision also imposes trade-offs:

- Some models might have high log-likelihood but still produce less visually appealing samples.
- Extremely high log-likelihood on the training set without corresponding improvement on a validation set might indicate overfitting.
- Improving log-likelihood often requires more complex models or longer training times, which comes with computational costs.

## Noise Prediction

Diffusion often uses a heavily modified U-Net model for noise prediction. U-Net was originally designed for image segmentation: you pass in an image and receive a segmentation map of the same spatial size, which classifies each pixel. In Diffusion we instead use that image-sized output to predict the noise present in the image at each step. See the callout for more information.

**U-Net Modifications for Diffusion Noise Prediction**

U-Net models, when applied to Noise Prediction in Diffusion, are heavily modified to include an array of modern Computer Vision techniques including but not limited to:

- Positional Embeddings
- ResNet Blocks
- ConvNext Blocks
- Attention Modules (which you may be familiar with from Transformers)

…and a lot more things I don’t have time to explain here. The most important takeaway is that an older architecture from an adjacent discipline was repurposed and combined with newer techniques into something that works better.
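To give a flavor of one item on that list, here is a sketch of a sinusoidal positional embedding applied to the timestep, which is how the network can be told which step of the process it is looking at. It assumes the Transformer-style sine/cosine formulation; the function name, the `max_period` default, and the requirement that `dim` be even are assumptions of this example.

```python
import numpy as np


def timestep_embedding(t, dim, max_period=10000.0):
    """Map integer timesteps (shape [batch]) to vectors of shape [batch, dim].

    Assumes `dim` is even: the first half holds sines, the second half cosines.
    """
    half = dim // 2
    # Frequencies fall off geometrically from 1 towards 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


emb = timestep_embedding([0, 10, 999], dim=8)
print(emb.shape)
```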

## Noise Scheduling

The noise schedule is the rate at which noise is added to an image during the forward process of Diffusion. In a linear schedule, the amount of noise added per step increases linearly with the timestep. As a result, the initial data is converted to noise very quickly, which makes training harder.

Researchers have experimented with other schedules, such as the cosine schedule, which changes slowly near the beginning and end of the process. The result was an easier learning problem and more effective training, but other methods of noise scheduling could still show promise.
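The difference is easier to see numerically. The sketch below computes $\bar{\alpha}_t$ (roughly, the fraction of the original signal that survives at step $t$) for both schedules, assuming the linear range from Ho et al. and the cosine formula with offset $s = 0.008$ from Nichol and Dhariwal. Function names are made up for this example.

```python
import numpy as np


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Signal fraction per step when beta grows linearly (assumed range)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def cosine_alpha_bars(num_steps, s=0.008):
    """Signal fraction per step for the cosine schedule with small offset s."""
    steps = np.arange(num_steps + 1) / num_steps
    f = np.cos((steps + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return (f / f[0])[1:]  # drop t = 0, where the signal fraction is exactly 1


T = 1000
linear = linear_alpha_bars(T)
cosine = cosine_alpha_bars(T)

for t in (250, 500, 750):
    print(f"t={t}: linear={linear[t - 1]:.3f}, cosine={cosine[t - 1]:.3f}")
```

Under these assumptions the cosine schedule keeps noticeably more signal through the middle of the process, which matches the intuition that the linear schedule destroys the image early.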

## Architecture

Here are just a few of the changes proposed by Nichol and Dhariwal to improve training:

- Scaling model width (rather than depth) gives a computationally cheaper way to grow the model
- Adding more attention heads and applying attention at multiple resolutions
- Incorporating *Adaptive Group Normalization* to improve timestep conditioning
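Adaptive Group Normalization is compact enough to sketch directly. Following the definition given by Dhariwal and Nichol, it applies group normalization to the activations and then scales and shifts them with values derived from the timestep embedding. In this sketch the `scale` and `shift` arrays are passed in directly; in a real model they would come from a learned projection, which is omitted here. Function names and shapes are assumptions.

```python
import numpy as np


def group_norm(h, num_groups, eps=1e-5):
    """Normalize h of shape [batch, channels, height, width] within channel groups."""
    b, c, height, width = h.shape
    g = h.reshape(b, num_groups, c // num_groups, height, width)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(h.shape)


def adaptive_group_norm(h, scale, shift, num_groups):
    """AdaGN(h, y) = scale * GroupNorm(h) + shift.

    `scale` and `shift` have shape [batch, channels]; in a real model they
    would be produced from the timestep embedding by a learned layer.
    """
    normed = group_norm(h, num_groups)
    return scale[:, :, None, None] * normed + shift[:, :, None, None]


rng = np.random.default_rng(0)
h = rng.standard_normal((2, 8, 4, 4)) * 3.0 + 5.0  # stand-in activations
out = adaptive_group_norm(h, np.ones((2, 8)), np.zeros((2, 8)), num_groups=4)
print(out.shape)
```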

These architectures evolve constantly, almost weekly, so expect to see variation in the field.

# Further Discussion

- Could Diffusion be used to perform noise-based generation on assets other than Images?