As we established in the chapter introduction, Variational Autoencoders (VAEs) are derived using the principles of variational inference. Our ultimate goal in generative modeling is often to estimate the probability distribution of the observed data, p(x). For models with latent variables z, this involves calculating the marginal likelihood:
$$p(x) = \int p(x, z)\, dz = \int p_\theta(x \mid z)\, p(z)\, dz$$

where p(z) is the prior distribution over the latent variables and pθ(x∣z) is the likelihood of the data given the latent variables, typically parameterized by a decoder network with parameters θ.
However, this integral is frequently intractable for complex models and high-dimensional latent spaces. This intractability extends to the true posterior distribution p(z∣x)=p(x∣z)p(z)/p(x), as its denominator p(x) is the very integral we cannot compute. Variational inference addresses this by introducing an approximation to the true posterior, denoted as qϕ(z∣x). This approximate posterior is typically parameterized by an encoder network with parameters ϕ.
The core idea is to make qϕ(z∣x) as close as possible to the true posterior p(z∣x). We measure this "closeness" using the Kullback-Leibler (KL) divergence, DKL(qϕ(z∣x)∣∣p(z∣x)). Our objective is to find parameters ϕ that minimize this KL divergence.
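For reference, the KL divergence between a distribution q and a distribution p over the same variable z is defined as

$$D_{KL}(q \,\|\, p) = \mathbb{E}_{q(z)}\left[\log \frac{q(z)}{p(z)}\right] \geq 0,$$

with equality if and only if q = p. Both this definition and its non-negativity are used repeatedly in the derivation below.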
Let's begin with the log-likelihood of the data, logp(x), and see how qϕ(z∣x) and the Evidence Lower Bound (ELBO) emerge.
$$\log p(x) = \log \int p(x, z)\, dz$$

We can multiply and divide by qϕ(z∣x) inside the integral (assuming qϕ(z∣x) > 0 where p(x,z) > 0):
$$\log p(x) = \log \int q_\phi(z \mid x)\, \frac{p(x, z)}{q_\phi(z \mid x)}\, dz$$

This can be rewritten as the logarithm of an expectation with respect to qϕ(z∣x):
$$\log p(x) = \log \mathbb{E}_{q_\phi(z \mid x)}\left[\frac{p(x, z)}{q_\phi(z \mid x)}\right]$$

Since the logarithm is a concave function, we can apply Jensen's inequality (log E[Y] ≥ E[log Y]) to move the logarithm inside the expectation:
$$\log p(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p(x, z)}{q_\phi(z \mid x)}\right]$$

This lower bound is precisely the Evidence Lower Bound (ELBO), often denoted as LELBO or simply L(ϕ,θ;x):
$$\mathcal{L}_{\text{ELBO}}(\phi, \theta; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p(x, z) - \log q_\phi(z \mid x)\right]$$

By expanding p(x,z) = pθ(x∣z)p(z), we get another common form:
$$\mathcal{L}_{\text{ELBO}}(\phi, \theta; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z) + \log p(z) - \log q_\phi(z \mid x)\right]$$

The difference between the true log-likelihood log p(x) and the ELBO is exactly the KL divergence between the approximate posterior and the true posterior:
$$\log p(x) - \mathcal{L}_{\text{ELBO}}(\phi, \theta; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log q_\phi(z \mid x) - \log p(z \mid x)\right] = D_{KL}(q_\phi(z \mid x) \,\|\, p(z \mid x))$$

(The first equality follows from substituting p(x,z) = p(z∣x)p(x) into the ELBO and noting that log p(x) does not depend on z, so it passes through the expectation.) So we have the fundamental relationship:
$$\log p(x) = \mathcal{L}_{\text{ELBO}}(\phi, \theta; x) + D_{KL}(q_\phi(z \mid x) \,\|\, p(z \mid x))$$

Since the KL divergence is always non-negative (DKL ≥ 0), the ELBO is indeed a lower bound on the log-likelihood of the data. Maximizing the ELBO with respect to ϕ and θ serves two purposes:

1. It pushes up a lower bound on log p(x), improving the generative model defined by the decoder parameters θ.
2. Because log p(x) does not depend on ϕ, raising the ELBO with respect to ϕ can only shrink DKL(qϕ(z∣x)∣∣p(z∣x)), making the approximate posterior a better match to the true posterior.
The log marginal likelihood log p(x) decomposes into the Evidence Lower Bound (ELBO) and the KL divergence between the approximate posterior qϕ(z∣x) and the true posterior p(z∣x). Maximizing the ELBO therefore simultaneously pushes up a lower bound on log p(x) and drives down this approximation error.
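To make this identity concrete, the following sketch checks log p(x) = ELBO + DKL numerically on a toy one-dimensional model where every quantity has a closed form. The model (z ∼ N(0, 1), x ∣ z ∼ N(z, σx²)), the observed value, and the deliberately suboptimal approximate posterior are all assumptions chosen purely for illustration; this is not part of a VAE implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (illustrative): z ~ N(0, 1), x | z ~ N(z, sigma_x^2)
sigma_x = 0.5
x = 1.3  # a single observed data point

# Exact log-marginal likelihood: p(x) = N(x; 0, 1 + sigma_x^2)
var_marg = 1.0 + sigma_x**2
log_px = -0.5 * (np.log(2 * np.pi * var_marg) + x**2 / var_marg)

# An arbitrary (suboptimal) approximate posterior q(z|x) = N(m, s^2)
m, s = 0.6, 0.4

# Monte Carlo estimate of the ELBO = E_q[log p(x|z) + log p(z) - log q(z|x)]
z = rng.normal(m, s, size=200_000)
log_pxz = -0.5 * (np.log(2 * np.pi * sigma_x**2) + (x - z) ** 2 / sigma_x**2)
log_pz = -0.5 * (np.log(2 * np.pi) + z**2)
log_qz = -0.5 * (np.log(2 * np.pi * s**2) + (z - m) ** 2 / s**2)
elbo = np.mean(log_pxz + log_pz - log_qz)

# The true posterior is Gaussian here, so D_KL(q || p(z|x)) is available exactly
post_var = sigma_x**2 / (1.0 + sigma_x**2)
post_mean = x / (1.0 + sigma_x**2)
kl = np.log(np.sqrt(post_var) / s) + (s**2 + (m - post_mean) ** 2) / (2 * post_var) - 0.5

print(f"log p(x)  = {log_px:.4f}")
print(f"ELBO (MC) = {elbo:.4f}")
print(f"ELBO + KL = {elbo + kl:.4f}  # matches log p(x) up to sampling noise")
```

The Monte Carlo ELBO falls below log p(x) by exactly the analytic KL term (up to sampling noise), and moving q closer to the true posterior closes the gap.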
The ELBO can be rearranged into a more intuitive form that highlights the two primary objectives of a VAE. Starting from LELBO(ϕ,θ;x) = Eqϕ(z∣x)[log pθ(x∣z) + log p(z) − log qϕ(z∣x)], we can group terms:
$$\mathcal{L}_{\text{ELBO}}(\phi, \theta; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathbb{E}_{q_\phi(z \mid x)}\left[\log q_\phi(z \mid x) - \log p(z)\right]$$

The second term is the definition of the KL divergence between qϕ(z∣x) and p(z):
$$D_{KL}(q_\phi(z \mid x) \,\|\, p(z)) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{q_\phi(z \mid x)}{p(z)}\right] = \mathbb{E}_{q_\phi(z \mid x)}\left[\log q_\phi(z \mid x) - \log p(z)\right]$$

Thus, the ELBO becomes:
$$\mathcal{L}_{\text{ELBO}}(\phi, \theta; x) = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right]}_{\text{Reconstruction Likelihood}} - \underbrace{D_{KL}(q_\phi(z \mid x) \,\|\, p(z))}_{\text{KL Regularizer}}$$

Let's examine these two components:
Expected Reconstruction Log-Likelihood: Eqϕ(z∣x)[log pθ(x∣z)]. This term measures how well the decoder pθ(x∣z) can reconstruct the input data x when given a latent code z sampled from the encoder's approximate posterior qϕ(z∣x). It encourages the model to learn latent representations z that retain sufficient information to rebuild x. This is the "autoencoding" part of the VAE. The specific form of log pθ(x∣z) depends on the data type.
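The two most common choices are a Bernoulli decoder for binary or [0, 1]-valued data and a Gaussian decoder for continuous data. Writing x̂ for the decoder's output given a sampled z (notation introduced here for illustration), the corresponding log-likelihoods are:

$$\log p_\theta(x \mid z) = \sum_i \left[ x_i \log \hat{x}_i + (1 - x_i) \log(1 - \hat{x}_i) \right] \quad \text{(Bernoulli)}$$

$$\log p_\theta(x \mid z) = -\frac{1}{2\sigma^2} \lVert x - \hat{x} \rVert^2 + \text{const} \quad \text{(Gaussian with fixed variance } \sigma^2\text{)}$$

The first is the negative binary cross-entropy and the second is, up to constants, a scaled negative squared error, which is why VAE reconstruction terms usually appear as BCE or MSE losses in practice.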
KL Divergence Regularizer: DKL(qϕ(z∣x)∣∣p(z)). This term acts as a regularizer on the latent space. It measures the dissimilarity between the approximate posterior distribution qϕ(z∣x) (produced by the encoder for a given input x) and the prior distribution p(z) over the latent variables. The prior p(z) is typically chosen to be a simple, fixed distribution, most commonly a standard multivariate Gaussian, N(0,I). Because this term enters the ELBO with a negative sign, maximizing the ELBO minimizes the KL divergence, encouraging the encoder to produce latent distributions qϕ(z∣x) that are, on average, close to the prior p(z). This has several benefits: it keeps the latent space smooth and densely packed rather than fragmented into isolated regions, it prevents the encoder from collapsing each qϕ(z∣x) to a near-deterministic point, and it ensures that latent codes sampled from the prior p(z) at generation time resemble the codes the decoder saw during training.
The ELBO comprises two main terms. The first is the expected reconstruction log-likelihood, which pushes the model to accurately reconstruct data. The second is a KL divergence term that regularizes the latent space by encouraging the approximate posterior qϕ(z∣x) to be close to a predefined prior p(z).
A common choice for both the prior p(z) and the approximate posterior qϕ(z∣x) is a multivariate Gaussian distribution. Let p(z) = N(z∣0, I), a standard Gaussian with zero mean and identity covariance matrix. Let the approximate posterior qϕ(z∣x) also be a Gaussian, but with mean μϕ(x) and a diagonal covariance matrix diag(σ²ϕ,1(x), ..., σ²ϕ,J(x)), where J is the dimensionality of the latent space. The encoder network outputs the parameters μϕ(x) and log σ²ϕ(x) (or σϕ(x) directly) for each input x.
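As a concrete illustration, here is a minimal PyTorch sketch of such an encoder; the class name, layer sizes, and latent dimensionality are assumptions chosen for the example rather than fixed requirements.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Maps an input x to the parameters of q_phi(z|x): a mean vector and a
    log-variance vector defining a diagonal-covariance Gaussian.
    Layer sizes and dimensions below are illustrative assumptions."""

    def __init__(self, x_dim=784, hidden_dim=400, z_dim=20):
        super().__init__()
        self.hidden = nn.Linear(x_dim, hidden_dim)
        self.mu = nn.Linear(hidden_dim, z_dim)       # mu_phi(x)
        self.logvar = nn.Linear(hidden_dim, z_dim)   # log sigma^2_phi(x)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.mu(h), self.logvar(h)
```

Predicting log σ²ϕ(x) rather than σ²ϕ(x) keeps the variance positive without any explicit constraint and tends to be more numerically stable.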
For these choices, the KL divergence DKL(qϕ(z∣x)∣∣p(z)) has a convenient analytical solution:
$$D_{KL}\big(\mathcal{N}(\mu_\phi(x), \mathrm{diag}(\sigma_\phi^2(x))) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2} \sum_{j=1}^{J} \left( \mu_{\phi,j}(x)^2 + \sigma_{\phi,j}^2(x) - \log \sigma_{\phi,j}^2(x) - 1 \right)$$

This closed-form expression can be directly incorporated into the VAE's loss function and optimized via gradient descent. The reparameterization trick, which we will discuss in the next section, is essential for backpropagating gradients through the sampling process involved in the expectation over qϕ(z∣x).
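Putting the two ELBO terms together, a per-batch loss might look like the following PyTorch sketch. It assumes a Bernoulli (sigmoid-output) decoder and an encoder that returns μϕ(x) and log σ²ϕ(x); the function and argument names are illustrative, not a fixed API.

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar):
    """Negative ELBO (to be minimized) for one mini-batch.

    recon_x: decoder output in [0, 1], same shape as x (Bernoulli mean)
    x:       input batch with values in [0, 1]
    mu, logvar: encoder outputs, each of shape (batch_size, J)
    """
    # -E_q[log p_theta(x|z)] for a Bernoulli decoder: binary cross-entropy
    recon = F.binary_cross_entropy(recon_x, x, reduction="sum")
    # Closed-form D_KL(q_phi(z|x) || N(0, I)): 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)
    kld = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)
    # Minimizing recon + kld maximizes the ELBO
    return recon + kld
```

In practice a single sample of z per input, drawn with the reparameterization trick discussed next, is typically enough to estimate the reconstruction expectation during training.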
In summary, the ELBO provides a tractable objective function for training VAEs. It elegantly balances the need for accurate data reconstruction with the need for a regularized, smooth latent space suitable for generation. By maximizing the ELBO, we are simultaneously improving our model of the data p(x) and refining our approximation qϕ(z∣x) to the true, intractable posterior p(z∣x). Understanding this formulation is foundational for comprehending how VAEs learn and for developing more advanced VAE architectures and techniques.