As we established in the chapter introduction, Variational Autoencoders (VAEs) are derived using the principles of variational inference. Our ultimate goal in generative modeling is often to estimate the probability distribution of the observed data, p(x). For models with latent variables z, this involves calculating the marginal likelihood:
$$p(x) = \int p(x, z)\, dz = \int p_\theta(x \mid z)\, p(z)\, dz$$

where p(z) is the prior distribution over the latent variables and pθ(x∣z) is the likelihood of the data given the latent variables, typically parameterized by a decoder network with parameters θ.
However, this integral is frequently intractable for complex models and high-dimensional latent spaces. This intractability extends to the true posterior distribution p(z∣x)=p(x∣z)p(z)/p(x), as its denominator p(x) is the very integral we cannot compute. Variational inference addresses this by introducing an approximation to the true posterior, denoted as qϕ(z∣x). This approximate posterior is typically parameterized by an encoder network with parameters ϕ.
The core idea is to make qϕ(z∣x) as close as possible to the true posterior p(z∣x). We measure this "closeness" using the Kullback-Leibler (KL) divergence, DKL(qϕ(z∣x)∣∣p(z∣x)). Our objective is to find parameters ϕ that minimize this KL divergence.
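For reference, the KL divergence between a distribution q and a distribution p over the same variable z is defined as

$$D_{KL}(q \,\|\, p) = \mathbb{E}_{q(z)}\left[\log \frac{q(z)}{p(z)}\right] \geq 0,$$

with equality if and only if q = p. Both this definition and its non-negativity are used repeatedly in the derivation below.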
Let's begin with the log-likelihood of the data, logp(x), and see how qϕ(z∣x) and the Evidence Lower Bound (ELBO) emerge.
$$\log p(x) = \log \int p(x, z)\, dz$$

We can multiply and divide by qϕ(z∣x) inside the integral (assuming qϕ(z∣x) > 0 where p(x,z) > 0):
$$\log p(x) = \log \int q_\phi(z \mid x)\, \frac{p(x, z)}{q_\phi(z \mid x)}\, dz$$

This can be rewritten as the logarithm of an expectation with respect to qϕ(z∣x):
$$\log p(x) = \log \mathbb{E}_{q_\phi(z \mid x)}\left[\frac{p(x, z)}{q_\phi(z \mid x)}\right]$$

Since the logarithm is a concave function, we can apply Jensen's inequality (log E[Y] ≥ E[log Y]) to move the logarithm inside the expectation:
$$\log p(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p(x, z)}{q_\phi(z \mid x)}\right]$$

This lower bound is precisely the Evidence Lower Bound (ELBO), often denoted as LELBO or simply L(ϕ,θ;x):
$$\mathcal{L}_{\text{ELBO}}(\phi, \theta; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p(x, z) - \log q_\phi(z \mid x)\right]$$

By expanding p(x,z) = pθ(x∣z)p(z), we get another common form:
$$\mathcal{L}_{\text{ELBO}}(\phi, \theta; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z) + \log p(z) - \log q_\phi(z \mid x)\right]$$

The difference between the true log-likelihood log p(x) and the ELBO is exactly the KL divergence between the approximate posterior and the true posterior:
$$\log p(x) - \mathcal{L}_{\text{ELBO}}(\phi, \theta; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log q_\phi(z \mid x) - \log p(z \mid x)\right] = D_{KL}(q_\phi(z \mid x) \,\|\, p(z \mid x))$$

(The first equality follows from substituting p(x,z) = p(z∣x)p(x) into the ELBO and noting that log p(x) does not depend on z, so it passes through the expectation.) So we have the fundamental relationship:
$$\log p(x) = \mathcal{L}_{\text{ELBO}}(\phi, \theta; x) + D_{KL}(q_\phi(z \mid x) \,\|\, p(z \mid x))$$

Since the KL divergence is always non-negative (DKL ≥ 0), the ELBO is indeed a lower bound on the log-likelihood of the data. Maximizing the ELBO with respect to ϕ and θ serves two purposes:

1. It pushes up a lower bound on log p(x), improving the generative model defined by the decoder parameters θ.
2. Because log p(x) does not depend on ϕ, raising the ELBO with respect to ϕ can only shrink DKL(qϕ(z∣x)∣∣p(z∣x)), making the approximate posterior a better match to the true posterior.
The log marginal likelihood log p(x) decomposes into the Evidence Lower Bound (ELBO) and the KL divergence between the approximate posterior qϕ(z∣x) and the true posterior p(z∣x). Maximizing the ELBO therefore simultaneously pushes up a lower bound on log p(x) and drives down this approximation error.
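To make this identity concrete, the following sketch checks log p(x) = ELBO + DKL numerically on a toy one-dimensional model where every quantity has a closed form. The model (z ∼ N(0, 1), x ∣ z ∼ N(z, σx²)), the observed value, and the deliberately suboptimal approximate posterior are all assumptions chosen purely for illustration; this is not part of a VAE implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (illustrative): z ~ N(0, 1), x | z ~ N(z, sigma_x^2)
sigma_x = 0.5
x = 1.3  # a single observed data point

# Exact log-marginal likelihood: p(x) = N(x; 0, 1 + sigma_x^2)
var_marg = 1.0 + sigma_x**2
log_px = -0.5 * (np.log(2 * np.pi * var_marg) + x**2 / var_marg)

# An arbitrary (suboptimal) approximate posterior q(z|x) = N(m, s^2)
m, s = 0.6, 0.4

# Monte Carlo estimate of the ELBO = E_q[log p(x|z) + log p(z) - log q(z|x)]
z = rng.normal(m, s, size=200_000)
log_pxz = -0.5 * (np.log(2 * np.pi * sigma_x**2) + (x - z) ** 2 / sigma_x**2)
log_pz = -0.5 * (np.log(2 * np.pi) + z**2)
log_qz = -0.5 * (np.log(2 * np.pi * s**2) + (z - m) ** 2 / s**2)
elbo = np.mean(log_pxz + log_pz - log_qz)

# The true posterior is Gaussian here, so D_KL(q || p(z|x)) is available exactly
post_var = sigma_x**2 / (1.0 + sigma_x**2)
post_mean = x / (1.0 + sigma_x**2)
kl = np.log(np.sqrt(post_var) / s) + (s**2 + (m - post_mean) ** 2) / (2 * post_var) - 0.5

print(f"log p(x)  = {log_px:.4f}")
print(f"ELBO (MC) = {elbo:.4f}")
print(f"ELBO + KL = {elbo + kl:.4f}  # matches log p(x) up to sampling noise")
```

The Monte Carlo ELBO falls below log p(x) by exactly the analytic KL term (up to sampling noise), and moving q closer to the true posterior closes the gap.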
The ELBO can be rearranged into a more intuitive form that highlights the two primary objectives of a VAE. Starting from LELBO(ϕ,θ;x) = Eqϕ(z∣x)[log pθ(x∣z) + log p(z) − log qϕ(z∣x)], we can group terms:
$$\mathcal{L}_{\text{ELBO}}(\phi, \theta; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathbb{E}_{q_\phi(z \mid x)}\left[\log q_\phi(z \mid x) - \log p(z)\right]$$

The second term is the definition of the KL divergence between qϕ(z∣x) and p(z):
$$D_{KL}(q_\phi(z \mid x) \,\|\, p(z)) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{q_\phi(z \mid x)}{p(z)}\right] = \mathbb{E}_{q_\phi(z \mid x)}\left[\log q_\phi(z \mid x) - \log p(z)\right]$$

Thus, the ELBO becomes:
$$\mathcal{L}_{\text{ELBO}}(\phi, \theta; x) = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right]}_{\text{Reconstruction Likelihood}} - \underbrace{D_{KL}(q_\phi(z \mid x) \,\|\, p(z))}_{\text{KL Regularizer}}$$

Let's examine these two components:
Expected Reconstruction Log-Likelihood: Eqϕ(z∣x)[log pθ(x∣z)]. This term measures how well the decoder pθ(x∣z) can reconstruct the input data x when given a latent code z sampled from the encoder's approximate posterior qϕ(z∣x). It encourages the model to learn latent representations z that retain sufficient information to rebuild x. This is the "autoencoding" part of the VAE. The specific form of log pθ(x∣z) depends on the data type.
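The two most common choices are a Bernoulli decoder for binary or [0, 1]-valued data and a Gaussian decoder for continuous data. Writing x̂ for the decoder's output given a sampled z (notation introduced here for illustration), the corresponding log-likelihoods are:

$$\log p_\theta(x \mid z) = \sum_i \left[ x_i \log \hat{x}_i + (1 - x_i) \log(1 - \hat{x}_i) \right] \quad \text{(Bernoulli)}$$

$$\log p_\theta(x \mid z) = -\frac{1}{2\sigma^2} \lVert x - \hat{x} \rVert^2 + \text{const} \quad \text{(Gaussian with fixed variance } \sigma^2\text{)}$$

The first is the negative binary cross-entropy and the second is, up to constants, a scaled negative squared error, which is why VAE reconstruction terms usually appear as BCE or MSE losses in practice.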
KL Divergence Regularizer: DKL(qϕ(z∣x)∣∣p(z)). This term acts as a regularizer on the latent space. It measures the dissimilarity between the approximate posterior distribution qϕ(z∣x) (produced by the encoder for a given input x) and the prior distribution p(z) over the latent variables. The prior p(z) is typically chosen to be a simple, fixed distribution, most commonly a standard multivariate Gaussian, N(0,I). Because this term enters the ELBO with a negative sign, maximizing the ELBO minimizes the KL divergence, encouraging the encoder to produce latent distributions qϕ(z∣x) that are, on average, close to the prior p(z). This has several benefits: it keeps the latent space smooth and densely packed rather than fragmented into isolated regions, it prevents the encoder from collapsing each qϕ(z∣x) to a near-deterministic point, and it ensures that latent codes sampled from the prior p(z) at generation time resemble the codes the decoder saw during training.
The ELBO comprises two main terms. The first is the expected reconstruction log-likelihood, which pushes the model to accurately reconstruct data. The second is a KL divergence term that regularizes the latent space by encouraging the approximate posterior qϕ(z∣x) to be close to a predefined prior p(z).
A common choice for both the prior p(z) and the approximate posterior qϕ(z∣x) is a multivariate Gaussian distribution. Let p(z) = N(z∣0, I), a standard Gaussian with zero mean and identity covariance matrix. Let the approximate posterior qϕ(z∣x) also be a Gaussian, but with mean μϕ(x) and a diagonal covariance matrix diag(σ²ϕ,1(x), ..., σ²ϕ,J(x)), where J is the dimensionality of the latent space. The encoder network outputs the parameters μϕ(x) and log σ²ϕ(x) (or σϕ(x) directly) for each input x.
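As a concrete illustration, here is a minimal PyTorch sketch of such an encoder; the class name, layer sizes, and latent dimensionality are assumptions chosen for the example rather than fixed requirements.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Maps an input x to the parameters of q_phi(z|x): a mean vector and a
    log-variance vector defining a diagonal-covariance Gaussian.
    Layer sizes and dimensions below are illustrative assumptions."""

    def __init__(self, x_dim=784, hidden_dim=400, z_dim=20):
        super().__init__()
        self.hidden = nn.Linear(x_dim, hidden_dim)
        self.mu = nn.Linear(hidden_dim, z_dim)       # mu_phi(x)
        self.logvar = nn.Linear(hidden_dim, z_dim)   # log sigma^2_phi(x)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.mu(h), self.logvar(h)
```

Predicting log σ²ϕ(x) rather than σ²ϕ(x) keeps the variance positive without any explicit constraint and tends to be more numerically stable.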
For these choices, the KL divergence DKL(qϕ(z∣x)∣∣p(z)) has a convenient analytical solution:
$$D_{KL}\big(\mathcal{N}(\mu_\phi(x), \mathrm{diag}(\sigma_\phi^2(x))) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2} \sum_{j=1}^{J} \left( \mu_{\phi,j}(x)^2 + \sigma_{\phi,j}^2(x) - \log \sigma_{\phi,j}^2(x) - 1 \right)$$

This closed-form expression can be directly incorporated into the VAE's loss function and optimized via gradient descent. The reparameterization trick, which we will discuss in the next section, is essential for backpropagating gradients through the sampling process involved in the expectation over qϕ(z∣x).
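Putting the two ELBO terms together, a per-batch loss might look like the following PyTorch sketch. It assumes a Bernoulli (sigmoid-output) decoder and an encoder that returns μϕ(x) and log σ²ϕ(x); the function and argument names are illustrative, not a fixed API.

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar):
    """Negative ELBO (to be minimized) for one mini-batch.

    recon_x: decoder output in [0, 1], same shape as x (Bernoulli mean)
    x:       input batch with values in [0, 1]
    mu, logvar: encoder outputs, each of shape (batch_size, J)
    """
    # -E_q[log p_theta(x|z)] for a Bernoulli decoder: binary cross-entropy
    recon = F.binary_cross_entropy(recon_x, x, reduction="sum")
    # Closed-form D_KL(q_phi(z|x) || N(0, I)): 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)
    kld = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)
    # Minimizing recon + kld maximizes the ELBO
    return recon + kld
```

In practice a single sample of z per input, drawn with the reparameterization trick discussed next, is typically enough to estimate the reconstruction expectation during training.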
In summary, the ELBO provides a tractable objective function for training VAEs. It elegantly balances the need for accurate data reconstruction with the need for a regularized, smooth latent space suitable for generation. By maximizing the ELBO, we are simultaneously improving our model of the data p(x) and refining our approximation qϕ(z∣x) to the true, intractable posterior p(z∣x). Understanding this formulation is foundational for comprehending how VAEs learn and for developing more advanced VAE architectures and techniques.