Transformers fundamentally operate on sequences of tokens, effectively modeling relationships between elements in a 1D structure. This contrasts sharply with the grid-like, 2D (or higher dimensional) nature of images, where convolutional neural networks (CNNs) naturally excel due to their built-in inductive biases for locality and spatial hierarchies. To apply the power of transformer self-attention to image data within a diffusion model, we first need a method to represent an image as a sequence that a transformer can process. The dominant approach borrows directly from the architecture that popularized transformers for vision tasks: the Vision Transformer (ViT).
The core strategy involves dividing the input image into a grid of smaller, non-overlapping patches. Imagine taking an image and cutting it up into a series of square tiles. Each tile, or patch, is then treated as a single "token" in the sequence that will be fed into the transformer.
Let's consider an input image $x \in \mathbb{R}^{H \times W \times C}$, where $H$ is the height, $W$ is the width, and $C$ is the number of channels (e.g., 3 for RGB). We divide this image into $N$ patches, each of size $P \times P$. The total number of patches is $N = (H/P) \times (W/P)$, assuming $H$ and $W$ are divisible by $P$.
Each patch is then flattened into a vector. A single patch originally has dimensions $P \times P \times C$, so its flattened representation becomes a vector of length $P^2 \cdot C$.
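To make the shape bookkeeping concrete, here is a minimal sketch, assuming PyTorch and illustrative sizes (a 224x224 RGB image with 16x16 patches, neither of which comes from the text above), that splits an image into non-overlapping patches and flattens each one into a vector of length $P^2 \cdot C$.

```python
import torch

# Illustrative sizes (assumed): a 224x224 RGB image split into 16x16 patches.
H, W, C, P = 224, 224, 3, 16
N = (H // P) * (W // P)          # number of patches: 14 * 14 = 196
patch_dim = P * P * C            # flattened patch length: 16 * 16 * 3 = 768

x = torch.randn(1, C, H, W)      # one image in channels-first layout

# Split into non-overlapping P x P patches, then flatten each patch.
patches = x.unfold(2, P, P).unfold(3, P, P)      # (1, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)      # (1, H/P, W/P, C, P, P)
patches = patches.reshape(1, N, patch_dim)       # (1, N, P*P*C)

print(patches.shape)  # torch.Size([1, 196, 768])
```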
These flattened patch vectors, while sequential, often have a very high dimensionality ($P^2 \cdot C$). More importantly, they might not be in a representation space that's optimal for the transformer's self-attention mechanism to learn relationships. Therefore, each flattened patch vector is linearly projected into a lower-dimensional embedding space, typically denoted by dimension $D$. This is achieved using a learned linear projection matrix (an embedding layer):
$$\text{Embedding} = \text{Flattened Patch} \times W_e + b_e$$

where $W_e \in \mathbb{R}^{(P^2 \cdot C) \times D}$ is the learnable weight matrix and $b_e \in \mathbb{R}^D$ is the bias term. After this projection, our image is represented as a sequence of $N$ vectors, each of dimension $D$. This sequence, $[z_1, z_2, \ldots, z_N]$ where each $z_i \in \mathbb{R}^D$, is now in a format suitable for input into a standard transformer architecture.
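A minimal sketch of this projection, again assuming PyTorch and the illustrative sizes from the previous snippet: a single `nn.Linear` layer maps each flattened patch of length $P^2 \cdot C$ to a $D$-dimensional embedding, with the layer's weight and bias playing the roles of $W_e$ and $b_e$.

```python
import torch
import torch.nn as nn

# Assumed sizes: 196 patches of 16x16x3 pixels, embedding dimension D = 512.
N, patch_dim, D = 196, 16 * 16 * 3, 512

proj = nn.Linear(patch_dim, D)            # learnable W_e (patch_dim x D) and b_e (D)

patches = torch.randn(1, N, patch_dim)    # flattened patches from the previous step
z = proj(patches)                         # (1, N, D) sequence of patch embeddings
print(z.shape)                            # torch.Size([1, 196, 512])
```

In practice, many ViT-style implementations fuse the patching and projection steps into a single strided convolution (`nn.Conv2d` with kernel size and stride both equal to $P$), which applies the same linear map to every patch in one pass.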
Figure: the process of converting an image into a sequence of patch embeddings suitable for a transformer.
A standard transformer architecture is permutation-equivariant: if you shuffle the input sequence, the outputs are simply shuffled in the same way, because self-attention treats its inputs as an unordered set. However, the spatial arrangement of patches in an image is fundamental information. The patch from the top-left corner carries different contextual information than the patch from the center.
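A quick way to see this property, using PyTorch's `nn.MultiheadAttention` as a stand-in for a transformer block's self-attention (an illustrative choice, not part of the text above): shuffling the input tokens shuffles the outputs in exactly the same way.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
attn.eval()                                  # deterministic, no dropout

z = torch.randn(1, 8, 64)                    # a sequence of 8 token embeddings
perm = torch.randperm(8)

out, _ = attn(z, z, z)                                     # original order
out_perm, _ = attn(z[:, perm], z[:, perm], z[:, perm])     # same tokens, shuffled

# The outputs match up to the same shuffle: attention has no built-in notion of order.
print(torch.allclose(out[:, perm], out_perm, atol=1e-5))   # True
```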
To provide the transformer with this crucial spatial awareness, positional embeddings are added to the patch embeddings. These positional embeddings are vectors of the same dimension D as the patch embeddings. Typically, one unique positional embedding vector is learned for each possible patch position in the grid. Before feeding the sequence into the first transformer block, the corresponding positional embedding is added element-wise to each patch embedding:
$$z_i' = z_i + p_i$$

where $z_i$ is the embedding of the $i$-th patch and $p_i$ is the positional embedding corresponding to the spatial location of the $i$-th patch. These positional embeddings $p_i$ are usually initialized randomly and learned jointly with the rest of the model during training. Alternative strategies exist, such as using fixed sinusoidal positional embeddings, similar to those used in the original transformer paper, sometimes adapted for 2D.
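A minimal sketch of learned positional embeddings, assuming PyTorch and the sizes used in the earlier snippets: one learnable $D$-dimensional vector per patch position, added element-wise to the patch embeddings.

```python
import torch
import torch.nn as nn

N, D = 196, 512                              # assumed number of patches and embedding dim

# One learnable positional vector per patch position, small random initialization.
pos_embed = nn.Parameter(torch.randn(1, N, D) * 0.02)

z = torch.randn(1, N, D)                     # patch embeddings from the projection step
z_prime = z + pos_embed                      # z'_i = z_i + p_i, broadcast over the batch
print(z_prime.shape)                         # torch.Size([1, 196, 512])
```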
This entire process of patching, flattening, linear projection, and adding positional embeddings is the standard input processing pipeline introduced by the Vision Transformer (ViT). ViT demonstrated that a pure transformer architecture, when fed image data processed in this way, could achieve state-of-the-art results on image classification tasks, challenging the dominance of CNNs.
Diffusion Transformers (DiTs), which we will explore next, directly adopt this ViT-style input processing. Instead of feeding the patch sequence into a transformer for classification, DiTs use the transformer as the backbone network within the diffusion framework to predict noise (or the original image). The input to this process in a diffusion context is typically the noisy image xt at a given timestep t. The resulting sequence of processed patch embeddings, now infused with spatial information, serves as the input for the subsequent transformer blocks responsible for the core denoising task. Time and conditioning information are also incorporated, as we will discuss later.
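Putting the pieces together, the sketch below wraps patching, projection, and positional embeddings into a single module and applies it to a noisy input $x_t$. It is an illustrative sketch of how a DiT-style input layer might look, using the strided-convolution formulation mentioned earlier; the module name, the 32x32 4-channel latent shape, and all other sizes are assumptions for the example, not specifics from the text.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patchify an image, project patches to dimension D, and add positional embeddings."""

    def __init__(self, img_size=32, patch_size=2, in_channels=4, embed_dim=512):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A convolution with kernel = stride = P applies the same linear projection
        # to every non-overlapping P x P patch.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.randn(1, self.num_patches, embed_dim) * 0.02)

    def forward(self, x_t):
        z = self.proj(x_t)                    # (B, D, H/P, W/P)
        z = z.flatten(2).transpose(1, 2)      # (B, N, D) sequence of patch embeddings
        return z + self.pos_embed             # inject spatial information

# Example: a batch of noisy 32x32 latents with 4 channels, as in a latent diffusion setup.
x_t = torch.randn(8, 4, 32, 32)
tokens = PatchEmbed()(x_t)
print(tokens.shape)                           # torch.Size([8, 256, 512])
```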
This adaptation effectively translates the image generation problem into a sequence modeling problem solvable by transformers, leveraging their strength in capturing global dependencies between different parts (patches) of the image.