Transformers fundamentally operate on sequences of tokens, effectively modeling relationships between elements in a 1D structure. This contrasts sharply with the grid-like, 2D (or higher dimensional) nature of images, where convolutional neural networks (CNNs) naturally excel due to their built-in inductive biases for locality and spatial hierarchies. To apply the power of transformer self-attention to image data within a diffusion model, we first need a method to represent an image as a sequence that a transformer can process. The dominant approach borrows directly from the architecture that popularized transformers for vision tasks: the Vision Transformer (ViT).
The core strategy involves dividing the input image into a grid of smaller, non-overlapping patches. Imagine taking an image and cutting it up into a series of square tiles. Each tile, or patch, is then treated as a single "token" in the sequence that will be fed into the transformer.
Let's consider an input image $x \in \mathbb{R}^{H \times W \times C}$, where $H$ is the height, $W$ is the width, and $C$ is the number of channels (e.g., 3 for RGB). We divide this image into $N$ patches, each of size $P \times P$. The total number of patches is $N = (H/P) \times (W/P)$, assuming $H$ and $W$ are divisible by $P$.
Each patch is then flattened into a vector. A single patch originally has dimensions $P \times P \times C$, so its flattened representation becomes a vector of length $P^2 \cdot C$.
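To make the shape bookkeeping concrete, here is a minimal sketch, assuming PyTorch and illustrative sizes (a 224x224 RGB image with 16x16 patches, neither of which comes from the text above), that splits an image into non-overlapping patches and flattens each one into a vector of length $P^2 \cdot C$.

```python
import torch

# Illustrative sizes (assumed): a 224x224 RGB image split into 16x16 patches.
H, W, C, P = 224, 224, 3, 16
N = (H // P) * (W // P)          # number of patches: 14 * 14 = 196
patch_dim = P * P * C            # flattened patch length: 16 * 16 * 3 = 768

x = torch.randn(1, C, H, W)      # one image in channels-first layout

# Split into non-overlapping P x P patches, then flatten each patch.
patches = x.unfold(2, P, P).unfold(3, P, P)      # (1, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)      # (1, H/P, W/P, C, P, P)
patches = patches.reshape(1, N, patch_dim)       # (1, N, P*P*C)

print(patches.shape)  # torch.Size([1, 196, 768])
```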
These flattened patch vectors, while sequential, often have a very high dimensionality ($P^2 \cdot C$). More importantly, they might not be in a representation space that's optimal for the transformer's self-attention mechanism to learn relationships. Therefore, each flattened patch vector is linearly projected into a lower-dimensional embedding space, typically denoted by dimension $D$. This is achieved using a learned linear projection matrix (an embedding layer):
$$\text{Embedding} = \text{Flattened Patch} \times W_e + b_e$$

where $W_e \in \mathbb{R}^{(P^2 \cdot C) \times D}$ is the learnable weight matrix and $b_e \in \mathbb{R}^D$ is the bias term. After this projection, our image is represented as a sequence of $N$ vectors, each of dimension $D$. This sequence, $[z_1, z_2, \ldots, z_N]$ where each $z_i \in \mathbb{R}^D$, is now in a format suitable for input into a standard transformer architecture.
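A minimal sketch of this projection, again assuming PyTorch and the illustrative sizes from the previous snippet: a single `nn.Linear` layer maps each flattened patch of length $P^2 \cdot C$ to a $D$-dimensional embedding, with the layer's weight and bias playing the roles of $W_e$ and $b_e$.

```python
import torch
import torch.nn as nn

# Assumed sizes: 196 patches of 16x16x3 pixels, embedding dimension D = 512.
N, patch_dim, D = 196, 16 * 16 * 3, 512

proj = nn.Linear(patch_dim, D)            # learnable W_e (patch_dim x D) and b_e (D)

patches = torch.randn(1, N, patch_dim)    # flattened patches from the previous step
z = proj(patches)                         # (1, N, D) sequence of patch embeddings
print(z.shape)                            # torch.Size([1, 196, 512])
```

In practice, many ViT-style implementations fuse the patching and projection steps into a single strided convolution (`nn.Conv2d` with kernel size and stride both equal to $P$), which applies the same linear map to every patch in one pass.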
Figure: the process of converting an image into a sequence of patch embeddings suitable for a transformer.
A standard transformer architecture is permutation-equivariant: if you shuffle the input sequence, the outputs are simply shuffled in the same way, because self-attention treats its inputs as an unordered set. However, the spatial arrangement of patches in an image is fundamental information. The patch from the top-left corner carries different contextual information than the patch from the center.
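A quick way to see this property, using PyTorch's `nn.MultiheadAttention` as a stand-in for a transformer block's self-attention (an illustrative choice, not part of the text above): shuffling the input tokens shuffles the outputs in exactly the same way.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
attn.eval()                                  # deterministic, no dropout

z = torch.randn(1, 8, 64)                    # a sequence of 8 token embeddings
perm = torch.randperm(8)

out, _ = attn(z, z, z)                                     # original order
out_perm, _ = attn(z[:, perm], z[:, perm], z[:, perm])     # same tokens, shuffled

# The outputs match up to the same shuffle: attention has no built-in notion of order.
print(torch.allclose(out[:, perm], out_perm, atol=1e-5))   # True
```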
To provide the transformer with this crucial spatial awareness, positional embeddings are added to the patch embeddings. These positional embeddings are vectors of the same dimension D as the patch embeddings. Typically, one unique positional embedding vector is learned for each possible patch position in the grid. Before feeding the sequence into the first transformer block, the corresponding positional embedding is added element-wise to each patch embedding:
$$z_i' = z_i + p_i$$

where $z_i$ is the embedding of the $i$-th patch and $p_i$ is the positional embedding corresponding to the spatial location of the $i$-th patch. These positional embeddings $p_i$ are usually initialized randomly and learned jointly with the rest of the model during training. Alternative strategies exist, such as using fixed sinusoidal positional embeddings, similar to those used in the original transformer paper, sometimes adapted for 2D.
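A minimal sketch of learned positional embeddings, assuming PyTorch and the sizes used in the earlier snippets: one learnable $D$-dimensional vector per patch position, added element-wise to the patch embeddings.

```python
import torch
import torch.nn as nn

N, D = 196, 512                              # assumed number of patches and embedding dim

# One learnable positional vector per patch position, small random initialization.
pos_embed = nn.Parameter(torch.randn(1, N, D) * 0.02)

z = torch.randn(1, N, D)                     # patch embeddings from the projection step
z_prime = z + pos_embed                      # z'_i = z_i + p_i, broadcast over the batch
print(z_prime.shape)                         # torch.Size([1, 196, 512])
```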
This entire process of patching, flattening, linear projection, and adding positional embeddings is the standard input processing pipeline introduced by the Vision Transformer (ViT). ViT demonstrated that a pure transformer architecture, when fed image data processed in this way, could achieve state-of-the-art results on image classification tasks, challenging the dominance of CNNs.
Diffusion Transformers (DiTs), which we will explore next, directly adopt this ViT-style input processing. Instead of feeding the patch sequence into a transformer for classification, DiTs use the transformer as the backbone network within the diffusion framework to predict noise (or the original image). The input to this process in a diffusion context is typically the noisy image xt at a given timestep t. The resulting sequence of processed patch embeddings, now infused with spatial information, serves as the input for the subsequent transformer blocks responsible for the core denoising task. Time and conditioning information are also incorporated, as we will discuss later.
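Putting the pieces together, the sketch below wraps patching, projection, and positional embeddings into a single module and applies it to a noisy input $x_t$. It is an illustrative sketch of how a DiT-style input layer might look, using the strided-convolution formulation mentioned earlier; the module name, the 32x32 4-channel latent shape, and all other sizes are assumptions for the example, not specifics from the text.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patchify an image, project patches to dimension D, and add positional embeddings."""

    def __init__(self, img_size=32, patch_size=2, in_channels=4, embed_dim=512):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A convolution with kernel = stride = P applies the same linear projection
        # to every non-overlapping P x P patch.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.randn(1, self.num_patches, embed_dim) * 0.02)

    def forward(self, x_t):
        z = self.proj(x_t)                    # (B, D, H/P, W/P)
        z = z.flatten(2).transpose(1, 2)      # (B, N, D) sequence of patch embeddings
        return z + self.pos_embed             # inject spatial information

# Example: a batch of noisy 32x32 latents with 4 channels, as in a latent diffusion setup.
x_t = torch.randn(8, 4, 32, 32)
tokens = PatchEmbed()(x_t)
print(tokens.shape)                           # torch.Size([8, 256, 512])
```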
This adaptation effectively translates the image generation problem into a sequence modeling problem solvable by transformers, leveraging their strength in capturing global dependencies between different parts (patches) of the image.