As discussed previously, the self-attention mechanism operates on input elements without an inherent understanding of their order in the sequence. This permutation invariance is powerful for capturing relationships regardless of distance but fails to model the sequential nature essential for language and other ordered data. To address this, the Transformer architecture introduces positional encodings, which are vectors added to the input embeddings to provide the model with information about the position of each token.
The original Transformer paper ("Attention Is All You Need") proposed a fixed, non-learned method using sine and cosine functions of different frequencies. This approach, known as sinusoidal positional encoding, generates a unique encoding vector for each position in the sequence up to a predefined maximum length.
For a token at position $pos$ in the sequence (where $pos = 0, 1, 2, \dots$), the positional encoding $PE$ is a $d_{\text{model}}$-dimensional vector whose entries are filled in pairs. For each pair index $i = 0, 1, \dots, d_{\text{model}}/2 - 1$, a sine function is applied at the even dimension index $2i$ and a cosine function at the odd dimension index $2i+1$:

$$
PE_{(pos,\, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
\qquad
PE_{(pos,\, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$
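The following is a minimal NumPy sketch of this formula. The function name `sinusoidal_positional_encoding` and the arguments `max_len` and `d_model` are chosen here for illustration, and the code assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(max_len)[:, np.newaxis]                 # shape (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]           # the 2i indices
    angles = positions / np.power(10000.0, even_dims / d_model)   # pos / 10000^(2i/d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # sine on even dimensions 2i
    pe[:, 1::2] = np.cos(angles)   # cosine on odd dimensions 2i+1
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=128)
print(pe.shape)  # (50, 128)
```

Each row of `pe` is the encoding for one position; as described above, it would be added elementwise to the corresponding token embedding before the first attention layer.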
The term $1/10000^{2i/d_{\text{model}}}$ is the angular frequency of the sine and cosine waves assigned to dimension pair $i$. Its behavior across the dimensions is worth examining (see the short sketch after this list):

- For $i = 0$, the frequency is $1$ and the wavelength is $2\pi$, so the first two dimensions oscillate rapidly as $pos$ increases.
- As $i$ grows, the frequency shrinks geometrically, so higher dimension pairs oscillate more and more slowly.
- For the largest pair index, the wavelength approaches $10000 \cdot 2\pi$, so the last dimensions change only gradually even over very long sequences.
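To make these numbers concrete, the short sketch below (assuming $d_{\text{model}} = 128$, chosen only for illustration) prints the frequency and wavelength of a few dimension pairs:

```python
import numpy as np

d_model = 128
for i in [0, 16, 32, 63]:                      # a few dimension pairs (2i, 2i+1)
    freq = 1.0 / 10000 ** (2 * i / d_model)    # angular frequency of pair i
    wavelength = 2 * np.pi / freq              # positions per full oscillation
    print(f"i={i:3d}  frequency={freq:.2e}  wavelength={wavelength:,.1f}")
```

The printed wavelengths grow geometrically from $2\pi$ for the first pair toward $10000 \cdot 2\pi$ for the last.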
This design creates positional encoding vectors in which each sine/cosine pair of dimensions oscillates at its own frequency. The combination of these sinusoids across all dimensions generates a unique signature for each position $pos$.
Heatmap showing the sinusoidal positional encoding values for the first 50 positions (sampled every 2 positions) across the first 128 dimensions (sampled every 4 dimensions). Each row represents a dimension index, and each column represents a position in the sequence. The color intensity indicates the value of the encoding (ranging from -1 to 1). Notice the different frequencies of oscillation along the dimension axis (y-axis).
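A plot like the one described above can be reproduced with matplotlib; the sketch below recomputes the encodings and subsamples them in the same way (every 2nd position, every 4th dimension). The colormap and labels are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

max_len, d_model = 50, 128
positions = np.arange(max_len)[:, np.newaxis]
even_dims = np.arange(0, d_model, 2)[np.newaxis, :]
angles = positions / np.power(10000.0, even_dims / d_model)

pe = np.zeros((max_len, d_model))
pe[:, 0::2] = np.sin(angles)
pe[:, 1::2] = np.cos(angles)

# Keep every 2nd position and every 4th dimension, then transpose so
# dimensions run along the y-axis and positions along the x-axis.
plt.imshow(pe[::2, ::4].T, cmap="RdBu", aspect="auto", vmin=-1, vmax=1)
plt.xlabel("Position in sequence (every 2nd)")
plt.ylabel("Embedding dimension (every 4th)")
plt.colorbar(label="Encoding value")
plt.title("Sinusoidal positional encodings")
plt.show()
```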
An important property follows from the trigonometric angle-addition identities $\sin(a+b) = \sin a \cos b + \cos a \sin b$ and $\cos(a+b) = \cos a \cos b - \sin a \sin b$: the encoding for position $pos + k$ can be expressed as a linear transformation of the encoding for position $pos$. Concretely, within each dimension pair $(2i, 2i+1)$, $PE_{pos+k}$ is obtained from $PE_{pos}$ by a $2 \times 2$ rotation whose angle depends only on the offset $k$ and the pair's frequency, not on the absolute position. Because the relationship between the encodings of tokens separated by a fixed distance $k$ is therefore the same everywhere in the sequence, this linearity may make it easier for the model to learn to attend based on relative positions.
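The sketch below checks this property numerically for one pair of positions; the helper `pe_vector` and the values of `pos` and `k` are illustrative.

```python
import numpy as np

d_model, pos, k = 128, 7, 5

def pe_vector(p, d_model):
    """Sinusoidal encoding for a single position p (same formula as above)."""
    even_dims = np.arange(0, d_model, 2)
    angles = p / np.power(10000.0, even_dims / d_model)
    pe = np.zeros(d_model)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

pe_pos, pe_shifted = pe_vector(pos, d_model), pe_vector(pos + k, d_model)

# For each pair (2i, 2i+1), PE(pos + k) is a rotation of PE(pos) by the
# angle omega_i * k, where omega_i = 1 / 10000^(2i / d_model).
reconstructed = np.zeros(d_model)
for i in range(d_model // 2):
    angle = k / 10000 ** (2 * i / d_model)
    rot = np.array([[np.cos(angle),  np.sin(angle)],
                    [-np.sin(angle), np.cos(angle)]])
    reconstructed[2 * i : 2 * i + 2] = rot @ pe_pos[2 * i : 2 * i + 2]

print(np.allclose(reconstructed, pe_shifted))  # True
```

Note that the rotation matrix depends only on `k` and the dimension pair, never on `pos`, which is exactly the consistency described above.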
In summary, the sinusoidal positional encoding provides a deterministic and computationally efficient way to inject sequence order information into the Transformer. It generates unique positional signatures and possesses properties beneficial for modeling relative positions, without requiring additional learnable parameters.