As discussed previously, the self-attention mechanism operates on input elements without an inherent understanding of their order in the sequence. This permutation invariance is powerful for capturing relationships regardless of distance but fails to model the sequential nature essential for language and other ordered data. To address this, the Transformer architecture introduces positional encodings, which are vectors added to the input embeddings to provide the model with information about the position of each token.
The original Transformer paper ("Attention Is All You Need") proposed a fixed, non-learned method using sine and cosine functions of different frequencies. This approach, known as sinusoidal positional encoding, generates a unique encoding vector for each position in the sequence up to a predefined maximum length.
For a token at position $pos$ in the sequence (where $pos = 0, 1, 2, \dots$), the positional encoding $PE$ is a $d_{\text{model}}$-dimensional vector whose entries are filled in pairs. For each pair index $i = 0, 1, \dots, d_{\text{model}}/2 - 1$, a sine function is applied at the even dimension index $2i$ and a cosine function at the odd dimension index $2i+1$:

$$
PE_{(pos,\, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
\qquad
PE_{(pos,\, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$
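The following is a minimal NumPy sketch of this formula. The function name `sinusoidal_positional_encoding` and the arguments `max_len` and `d_model` are chosen here for illustration, and the code assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(max_len)[:, np.newaxis]                 # shape (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]           # the 2i indices
    angles = positions / np.power(10000.0, even_dims / d_model)   # pos / 10000^(2i/d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # sine on even dimensions 2i
    pe[:, 1::2] = np.cos(angles)   # cosine on odd dimensions 2i+1
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=128)
print(pe.shape)  # (50, 128)
```

Each row of `pe` is the encoding for one position; as described above, it would be added elementwise to the corresponding token embedding before the first attention layer.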
The term $1/10000^{2i/d_{\text{model}}}$ is the angular frequency of the sine and cosine waves assigned to dimension pair $i$. Its behavior across the dimensions is worth examining (see the short sketch after this list):

- For $i = 0$, the frequency is $1$ and the wavelength is $2\pi$, so the first two dimensions oscillate rapidly as $pos$ increases.
- As $i$ grows, the frequency shrinks geometrically, so higher dimension pairs oscillate more and more slowly.
- For the largest pair index, the wavelength approaches $10000 \cdot 2\pi$, so the last dimensions change only gradually even over very long sequences.
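To make these numbers concrete, the short sketch below (assuming $d_{\text{model}} = 128$, chosen only for illustration) prints the frequency and wavelength of a few dimension pairs:

```python
import numpy as np

d_model = 128
for i in [0, 16, 32, 63]:                      # a few dimension pairs (2i, 2i+1)
    freq = 1.0 / 10000 ** (2 * i / d_model)    # angular frequency of pair i
    wavelength = 2 * np.pi / freq              # positions per full oscillation
    print(f"i={i:3d}  frequency={freq:.2e}  wavelength={wavelength:,.1f}")
```

The printed wavelengths grow geometrically from $2\pi$ for the first pair toward $10000 \cdot 2\pi$ for the last.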
This design creates positional encoding vectors in which each sine/cosine pair of dimensions oscillates at its own frequency. The combination of these sinusoids across all dimensions generates a unique signature for each position $pos$.
Heatmap showing the sinusoidal positional encoding values for the first 50 positions (sampled every 2 positions) across the first 128 dimensions (sampled every 4 dimensions). Each row represents a dimension index, and each column represents a position in the sequence. The color intensity indicates the value of the encoding (ranging from -1 to 1). Notice the different frequencies of oscillation along the dimension axis (y-axis).
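A plot like the one described above can be reproduced with matplotlib; the sketch below recomputes the encodings and subsamples them in the same way (every 2nd position, every 4th dimension). The colormap and labels are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

max_len, d_model = 50, 128
positions = np.arange(max_len)[:, np.newaxis]
even_dims = np.arange(0, d_model, 2)[np.newaxis, :]
angles = positions / np.power(10000.0, even_dims / d_model)

pe = np.zeros((max_len, d_model))
pe[:, 0::2] = np.sin(angles)
pe[:, 1::2] = np.cos(angles)

# Keep every 2nd position and every 4th dimension, then transpose so
# dimensions run along the y-axis and positions along the x-axis.
plt.imshow(pe[::2, ::4].T, cmap="RdBu", aspect="auto", vmin=-1, vmax=1)
plt.xlabel("Position in sequence (every 2nd)")
plt.ylabel("Embedding dimension (every 4th)")
plt.colorbar(label="Encoding value")
plt.title("Sinusoidal positional encodings")
plt.show()
```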
An important property follows from the trigonometric angle-addition identities $\sin(a+b) = \sin a \cos b + \cos a \sin b$ and $\cos(a+b) = \cos a \cos b - \sin a \sin b$: the encoding for position $pos + k$ can be expressed as a linear transformation of the encoding for position $pos$. Concretely, within each dimension pair $(2i, 2i+1)$, $PE_{pos+k}$ is obtained from $PE_{pos}$ by a $2 \times 2$ rotation whose angle depends only on the offset $k$ and the pair's frequency, not on the absolute position. Because the relationship between the encodings of tokens separated by a fixed distance $k$ is therefore the same everywhere in the sequence, this linearity may make it easier for the model to learn to attend based on relative positions.
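The sketch below checks this property numerically for one pair of positions; the helper `pe_vector` and the values of `pos` and `k` are illustrative.

```python
import numpy as np

d_model, pos, k = 128, 7, 5

def pe_vector(p, d_model):
    """Sinusoidal encoding for a single position p (same formula as above)."""
    even_dims = np.arange(0, d_model, 2)
    angles = p / np.power(10000.0, even_dims / d_model)
    pe = np.zeros(d_model)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

pe_pos, pe_shifted = pe_vector(pos, d_model), pe_vector(pos + k, d_model)

# For each pair (2i, 2i+1), PE(pos + k) is a rotation of PE(pos) by the
# angle omega_i * k, where omega_i = 1 / 10000^(2i / d_model).
reconstructed = np.zeros(d_model)
for i in range(d_model // 2):
    angle = k / 10000 ** (2 * i / d_model)
    rot = np.array([[np.cos(angle),  np.sin(angle)],
                    [-np.sin(angle), np.cos(angle)]])
    reconstructed[2 * i : 2 * i + 2] = rot @ pe_pos[2 * i : 2 * i + 2]

print(np.allclose(reconstructed, pe_shifted))  # True
```

Note that the rotation matrix depends only on `k` and the dimension pair, never on `pos`, which is exactly the consistency described above.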
In summary, the sinusoidal positional encoding provides a deterministic and computationally efficient way to inject sequence order information into the Transformer. It generates unique positional signatures and possesses properties beneficial for modeling relative positions, without requiring additional learnable parameters.