As we move from sequence models like Recurrent Neural Networks (RNNs) to the Transformer architecture, we encounter a fundamental difference in how sequential data is processed. RNNs, by their very nature, process input tokens one after another. This sequential processing inherently incorporates the order of the tokens. The hidden state at step t is a function of the input at step t and the hidden state at step t−1. This dependency chain naturally encodes the position of each element within the sequence.
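To make that dependency concrete, here is a minimal NumPy sketch of a single vanilla (Elman-style) recurrent step. The weights and dimensions are illustrative assumptions, not taken from any particular model; the point is that feeding the same tokens in a different order produces a different final hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
W_x = rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
b = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """h_t = tanh(W_x x_t + W_h h_{t-1} + b): each state depends on the previous one."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

tokens = rng.normal(size=(3, input_dim))  # three toy token embeddings

h = np.zeros(hidden_dim)
for x_t in tokens:            # process in the original order
    h = rnn_step(x_t, h)

h_rev = np.zeros(hidden_dim)
for x_t in tokens[::-1]:      # process the same tokens in reverse order
    h_rev = rnn_step(x_t, h_rev)

print(np.allclose(h, h_rev))  # False: the order of the tokens changes the result
```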
Transformers, however, take a different approach. As discussed in the previous chapter, the self-attention mechanism allows the model to weigh the importance of all tokens in the input sequence simultaneously when calculating the representation for any given token. Input embeddings flow through the self-attention layers largely in parallel. While this parallelism is a significant advantage for computation speed and capturing long-range dependencies, it comes at a cost: the core self-attention mechanism itself does not inherently consider the order of the input tokens.
Consider the standard scaled dot-product attention calculation:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Here, Q (Query), K (Key), and V (Value) matrices are derived from the input embeddings. If we were to shuffle the order of the input tokens (and their corresponding embeddings), the set of computed attention scores between pairs of tokens would remain the same, although they would be associated with different output positions. The mechanism calculates how much each token should attend to every other token based on their embedding content, not their position.
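As a concrete illustration, here is a minimal single-head NumPy sketch of this computation. It assumes Q, K, and V have already been produced from the input embeddings; the projection step is omitted.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for 2-D matrices."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of value vectors

# Note that nothing in this function refers to token positions: only the
# vector contents of Q, K, and V determine the attention weights.
```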
Let's illustrate with a simple example. Imagine the sentences:

- "Dog bites man."
- "Man bites dog."
These sentences have vastly different meanings, conveyed entirely by the order of the words. If we simply feed the word embeddings for "dog," "bites," and "man" into a basic self-attention layer without any positional context, the model cannot distinguish between these two scenarios. It sees the same collection of input vectors, merely arranged in different rows of its internal matrices. The attention calculation is permutation-equivariant with respect to its inputs: reordering the tokens merely reorders the outputs, so it lacks the built-in sequential awareness of RNNs.
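The short sketch below makes this concrete. It uses toy random embeddings for "dog," "bites," and "man" and, for simplicity, applies self-attention with identity projections (Q = K = V = X); both choices are illustrative assumptions only. Each word ends up with exactly the same output vector in both orderings.

```python
import numpy as np

def self_attention(X):
    """Single-head self-attention with identity projections (Q = K = V = X)."""
    d_k = X.shape[-1]
    scores = X @ X.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(1)
emb = {w: rng.normal(size=8) for w in ["dog", "bites", "man"]}

# "Dog bites man" vs. "Man bites dog": same embeddings, different order.
x1 = np.stack([emb["dog"], emb["bites"], emb["man"]])
x2 = np.stack([emb["man"], emb["bites"], emb["dog"]])
out1, out2 = self_attention(x1), self_attention(x2)

# Each word receives an identical representation in both sentences.
print(np.allclose(out1[0], out2[2]))  # "dog"   -> True
print(np.allclose(out1[1], out2[1]))  # "bites" -> True
print(np.allclose(out1[2], out2[0]))  # "man"   -> True
```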
Without explicit information about token positions, the Transformer would essentially perceive the input as an unordered set or "bag" of tokens. This loss of sequential order is detrimental to most natural language processing tasks, where word order is fundamental to meaning, grammar, and syntax.
Therefore, to enable the Transformer to understand the sequence order, we must provide it with explicit positional information. We need a way to inject signals into the input representations that tell the model where each token appears in the sequence (e.g., "this is the first word," "this is the second word," etc.). This injected information must be combined with the token embeddings before they are processed by the main encoder or decoder stacks. The next section, "Positional Encoding Explained," details how this is typically achieved using mathematical functions.
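As a structural preview, the sketch below shows only where that injection happens in the pipeline. Here `positional_signal` is a hypothetical placeholder for the actual encoding function developed in the next section, and the shapes are illustrative assumptions.

```python
import numpy as np

def positional_signal(seq_len, d_model):
    """Hypothetical stand-in: one fixed, distinct vector per position.
    The functions actually used in practice are covered in the next section."""
    rng = np.random.default_rng(42)
    return rng.normal(size=(seq_len, d_model))

seq_len, d_model = 3, 8
token_embeddings = np.random.default_rng(0).normal(size=(seq_len, d_model))

# The positional signal is added element-wise to the token embeddings before
# the encoder/decoder stacks, so the same token at different positions now
# produces a different input vector, breaking the permutation symmetry above.
encoder_input = token_embeddings + positional_signal(seq_len, d_model)
```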