Before we explore the attention mechanism that powers Transformer models, it's helpful to understand how sequential data was traditionally handled. For tasks like translating sentences or predicting the next word in a text, the order of elements matters significantly. Feed-forward neural networks, which process inputs independently, aren't inherently suited for capturing these sequential dependencies. This led to the development of Recurrent Neural Networks (RNNs).
The defining characteristic of an RNN is its internal "memory" or hidden state. Unlike a standard feed-forward network, an RNN processes a sequence element by element (e.g., word by word). At each step, it takes the current input element and the hidden state from the previous step to compute a new hidden state. This hidden state acts as a running summary of the information seen so far in the sequence.
Think of it like reading a sentence: you process words one after another, and your understanding at any point depends on the words you've already read. The RNN's hidden state tries to capture this evolving context.
At its core, an RNN cell performs the same computation at each time step, but its internal state changes based on the input sequence. If we have an input sequence $x = (x_1, x_2, \dots, x_T)$, the RNN updates its hidden state $h_t$ at time step $t$ using the current input $x_t$ and the previous hidden state $h_{t-1}$. A common formulation for this update is:
$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$

Here:

- $h_t$ is the new hidden state at time step $t$ and $h_{t-1}$ is the hidden state from the previous step.
- $x_t$ is the input at time step $t$.
- $W_{hh}$ is the hidden-to-hidden weight matrix and $W_{xh}$ is the input-to-hidden weight matrix.
- $b_h$ is the hidden bias vector.
- $\tanh$ is the nonlinear activation, which keeps the hidden state values in a bounded range.
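To make this concrete, here is a minimal NumPy sketch of a single hidden-state update. The sizes, the random weight initialization, and the variable names are illustrative assumptions for this sketch, not part of any particular library.

```python
import numpy as np

# Illustrative dimensions (assumptions for this sketch).
input_size, hidden_size = 4, 3

rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden weights
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input-to-hidden weights
b_h = np.zeros(hidden_size)                                    # hidden bias

def rnn_step(x_t, h_prev):
    """One RNN update: h_t = tanh(W_hh @ h_prev + W_xh @ x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

h_prev = np.zeros(hidden_size)      # initial hidden state, commonly all zeros
x_t = rng.normal(size=input_size)   # one input element (e.g., a word embedding)
h_t = rnn_step(x_t, h_prev)
print(h_t.shape)  # (3,)
```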
Optionally, the RNN can also produce an output $y_t$ at each time step, often calculated from the hidden state:

$$y_t = W_{hy} h_t + b_y$$

where $W_{hy}$ is the hidden-to-output weight matrix and $b_y$ is the output bias. Whether an output is needed at every step depends on the specific task (e.g., predicting the next word at each step vs. classifying the entire sequence).
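The per-step output is just an affine map of the current hidden state. In this small self-contained sketch, the hidden state is a random stand-in for one computed as above, and output_size is an arbitrary assumption (for next-word prediction it would typically be the vocabulary size).

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, output_size = 3, 5   # illustrative sizes (assumptions)

W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden-to-output weights
b_y = np.zeros(output_size)                                    # output bias

h_t = rng.normal(size=hidden_size)  # stand-in for a hidden state computed as above
y_t = W_hy @ h_t + b_y              # raw output scores at time step t
print(y_t.shape)  # (5,)
```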
It's often easier to visualize an RNN by "unrolling" it through time. Imagine making a separate copy of the network for each time step, with the hidden state passed from one copy to the next.
An RNN unrolled through time. The same RNN cell (with shared weights) processes input $x_t$ and the previous hidden state $h_{t-1}$ to produce the new hidden state $h_t$ and output $y_t$. The hidden state flows from one time step to the next.
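The unrolled view corresponds directly to a loop in code: one set of weights is reused at every time step while the hidden state is carried forward. The sketch below, with arbitrary sizes and random inputs chosen purely for illustration, processes a toy sequence and collects the hidden states and outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, input_size, hidden_size, output_size = 6, 4, 3, 5  # illustrative sizes

# One set of weights, shared across all time steps (the "same cell" in the unrolled view).
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

xs = rng.normal(size=(seq_len, input_size))  # toy input sequence x_1 ... x_T

h = np.zeros(hidden_size)  # h_0: initial hidden state
hidden_states, outputs = [], []
for x_t in xs:                                  # the unrolled loop over time steps
    h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)    # carry the hidden state forward
    y_t = W_hy @ h + b_y                        # optional per-step output
    hidden_states.append(h)
    outputs.append(y_t)

print(len(hidden_states), outputs[-1].shape)  # 6 (5,)
```

Note that the loop is inherently sequential: the hidden state at step $t$ cannot be computed until step $t-1$ has finished, a property that becomes important in the next section.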
This recurrent structure allows RNNs, in principle, to model dependencies spanning arbitrary distances in a sequence, since information can propagate through the hidden states from one step to the next. For many years, RNNs (and their more sophisticated variants, such as LSTMs and GRUs, designed to better handle long-range dependencies) were the standard choice for sequence modeling tasks.
However, as we will see in the next section, processing sequences purely sequentially like this comes with its own set of significant challenges, particularly when dealing with very long sequences or complex dependencies. These limitations motivated the development of alternative architectures, ultimately leading to the Transformer.