Before we explore the attention mechanism that powers Transformer models, it's helpful to understand how sequential data was traditionally handled. For tasks like translating sentences or predicting the next word in a text, the order of elements matters significantly. Feed-forward neural networks, which process inputs independently, aren't inherently suited for capturing these sequential dependencies. This led to the development of Recurrent Neural Networks (RNNs).
The defining characteristic of an RNN is its internal "memory" or hidden state. Unlike a standard feed-forward network, an RNN processes a sequence element by element (e.g., word by word). At each step, it takes the current input element and the hidden state from the previous step to compute a new hidden state. This hidden state acts as a running summary of the information seen so far in the sequence.
Think of it like reading a sentence: you process words one after another, and your understanding at any point depends on the words you've already read. The RNN's hidden state tries to capture this evolving context.
At its core, an RNN cell performs the same computation at each time step, but its internal state changes based on the input sequence. If we have an input sequence $x = (x_1, x_2, \dots, x_T)$, the RNN updates its hidden state $h_t$ at time step $t$ using the current input $x_t$ and the previous hidden state $h_{t-1}$. A common formulation for this update is:
$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$

Here:

- $h_t$ is the new hidden state at time step $t$ and $h_{t-1}$ is the hidden state from the previous step.
- $x_t$ is the input at time step $t$.
- $W_{hh}$ is the hidden-to-hidden weight matrix and $W_{xh}$ is the input-to-hidden weight matrix.
- $b_h$ is the hidden bias vector.
- $\tanh$ is the nonlinear activation, which keeps the hidden state values in a bounded range.
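To make this concrete, here is a minimal NumPy sketch of a single hidden-state update. The sizes, the random weight initialization, and the variable names are illustrative assumptions for this sketch, not part of any particular library.

```python
import numpy as np

# Illustrative dimensions (assumptions for this sketch).
input_size, hidden_size = 4, 3

rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden weights
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input-to-hidden weights
b_h = np.zeros(hidden_size)                                    # hidden bias

def rnn_step(x_t, h_prev):
    """One RNN update: h_t = tanh(W_hh @ h_prev + W_xh @ x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

h_prev = np.zeros(hidden_size)      # initial hidden state, commonly all zeros
x_t = rng.normal(size=input_size)   # one input element (e.g., a word embedding)
h_t = rnn_step(x_t, h_prev)
print(h_t.shape)  # (3,)
```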
Optionally, the RNN can also produce an output $y_t$ at each time step, often calculated from the hidden state:

$$y_t = W_{hy} h_t + b_y$$

where $W_{hy}$ is the hidden-to-output weight matrix and $b_y$ is the output bias. Whether an output is needed at every step depends on the specific task (e.g., predicting the next word at each step vs. classifying the entire sequence).
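The per-step output is just an affine map of the current hidden state. In this small self-contained sketch, the hidden state is a random stand-in for one computed as above, and output_size is an arbitrary assumption (for next-word prediction it would typically be the vocabulary size).

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, output_size = 3, 5   # illustrative sizes (assumptions)

W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden-to-output weights
b_y = np.zeros(output_size)                                    # output bias

h_t = rng.normal(size=hidden_size)  # stand-in for a hidden state computed as above
y_t = W_hy @ h_t + b_y              # raw output scores at time step t
print(y_t.shape)  # (5,)
```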
It's often easier to visualize an RNN by "unrolling" it through time. Imagine making a separate copy of the network for each time step, with the hidden state passed from one copy to the next.
An RNN unrolled through time. The same RNN cell (with shared weights) processes input $x_t$ and the previous hidden state $h_{t-1}$ to produce the new hidden state $h_t$ and output $y_t$. The hidden state flows from one time step to the next.
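The unrolled view corresponds directly to a loop in code: one set of weights is reused at every time step while the hidden state is carried forward. The sketch below, with arbitrary sizes and random inputs chosen purely for illustration, processes a toy sequence and collects the hidden states and outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, input_size, hidden_size, output_size = 6, 4, 3, 5  # illustrative sizes

# One set of weights, shared across all time steps (the "same cell" in the unrolled view).
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

xs = rng.normal(size=(seq_len, input_size))  # toy input sequence x_1 ... x_T

h = np.zeros(hidden_size)  # h_0: initial hidden state
hidden_states, outputs = [], []
for x_t in xs:                                  # the unrolled loop over time steps
    h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)    # carry the hidden state forward
    y_t = W_hy @ h + b_y                        # optional per-step output
    hidden_states.append(h)
    outputs.append(y_t)

print(len(hidden_states), outputs[-1].shape)  # 6 (5,)
```

Note that the loop is inherently sequential: the hidden state at step $t$ cannot be computed until step $t-1$ has finished, a property that becomes important in the next section.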
This recurrent structure allows RNNs, in principle, to model dependencies spanning arbitrary distances in a sequence, since information can propagate through the hidden states from one step to the next. For many years, RNNs (and their more sophisticated variants, such as LSTMs and GRUs, designed to better handle long-range dependencies) were the standard choice for sequence modeling tasks.
However, as we will see in the next section, processing sequences purely sequentially like this comes with its own set of significant challenges, particularly when dealing with very long sequences or complex dependencies. These limitations motivated the development of alternative architectures, ultimately leading to the Transformer.