Feedforward networks, like the MLPs we've discussed, process inputs independently. If you feed an MLP the same input twice, it produces the same output, unaware of any previous interactions. This works well for many tasks, but what about data where order matters? Consider predicting the next word in a sentence, analyzing stock market trends, or transcribing speech. The meaning or prediction often depends heavily on what came before. Feedforward networks lack an inherent mechanism to remember past information in the sequence.
Recurrent Neural Networks (RNNs) are designed specifically to handle this kind of sequential information. The core idea is recurrence: processing sequence elements one by one while maintaining an internal memory, often called the hidden state.
Imagine reading a sentence. You don't process each word in isolation. Your understanding of the current word is influenced by the words you've already read. RNNs mimic this process. At each step (e.g., for each word in a sentence or each point in a time series), the RNN performs a calculation based on two things:

1. The current input element in the sequence (for example, the current word).
2. The hidden state carried over from the previous time step.

This calculation produces a new hidden state (h_t) which captures information from the current input and relevant context from the past. This hidden state h_t is then passed forward to the next time step (t+1), acting as the network's memory of what it has seen so far.

This "looping" mechanism, where the output of a step feeds back into the input of the next step (via the hidden state), is what makes the network recurrent. Crucially, the same set of weights and biases is used for the calculation at every time step. This parameter sharing makes RNNs efficient and allows them to generalize patterns across different positions in sequences of varying lengths.
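To make the looping concrete, here is a minimal sketch in NumPy; the dimensions, weight names, and the `rnn_step` helper are illustrative choices rather than a reference implementation. Note that one set of weights is created once and reused at every position, only the hidden state changes from step to step, and the same loop handles sequences of any length.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8     # illustrative sizes

# A single, shared set of parameters, reused at every time step.
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One recurrence step: combine the current input with the
    previous hidden state to produce the new hidden state."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

# A toy sequence of 5 input vectors; any length works with the same weights.
sequence = rng.normal(size=(5, input_size))

h = np.zeros(hidden_size)          # initial hidden state: no context yet
for x_t in sequence:
    h = rnn_step(x_t, h)           # the new state feeds into the next step

print(h.shape)                     # (8,) - a summary of everything seen so far
```

Because the parameters do not depend on the position t, this loop works unchanged whether the sequence has five elements or five thousand.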
We can represent the update of the hidden state at time step t mathematically. A common formulation applies an activation function (such as tanh or ReLU) to a combination of the current input and the previous hidden state:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$

Here:

- x_t is the input at time step t.
- h_{t-1} is the hidden state from the previous time step.
- W_xh and W_hh are the weight matrices applied to the input and to the previous hidden state, respectively.
- b_h is a bias vector.
- tanh is the activation function, which squashes the result into a bounded range (ReLU is another common choice).

The network might also produce an output y_t at each time step, often calculated based on the current hidden state:

$$y_t = g(W_{hy} h_t + b_y)$$

Where:

- W_hy is the weight matrix mapping the hidden state to the output.
- b_y is the output bias vector.
- g is an output activation function chosen for the task (e.g., softmax for classification).
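As a sketch of how these formulas translate into code, the snippet below computes a single time step with NumPy. The layer sizes are arbitrary illustrative values, and softmax stands in for the generic output activation g; each line mirrors one of the equations above.

```python
import numpy as np

def softmax(z):
    """Turn a score vector into probabilities (a common choice for g)."""
    z = z - z.max()               # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
input_size, hidden_size, output_size = 4, 8, 3   # illustrative sizes only

# Parameters corresponding to the symbols in the equations above.
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input  -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden -> output
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

x_t = rng.normal(size=input_size)      # current input
h_prev = np.zeros(hidden_size)         # previous hidden state h_{t-1}

# h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

# y_t = g(W_hy h_t + b_y), with g = softmax for a classification-style output
y_t = softmax(W_hy @ h_t + b_y)

print(h_t.shape, y_t.shape, y_t.sum())   # (8,) (3,) and probabilities summing to ~1
```

The earlier loop sketch simply wraps this same per-step computation and repeats it along the sequence.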
Visually, we can think of "unrolling" the recurrence over time. The diagram below shows a simple RNN unrolled for three time steps. Notice how the hidden state h is passed from one step to the next, carrying information along the sequence.
An RNN cell unrolled over time. The hidden state h acts as memory, carrying information from one time step to the next. The same weights (W_hh, W_xh, W_hy) are applied at each step.
This recurrent structure, centered around the evolving hidden state, allows RNNs to capture dependencies between elements in a sequence, making them suitable for tasks involving natural language, time series data, and other ordered inputs where context is significant.