While Multilayer Perceptrons (MLPs) are great for tabular data and Convolutional Neural Networks (CNNs) excel with grid-like data such as images, many real-world problems involve sequences. Think of sentences, stock prices over time, or sensor readings. For these, we need models that can understand order and context across time or sequence steps. This is where Recurrent Neural Networks (RNNs) and their more advanced variants like Long Short-Term Memory (LSTM) units come into play.
Unlike feedforward networks, RNNs have loops, allowing information to persist from one step of the sequence to the next. This "memory" is what enables them to learn dependencies across sequence elements.
At the heart of an RNN is a recurrent cell. This cell processes an input at the current time step (or sequence position) and combines it with a hidden state from the previous time step. This hidden state acts as the network's memory, carrying information from earlier parts of the sequence. The cell then produces an output for the current time step and updates its hidden state to be passed to the next time step.
An RNN cell processes the current input and the previous hidden state to produce an output and an updated hidden state.
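To make this concrete, here is a minimal hand-rolled sketch of the recurrence itself. The weight names (Wx, Wh, b) are illustrative, not Flux internals: the same parameters are applied at every step, and the hidden state is the only thing carried forward.
# A bare-bones recurrent update: the new hidden state depends on the current input
# and the previous hidden state. Wx, Wh, and b are illustrative names, not Flux internals.
input_size, hidden_size = 10, 20
Wx = randn(Float32, hidden_size, input_size)    # input-to-hidden weights
Wh = randn(Float32, hidden_size, hidden_size)   # hidden-to-hidden weights
b  = zeros(Float32, hidden_size)                # bias
rnn_step(x, h_prev) = tanh.(Wx * x .+ Wh * h_prev .+ b)
sequence = [rand(Float32, input_size) for _ in 1:5]     # 5 time steps of 10 features each
h = zeros(Float32, hidden_size)                         # initial "memory"
for x in sequence
    global h = rnn_step(x, h)   # the same weights are reused at every step; the state carries forward
end
println("Final hidden state length: ", length(h))       # 20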
In Flux.jl, you can define a basic RNN cell using RNNCell. For processing an entire sequence, you typically wrap this cell with Recur; the convenience constructor Flux.RNN does exactly that.
using Flux
# Define input feature size and hidden state size
input_size = 10
hidden_size = 20
# Create a basic RNN layer
rnn_layer = Flux.RNN(input_size, hidden_size, tanh) # tanh is the default activation; relu or σ (sigmoid) also work
# Example input: a sequence of 5 items, each with 10 features, for a batch of 1
# For full sequences, recurrent layers take a vector with one (features, batch_size)
# matrix per time step; for a single step, just one (features, batch_size) matrix.
sample_sequence_batch = [rand(Float32, input_size, 1) for _ in 1:5] # 5 time steps, batch_size = 1
# For a batch of 3 sequences, each of length 5 and 10 features:
# sample_sequence_batch = [rand(Float32, input_size, 3) for _ in 1:5]
# To process a single step manually, you could use an RNNCell:
# rnn_cell = Flux.RNNCell(input_size, hidden_size, tanh)
# h0 = rnn_cell.state0 # initial hidden state (zeros, one column per batch item)
# h1, output_step1 = rnn_cell(h0, sample_sequence_batch[1]) # the cell returns (new_state, output)
# Processing the whole sequence with RNN layer
# Note: RNN layer handles hidden state internally when processing a sequence.
# To get hidden states at each step, you'd iterate manually or use a different approach.
output_sequence = rnn_layer.(sample_sequence_batch)
final_hidden_state = rnn_layer.state # Access final hidden state
# To reset the hidden state for a new sequence batch
Flux.reset!(rnn_layer)
println("Output of the last step (for the first item in batch): ", output_sequence[end][:, 1])
println("Final hidden state shape: ", size(final_hidden_state))
A common way to structure input for Flux's recurrent layers like RNN, LSTM, or GRU when processing entire sequences is a vector of matrices. Each matrix in the vector represents one time step across the whole batch, with dimensions (features, batch_size), and the vector itself has a length equal to the sequence length. Alternatively, for some layers or custom loops, you might use a 3D array of shape (features, batch_size, sequence_length).
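As a quick illustration of these layouts, the sketch below builds both forms for a hypothetical batch (all sizes are illustrative) and converts the 3D array into the vector-of-matrices form before feeding it to an RNN layer.
using Flux
input_size, hidden_size, batch_size, seq_len = 10, 20, 3, 5
# Layout 1: a vector of length seq_len, one (features, batch_size) matrix per time step.
seq_as_vector = [rand(Float32, input_size, batch_size) for _ in 1:seq_len]
# Layout 2: a 3D array, assumed here to be ordered (features, batch_size, sequence_length).
seq_as_array = rand(Float32, input_size, batch_size, seq_len)
# Slicing the 3D array along its last dimension recovers the vector-of-matrices layout.
seq_slices = [seq_as_array[:, :, t] for t in 1:seq_len]
rnn = Flux.RNN(input_size, hidden_size)
outputs = [rnn(x) for x in seq_slices]   # one (hidden_size, batch_size) output per step
Flux.reset!(rnn)
println(size(outputs[end]))              # (20, 3)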
Simple RNNs, while elegant, struggle with learning dependencies over long sequences. This is due to the vanishing or exploding gradient problem. During backpropagation, gradients can shrink exponentially (vanish) or grow exponentially (explode) as they are propagated back through many time steps. Vanishing gradients make it difficult for the network to learn connections between distant elements in a sequence, while exploding gradients can make training unstable.
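A toy calculation makes the effect visible: backpropagating through T time steps multiplies the gradient by roughly the same factor at each step (a scalar stand-in for the recurrent Jacobian), so a factor slightly below 1 drives it toward zero while a factor slightly above 1 blows it up.
# Toy illustration of vanishing and exploding gradients over T steps.
T = 50
println("Shrinking factor 0.9 over $T steps: ", 0.9^T)   # ≈ 0.005  -> vanishing
println("Growing factor 1.1 over $T steps:  ", 1.1^T)    # ≈ 117    -> exploding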
LSTMs were designed specifically to address the vanishing gradient problem and better capture long-range dependencies. They achieve this with a more complex cell structure that includes several "gates" controlling the flow of information.
An LSTM cell maintains a cell state (ct) in addition to the hidden state (ht). This cell state acts like a conveyor belt, allowing information to flow through relatively unchanged, which helps preserve gradients over long durations. The gates are the forget gate, which decides what to discard from the cell state; the input gate, which decides what new information to write into it; and the output gate, which decides how much of the (transformed) cell state to expose as the hidden state.
Simplified structure of an LSTM cell showing the gates and cell state interactions. The cell state (ct) acts as a conveyor belt, modified by the forget and input gates. The output gate filters the cell state to produce the hidden state (ht).
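To see how these gates interact, here is a minimal hand-rolled sketch of a single LSTM step. The weight matrices and their names are illustrative only, not Flux's internal parameterization.
using Flux: σ  # logistic sigmoid; the rest is plain Julia
input_size, hidden_size = 10, 20
# Illustrative parameters for one LSTM step (not Flux's internal layout).
Wf, Uf, bf = randn(Float32, hidden_size, input_size), randn(Float32, hidden_size, hidden_size), zeros(Float32, hidden_size)
Wi, Ui, bi = randn(Float32, hidden_size, input_size), randn(Float32, hidden_size, hidden_size), zeros(Float32, hidden_size)
Wo, Uo, bo = randn(Float32, hidden_size, input_size), randn(Float32, hidden_size, hidden_size), zeros(Float32, hidden_size)
Wc, Uc, bc = randn(Float32, hidden_size, input_size), randn(Float32, hidden_size, hidden_size), zeros(Float32, hidden_size)
function lstm_step(x, h_prev, c_prev)
    f = σ.(Wf * x .+ Uf * h_prev .+ bf)          # forget gate: what to discard from the cell state
    i = σ.(Wi * x .+ Ui * h_prev .+ bi)          # input gate: what new information to write
    o = σ.(Wo * x .+ Uo * h_prev .+ bo)          # output gate: what to expose as the hidden state
    c_tilde = tanh.(Wc * x .+ Uc * h_prev .+ bc) # candidate cell update
    c = f .* c_prev .+ i .* c_tilde              # cell state: the "conveyor belt"
    h = o .* tanh.(c)                            # hidden state: filtered view of the cell state
    return h, c
end
h, c = zeros(Float32, hidden_size), zeros(Float32, hidden_size)
h, c = lstm_step(rand(Float32, input_size), h, c)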
In Flux.jl, creating an LSTM layer is straightforward:
using Flux
input_size = 10
hidden_size = 20 # This is the size of h_t and also c_t
# Create an LSTM layer
lstm_layer = Flux.LSTM(input_size, hidden_size)
# Example input sequence (vector of 5 matrices, each for one time step)
# Each matrix: (features, batch_size)
# Here, batch_size = 1 for simplicity
sample_sequence = [rand(Float32, input_size, 1) for _ in 1:5]
# Process the sequence
output_lstm_sequence = lstm_layer.(sample_sequence)
# The lstm_layer.state contains a tuple (h, c) for the final hidden and cell states
final_hidden_state_h, final_cell_state_c = lstm_layer.state
println("Output of LSTM at last step: ", size(output_lstm_sequence[end]))
println("Final hidden state (h) shape: ", size(final_hidden_state_h))
println("Final cell state (c) shape: ", size(final_cell_state_c))
# Reset for next batch/sequence
Flux.reset!(lstm_layer)
Flux's LSTM layer, like RNN, handles the state automatically when processing a sequence represented as a vector of inputs (one for each time step).
GRUs are a newer generation of recurrent units, introduced by Cho et al. in 2014. They are similar to LSTMs but have a simpler architecture, combining the forget and input gates into a single "update gate" and merging the cell state and hidden state. Despite their simplicity, GRUs often perform comparably to LSTMs on many tasks and can be computationally faster.
A GRU cell has two main gates: an update gate, which controls how much of the previous hidden state is kept versus replaced with new content, and a reset gate, which controls how much of the previous hidden state is used when computing the candidate state.
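As with the LSTM above, a minimal hand-rolled sketch of one GRU step shows how the two gates act; the parameter names are illustrative, not Flux's internals.
using Flux: σ
input_size, hidden_size = 10, 20
# Illustrative parameters for one GRU step (not Flux's internal layout).
Wz, Uz, bz = randn(Float32, hidden_size, input_size), randn(Float32, hidden_size, hidden_size), zeros(Float32, hidden_size)
Wr, Ur, br = randn(Float32, hidden_size, input_size), randn(Float32, hidden_size, hidden_size), zeros(Float32, hidden_size)
Wn, Un, bn = randn(Float32, hidden_size, input_size), randn(Float32, hidden_size, hidden_size), zeros(Float32, hidden_size)
function gru_step(x, h_prev)
    z = σ.(Wz * x .+ Uz * h_prev .+ bz)                  # update gate: keep old state vs. take new content
    r = σ.(Wr * x .+ Ur * h_prev .+ br)                  # reset gate: how much of the past to use
    h_tilde = tanh.(Wn * x .+ Un * (r .* h_prev) .+ bn)  # candidate hidden state
    return (1 .- z) .* h_prev .+ z .* h_tilde            # blend previous and candidate states
end
h = gru_step(rand(Float32, input_size), zeros(Float32, hidden_size))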
Flux provides GRU and GRUCell for building models with Gated Recurrent Units:
using Flux
input_size = 10
hidden_size = 20
# Create a GRU layer
gru_layer = Flux.GRU(input_size, hidden_size)
# Example input sequence
sample_sequence = [rand(Float32, input_size, 1) for _ in 1:5] # batch_size = 1
# Process the sequence
output_gru_sequence = gru_layer.(sample_sequence)
# The gru_layer.state contains the final hidden state
final_hidden_state = gru_layer.state
println("Output of GRU at last step: ", size(output_gru_sequence[end]))
println("Final hidden state shape: ", size(final_hidden_state))
# Reset for next batch/sequence
Flux.reset!(gru_layer)
RNNs, LSTMs, and GRUs form the core of models designed for sequential data. They are often combined with other layer types. A Flux.Embedding layer (covered in the "Working with Embeddings for Sequential Data" section) is typically used first to convert discrete tokens into dense vector representations, and Flux.Dense layers are often used to transform the final hidden state (or a combination of hidden states) into the desired output format (e.g., class probabilities for classification, a continuous value for regression).
Here's an example of a simple sequence-to-one model structure, perhaps for sentiment classification where the input is a sequence of word embeddings and the output is a single sentiment score:
using Flux
vocab_size = 1000 # Number of unique words in vocabulary
embed_size = 50 # Dimension of word embeddings
hidden_size = 64 # LSTM hidden state size
output_size = 1 # Single output for regression (or num_classes for classification)
model = Chain(
    Embedding(vocab_size, embed_size),  # input: integer word indices
    LSTM(embed_size, hidden_size),
    x -> x[end],                        # keep only the output of the last time step
    # Flux's LSTM applied to a sequence of inputs yields one output per time step;
    # a common sequence-to-one pattern is to pass only the final output h_T onward.
    Dense(hidden_size, output_size),
    # sigmoid                           # uncomment for binary classification or outputs in [0, 1]
)
# Example: a sequence of 10 word indices for a single batch item
sample_input_indices = [rand(1:vocab_size) for _ in 1:10]  # Vector of integers
# The Embedding layer maps integer indices to embedding vectors, but the LSTM expects
# a vector of (features, batch_size) matrices, one per time step. So we embed each
# index separately, giving one (embed_size, 1) matrix per step.
# 1. Embed each word index
embedded_sequence = [model[1]([idx]) for idx in sample_input_indices]  # Vector of (embed_size, 1) matrices
# 2. Pass through the LSTM, one step at a time
lstm_output_sequence = model[2].(embedded_sequence)
Flux.reset!(model[2])  # Reset LSTM state after processing the sequence
# 3. Take the output of the last time step
last_lstm_output = model[3](lstm_output_sequence)
# 4. Pass through the Dense layer
final_output = model[4](last_lstm_output)
# To train this, you'd typically work with batches of sequences.
# DataLoaders from MLUtils.jl (discussed in "Handling Datasets") are essential here.
println("Output shape: ", size(final_output))  # Should be (output_size, 1)
This example shows one way to piece things together. The exact handling of sequence inputs and outputs (e.g., taking only the last hidden state versus using all hidden states) depends on the specific task. For sequence-to-sequence tasks (like machine translation), the architecture would be more complex, often involving an encoder-decoder structure.
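As a rough sketch of what batching might look like for the model above, assuming equal-length, pre-padded index sequences and reusing the model and vocab_size defined in the previous example, MLUtils' DataLoader can slice a matrix of sequences into mini-batches that are then embedded and fed to the LSTM one time step at a time. The sizes and the batch_loss helper are illustrative.
using Flux, MLUtils
# Hypothetical pre-padded data: each column of X is one sequence of word indices.
seq_len, num_samples, batch_size = 10, 64, 8
X = rand(1:vocab_size, seq_len, num_samples)
y = rand(Float32, 1, num_samples)          # one target value per sequence
loader = DataLoader((X, y); batchsize = batch_size, shuffle = true)
function batch_loss(m, xb, yb)
    Flux.reset!(m[2])                                    # clear the LSTM state for this batch
    embedded = [m[1](xb[t, :]) for t in 1:size(xb, 1)]   # one (embed_size, batch) matrix per step
    outputs  = [m[2](e) for e in embedded]               # run the LSTM step by step
    Flux.Losses.mse(m[4](outputs[end]), yb)              # last hidden state -> Dense -> loss
end
for (xb, yb) in loader
    @show batch_loss(model, xb, yb)   # `model` is the Chain defined above
    break                             # a single batch, just for illustration
end
In practice you would wrap this in a full training loop with gradients and an optimiser, but the data-handling pattern stays the same.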
When working with these recurrent layers in Flux, remember:
- For Flux.LSTM or Flux.GRU processing full sequences, the input is often a Vector where each element is a matrix of size (features, batch_size) representing one time step.
- Flux.reset!(layer) is important for clearing the hidden state between independent sequences (e.g., between batches).
- MLUtils.jl will be your friend for batching and iterating over sequence data efficiently.
RNNs, LSTMs, and GRUs are powerful tools for modeling sequential patterns. While LSTMs and GRUs are generally preferred over vanilla RNNs due to their ability to handle longer dependencies, understanding the basic recurrent mechanism is fundamental. As you progress, you'll encounter variations and more advanced architectures like Transformers, but these gated recurrent units remain important components in the deep learning toolkit for sequential data.