When working with sequential data, such as text for natural language processing or time series for forecasting, raw data elements like words or discrete time-steps are not directly consumable by neural networks. These networks require numerical inputs. While one-hot encoding is a straightforward method to convert categorical data into vectors, it often leads to very high-dimensional and sparse representations, especially with large vocabularies (e.g., tens of thousands of unique words). Embedding layers offer a more effective and powerful alternative by transforming these discrete items into dense, lower-dimensional, and, importantly, learned vector representations.
These learned vectors, or embeddings, can capture semantic relationships between items in the vocabulary. For instance, in a well-trained language model, words with similar meanings might have embedding vectors that are close to each other in the vector space. This ability to represent meaning in a compact form is significant for the performance of models dealing with sequential data.
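As a quick illustration of the size difference, here is a sketch using Flux's one-hot utilities (the vocabulary size and dimension below are invented for illustration):

using Flux

onehot_vec = Flux.onehot(42, 1:50_000)  # 50_000-element one-hot vector: a single 1, the rest 0
emb = Flux.Embedding(50_000 => 100)     # lookup table: each item gets a 100-dimensional dense vector
dense_vec = emb(42)                     # 100-element Float32 vector for the same item
println((length(onehot_vec), length(dense_vec)))  # (50000, 100)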
The Flux.Embedding Layer

In Flux.jl, the Embedding layer, accessible as Flux.Embedding, is the primary tool for creating these dense vector representations. You typically initialize it by specifying two main arguments:
vocab_size: The total number of unique items in your vocabulary (e.g., the number of unique words if you are processing text).
embed_dim: The desired dimensionality of the embedding vectors. This is a hyperparameter you choose; common values range from 50 to 300, or even higher for very large vocabularies and complex tasks.

Internally, an Embedding layer is essentially a lookup table. It maintains a weight matrix of size embed_dim × vocab_size. When you pass an integer (representing the index of an item from your vocabulary) to this layer, it looks up the corresponding column of this matrix, and that column becomes the embedding vector for the input item. If you pass a sequence of integers, it returns the corresponding embedding vectors, one per index.
Let's see a simple example:
using Flux
# Assume a vocabulary of 100 unique words
# We want to represent each word as a 10-dimensional vector
vocab_size = 100
embed_dim = 10
# Create the embedding layer
embedding_layer = Flux.Embedding(vocab_size => embed_dim)
# Example: Get embeddings for words with indices 5, 12, and 1
# Input should be integers or an array of integers
input_indices = [5, 12, 1]
output_vectors = embedding_layer(input_indices)
println("Shape of output: ", size(output_vectors))
# Expected output: Shape of output: (10, 3)
# This means we get three 10-dimensional vectors, one for each input index.
# Flux typically outputs features in the first dimension.
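Because the layer is a lookup table, these output vectors are just columns of the layer's weight matrix, which you can check directly through the layer's weight field:

println(size(embedding_layer.weight))  # (10, 100), i.e. (embed_dim, vocab_size)
@assert output_vectors == embedding_layer.weight[:, input_indices]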
The weights of this lookup table (the embedding matrix) are initialized randomly, just like weights in other neural network layers. During the training process, these weights are adjusted via backpropagation to minimize the loss function of the overall model. This allows the network to learn meaningful representations for the items in the vocabulary based on the task at hand.
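To make this concrete, here is a minimal sketch of inspecting the gradient that backpropagation produces for the layer above; toy_loss is an invented stand-in for a real model's loss:

toy_loss(layer, idxs) = sum(abs2, layer(idxs))  # hypothetical objective, for illustration only

grads = Flux.gradient(l -> toy_loss(l, input_indices), embedding_layer)
println(size(grads[1].weight))  # (10, 100): one gradient entry per weight
# Only the columns for the looked-up indices (5, 12, and 1) are nonzero,
# so an optimiser would update exactly those embedding vectors.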
Before you can use an Embedding layer, your sequential data (e.g., sentences) needs to be converted into a sequence of integer indices. This process generally involves:

1. Tokenization: splitting the raw data into discrete tokens (e.g., words).
2. Building a vocabulary: assigning a unique integer index to each unique token.
3. Integer encoding: replacing each token with its vocabulary index.
For instance, the sentence "julia is fast" might be tokenized into ["julia", "is", "fast"]. If your vocabulary maps "julia" to 7, "is" to 4, and "fast" to 22, then the integer-encoded sequence would be [7, 4, 22]. This is the type of input the Embedding layer expects.
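Here is a minimal sketch of these steps in plain Julia; the vocabulary dictionary is invented to match the indices in the example above:

sentence = "julia is fast"
tokens = split(sentence)                            # ["julia", "is", "fast"]
vocab = Dict("julia" => 7, "is" => 4, "fast" => 22) # toy vocabulary for this example
encoded = [vocab[String(t)] for t in tokens]        # [7, 4, 22]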
Embedding layers are most commonly used as the first layer in networks that process sequential data, particularly Recurrent Neural Networks (RNNs), LSTMs, or GRUs. The Embedding layer converts the integer-encoded sequence into a sequence of dense vectors, which then serves as the input to the subsequent recurrent layers.
Here's a diagram illustrating this flow:

[Diagram: input integers are transformed by the Embedding layer into dense vectors, which are then processed by recurrent layers.]

In Flux.jl, you would combine these layers using a Chain:
using Flux

vocab_size = 10_000  # Example vocabulary size
embed_dim = 128      # Embedding dimension
hidden_dim = 256     # LSTM hidden dimension
output_dim = 10      # Example output dimension (e.g., for 10-class classification)

# Assumes a recent Flux (v0.16+), where recurrent layers map over whole
# (features, seq_len, batch) arrays and return one hidden state per time step.
model = Chain(
    Flux.Embedding(vocab_size => embed_dim),  # (seq_len, batch) indices -> (embed_dim, seq_len, batch)
    Flux.LSTM(embed_dim => hidden_dim),       # -> (hidden_dim, seq_len, batch)
    # For sequence classification, keep only the output at the final time step:
    x -> x[:, end, :],                        # -> (hidden_dim, batch)
    Flux.Dense(hidden_dim => output_dim),     # -> (output_dim, batch)
)
# Example: a batch of 3 sequences, each of length 5
sample_input_indices = rand(1:vocab_size, 5, 3)  # (seq_len, batch_size)
output = model(sample_input_indices)
println("Model output shape: ", size(output))
# Model output shape: (10, 3) -- one output_dim vector per sequence in the batch
In the example above, Flux.LSTM handles the sequence of embeddings directly, producing one hidden state per time step. How you use these outputs depends on the task: for classification based on the entire sequence, you keep only the final time step before the Dense layer, which is what the x -> x[:, end, :] step above does; for sequence-to-sequence tasks, you use the outputs at every step, as sketched below.
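For the sequence-to-sequence case, a minimal sketch (reusing the dimensions and sample input defined above) is to drop the last-step selection; Flux's Dense layer acts on the first dimension of higher-dimensional arrays, so it scores every time step:

seq2seq_model = Chain(
    Flux.Embedding(vocab_size => embed_dim),  # (5, 3) indices -> (128, 5, 3)
    Flux.LSTM(embed_dim => hidden_dim),       # -> (256, 5, 3)
    Flux.Dense(hidden_dim => output_dim),     # applied at each step -> (10, 5, 3)
)
println(size(seq2seq_model(sample_input_indices)))  # (10, 5, 3)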
Using embeddings offers several advantages:

Compact representations: dense, low-dimensional vectors replace sparse, high-dimensional one-hot encodings.
Learned semantics: because the vectors are trained with the model, items used in similar ways can end up close together in the embedding space.
End-to-end training: the embedding matrix is optimized by backpropagation for the task at hand, just like any other layer's weights.
When working with embeddings, keep these points in mind:

Embedding dimension (embed_dim): This is an important hyperparameter. A larger dimension allows for more expressive representations but increases model size and computational cost. Too small a dimension might not capture enough information.
Vocabulary size (vocab_size): This directly affects the size of the embedding matrix (embed_dim × vocab_size), so large vocabularies require more memory; a quick estimate is sketched after this list.
Unknown tokens: items not seen when the vocabulary was built are typically mapped to a special unknown token (<UNK>) which has its own learned embedding.
Pretrained embeddings: rather than learning vectors from scratch, you can initialize the Embedding layer with pretrained vectors. This topic, along with transfer learning, will be discussed in more detail in Chapter 5.
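To get a feel for that memory cost, here is a rough back-of-the-envelope sketch; the vocabulary size and dimension are invented for illustration:

vocab_size = 50_000
embed_dim = 300
n_params = vocab_size * embed_dim      # 15_000_000 trainable parameters
n_bytes = n_params * sizeof(Float32)   # 60_000_000 bytes for Float32 weights
println(n_bytes / 1024^2, " MiB")      # ≈ 57.2 MiB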
Embeddings are a fundamental component in modern deep learning models for sequential data. By transforming discrete symbols into meaningful continuous vector spaces, they enable neural networks to process and understand complex sequences effectively. As you build more sophisticated architectures like CNNs for text or advanced RNNs, the Embedding layer will often be your starting point for handling symbolic input.