When working with sequential data, such as text for natural language processing or time series for forecasting, raw data elements like words or discrete time-steps are not directly consumable by neural networks. These networks require numerical inputs. While one-hot encoding is a straightforward method to convert categorical data into vectors, it often leads to very high-dimensional and sparse representations, especially with large vocabularies (e.g., tens of thousands of unique words). Embedding layers offer a more effective and powerful alternative by transforming these discrete items into dense, lower-dimensional, and, importantly, learned vector representations.
These learned vectors, or embeddings, can capture semantic relationships between items in the vocabulary. For instance, in a well-trained language model, words with similar meanings might have embedding vectors that are close to each other in the vector space. This ability to represent meaning in a compact form is significant for the performance of models dealing with sequential data.
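As a quick illustration of the size difference, here is a sketch using Flux's one-hot utilities (the vocabulary size and dimension below are invented for illustration):

using Flux

onehot_vec = Flux.onehot(42, 1:50_000)  # 50_000-element one-hot vector: a single 1, the rest 0
emb = Flux.Embedding(50_000 => 100)     # lookup table: each item gets a 100-dimensional dense vector
dense_vec = emb(42)                     # 100-element Float32 vector for the same item
println((length(onehot_vec), length(dense_vec)))  # (50000, 100)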
The Flux.Embedding Layer

In Flux.jl, the Embedding layer, accessible as Flux.Embedding, is the primary tool for creating these dense vector representations. You typically initialize it by specifying two main arguments:
vocab_size: The total number of unique items in your vocabulary (e.g., the number of unique words if you are processing text).
embed_dim: The desired dimensionality of the embedding vectors. This is a hyperparameter you choose; common values range from 50 to 300, or even higher for very large vocabularies and complex tasks.

Internally, an Embedding layer is essentially a lookup table. It maintains a weight matrix of size embed_dim × vocab_size. When you pass an integer (representing the index of an item from your vocabulary) to this layer, it looks up the corresponding column of this matrix, and that column becomes the embedding vector for the input item. If you pass a sequence of integers, it returns the corresponding embedding vectors, one per index.
Let's see a simple example:
using Flux
# Assume a vocabulary of 100 unique words
# We want to represent each word as a 10-dimensional vector
vocab_size = 100
embed_dim = 10
# Create the embedding layer
embedding_layer = Flux.Embedding(vocab_size => embed_dim)
# Example: Get embeddings for words with indices 5, 12, and 1
# Input should be integers or an array of integers
input_indices = [5, 12, 1]
output_vectors = embedding_layer(input_indices)
println("Shape of output: ", size(output_vectors))
# Expected output: Shape of output: (10, 3)
# This means we get three 10-dimensional vectors, one for each input index.
# Flux typically outputs features in the first dimension.
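Because the layer is a lookup table, these output vectors are just columns of the layer's weight matrix, which you can check directly through the layer's weight field:

println(size(embedding_layer.weight))  # (10, 100), i.e. (embed_dim, vocab_size)
@assert output_vectors == embedding_layer.weight[:, input_indices]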
The weights of this lookup table (the embedding matrix) are initialized randomly, just like weights in other neural network layers. During the training process, these weights are adjusted via backpropagation to minimize the loss function of the overall model. This allows the network to learn meaningful representations for the items in the vocabulary based on the task at hand.
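To make this concrete, here is a minimal sketch of inspecting the gradient that backpropagation produces for the layer above; toy_loss is an invented stand-in for a real model's loss:

toy_loss(layer, idxs) = sum(abs2, layer(idxs))  # hypothetical objective, for illustration only

grads = Flux.gradient(l -> toy_loss(l, input_indices), embedding_layer)
println(size(grads[1].weight))  # (10, 100): one gradient entry per weight
# Only the columns for the looked-up indices (5, 12, and 1) are nonzero,
# so an optimiser would update exactly those embedding vectors.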
Before you can use an Embedding layer, your sequential data (e.g., sentences) needs to be converted into a sequence of integer indices. This process generally involves:

1. Tokenization: splitting the raw data into discrete tokens (e.g., words).
2. Building a vocabulary: assigning a unique integer index to each unique token.
3. Integer encoding: replacing each token with its vocabulary index.
For instance, the sentence "julia is fast" might be tokenized into ["julia", "is", "fast"]. If your vocabulary maps "julia" to 7, "is" to 4, and "fast" to 22, then the integer-encoded sequence would be [7, 4, 22]. This is the type of input the Embedding layer expects.
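Here is a minimal sketch of these steps in plain Julia; the vocabulary dictionary is invented to match the indices in the example above:

sentence = "julia is fast"
tokens = split(sentence)                            # ["julia", "is", "fast"]
vocab = Dict("julia" => 7, "is" => 4, "fast" => 22) # toy vocabulary for this example
encoded = [vocab[String(t)] for t in tokens]        # [7, 4, 22]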
Embedding layers are most commonly used as the first layer in networks that process sequential data, particularly Recurrent Neural Networks (RNNs), LSTMs, or GRUs. The Embedding layer converts the integer-encoded sequence into a sequence of dense vectors, which then serves as the input to the subsequent recurrent layers.
Here's a diagram illustrating this flow:

[Diagram: input integers are transformed by the Embedding layer into dense vectors, which are then processed by recurrent layers.]

In Flux.jl, you would combine these layers using a Chain:
using Flux

vocab_size = 10_000  # Example vocabulary size
embed_dim = 128      # Embedding dimension
hidden_dim = 256     # LSTM hidden dimension
output_dim = 10      # Example output dimension (e.g., for 10-class classification)

# Assumes a recent Flux (v0.16+), where recurrent layers map over whole
# (features, seq_len, batch) arrays and return one hidden state per time step.
model = Chain(
    Flux.Embedding(vocab_size => embed_dim),  # (seq_len, batch) indices -> (embed_dim, seq_len, batch)
    Flux.LSTM(embed_dim => hidden_dim),       # -> (hidden_dim, seq_len, batch)
    # For sequence classification, keep only the output at the final time step:
    x -> x[:, end, :],                        # -> (hidden_dim, batch)
    Flux.Dense(hidden_dim => output_dim),     # -> (output_dim, batch)
)
# Example: a batch of 3 sequences, each of length 5
sample_input_indices = rand(1:vocab_size, 5, 3)  # (seq_len, batch_size)
output = model(sample_input_indices)
println("Model output shape: ", size(output))
# Model output shape: (10, 3) -- one output_dim vector per sequence in the batch
In the example above, Flux.LSTM handles the sequence of embeddings directly, producing one hidden state per time step. How you use these outputs depends on the task: for classification based on the entire sequence, you keep only the final time step before the Dense layer, which is what the x -> x[:, end, :] step above does; for sequence-to-sequence tasks, you use the outputs at every step, as sketched below.
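For the sequence-to-sequence case, a minimal sketch (reusing the dimensions and sample input defined above) is to drop the last-step selection; Flux's Dense layer acts on the first dimension of higher-dimensional arrays, so it scores every time step:

seq2seq_model = Chain(
    Flux.Embedding(vocab_size => embed_dim),  # (5, 3) indices -> (128, 5, 3)
    Flux.LSTM(embed_dim => hidden_dim),       # -> (256, 5, 3)
    Flux.Dense(hidden_dim => output_dim),     # applied at each step -> (10, 5, 3)
)
println(size(seq2seq_model(sample_input_indices)))  # (10, 5, 3)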
Using embeddings offers several advantages:

Compact representations: dense, low-dimensional vectors replace sparse, high-dimensional one-hot encodings.
Learned semantics: because the vectors are trained with the model, items used in similar ways can end up close together in the embedding space.
End-to-end training: the embedding matrix is optimized by backpropagation for the task at hand, just like any other layer's weights.
When working with embeddings, keep these points in mind:

Embedding dimension (embed_dim): This is an important hyperparameter. A larger dimension allows for more expressive representations but increases model size and computational cost. Too small a dimension might not capture enough information.
Vocabulary size (vocab_size): This directly affects the size of the embedding matrix (embed_dim × vocab_size), so large vocabularies require more memory; a quick estimate is sketched after this list.
Unknown tokens: items not seen when the vocabulary was built are typically mapped to a special unknown token (<UNK>) which has its own learned embedding.
Pretrained embeddings: rather than learning vectors from scratch, you can initialize the Embedding layer with pretrained vectors. This topic, along with transfer learning, will be discussed in more detail in Chapter 5.
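To get a feel for that memory cost, here is a rough back-of-the-envelope sketch; the vocabulary size and dimension are invented for illustration:

vocab_size = 50_000
embed_dim = 300
n_params = vocab_size * embed_dim      # 15_000_000 trainable parameters
n_bytes = n_params * sizeof(Float32)   # 60_000_000 bytes for Float32 weights
println(n_bytes / 1024^2, " MiB")      # ≈ 57.2 MiB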
Embeddings are a fundamental component in modern deep learning models for sequential data. By transforming discrete symbols into meaningful continuous vector spaces, they enable neural networks to process and understand complex sequences effectively. As you build more sophisticated architectures like CNNs for text or advanced RNNs, the Embedding layer will often be your starting point for handling symbolic input.