Now that we have examined the core components of the Transformer architecture, specifically Multi-Head Self-Attention and the Position-wise Feed-Forward Network, let's put them together in this practical exercise. We will implement a complete Transformer Encoder Layer using TensorFlow's Keras API. This layer represents a fundamental building block that can be stacked multiple times to form the full encoder component of a Transformer model.
Recall that a single encoder layer performs two main operations:
1. Multi-Head Self-Attention over the input sequence.
2. A Position-wise Feed-Forward Network applied to each position independently.
Each of these operations is followed by a residual connection and layer normalization. Dropout is also applied within the layer for regularization.
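Both sub-layers are wrapped in the same post-norm "Add & Norm" pattern from the original Transformer paper. The short sketch below is purely conceptual; its arguments are generic placeholders rather than part of the implementation that follows.

# Conceptual "Add & Norm" wrapper applied to each sub-layer's output.
# dropout_layer and norm_layer stand in for tf.keras.layers.Dropout and
# tf.keras.layers.LayerNormalization instances.
def add_and_norm(x, sublayer_out, dropout_layer, norm_layer, training=False):
    sublayer_out = dropout_layer(sublayer_out, training=training)  # regularization
    return norm_layer(x + sublayer_out)  # residual connection + layer normalization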
We'll create a custom Keras layer by subclassing tf.keras.layers.Layer. This gives us the flexibility to define the internal structure and the forward pass computation precisely.
import tensorflow as tf
# Assume MultiHeadAttention and PositionWiseFeedForwardNetwork
# classes are defined as covered in previous sections.
# For completeness, here are simplified placeholder definitions:
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.d_model = d_model
        self.num_heads = num_heads
        # Simplified: In reality, contains Dense layers for Q, K, V, and output
        print(f"Placeholder MHA: d_model={d_model}, num_heads={num_heads}")

    def call(self, v, k, q, mask=None):
        # Simplified: Returns input query as placeholder for attention output
        # Actual implementation would compute scaled dot-product attention
        print("Placeholder MHA Call")
        # Output shape: (batch_size, seq_len_q, d_model)
        return q  # Placeholder
class PositionWiseFeedForwardNetwork(tf.keras.layers.Layer):
    def __init__(self, d_model, dff, **kwargs):
        super().__init__(**kwargs)
        self.d_model = d_model
        self.dff = dff
        # Simplified: In reality, contains two Dense layers
        print(f"Placeholder FFN: d_model={d_model}, dff={dff}")

    def call(self, x):
        # Simplified: Returns input as placeholder
        print("Placeholder FFN Call")
        # Output shape: (batch_size, seq_len, d_model)
        return x  # Placeholder
# --- Actual Encoder Layer Implementation ---
class TransformerEncoderLayer(tf.keras.layers.Layer):
    """
    Implements a single Transformer Encoder layer with Multi-Head Attention,
    Feed Forward Network, Layer Normalization, and Dropout.
    """
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1, **kwargs):
        """
        Initializes the Transformer Encoder Layer.

        Args:
            d_model: Dimensionality of the input and output (embedding dimension).
            num_heads: Number of attention heads.
            dff: Dimensionality of the inner-layer in the Feed Forward Network.
            dropout_rate: Float between 0 and 1. Fraction of the units to drop.
        """
        super().__init__(**kwargs)
        self.d_model = d_model
        self.num_heads = num_heads
        self.dff = dff
        self.dropout_rate = dropout_rate

        # Multi-Head Attention sub-layer
        self.mha = MultiHeadAttention(d_model, num_heads)

        # Position-wise Feed Forward Network sub-layer
        self.ffn = PositionWiseFeedForwardNetwork(d_model, dff)

        # Layer Normalization layers
        # Epsilon added for numerical stability
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        # Dropout layers
        self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout2 = tf.keras.layers.Dropout(dropout_rate)

    def call(self, x, training, mask=None):
        """
        Forward pass for the Transformer Encoder Layer.

        Args:
            x: Input tensor. Shape: (batch_size, input_seq_len, d_model)
            training: Boolean indicating if the layer should behave in training mode
                (apply dropout) or inference mode.
            mask: Optional mask for the attention mechanism.

        Returns:
            Output tensor. Shape: (batch_size, input_seq_len, d_model)
        """
        # 1. Multi-Head Attention (with residual connection and normalization)
        # Self-attention: query, key, and value are all the same input 'x'
        attn_output = self.mha(x, x, x, mask)  # Shape: (batch_size, input_seq_len, d_model)

        # Apply dropout to the attention output
        # Dropout is only applied during training
        attn_output = self.dropout1(attn_output, training=training)

        # Add residual connection and apply layer normalization
        # out1 = x + attn_output
        out1 = self.layernorm1(x + attn_output)  # Shape: (batch_size, input_seq_len, d_model)

        # 2. Feed Forward Network (with residual connection and normalization)
        ffn_output = self.ffn(out1)  # Shape: (batch_size, input_seq_len, d_model)

        # Apply dropout to the FFN output
        ffn_output = self.dropout2(ffn_output, training=training)

        # Add residual connection and apply layer normalization
        # out2 = out1 + ffn_output
        out2 = self.layernorm2(out1 + ffn_output)  # Shape: (batch_size, input_seq_len, d_model)

        return out2

    def get_config(self):
        """Serializes the layer configuration."""
        config = super().get_config()
        config.update({
            'd_model': self.d_model,
            'num_heads': self.num_heads,
            'dff': self.dff,
            'dropout_rate': self.dropout_rate
        })
        return config
In the __init__ method, we instantiate the necessary sub-layers: MultiHeadAttention, PositionWiseFeedForwardNetwork, two LayerNormalization layers, and two Dropout layers. The hyperparameters d_model, num_heads, dff, and dropout_rate control the behavior and capacity of the layer.
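One practical constraint to keep in mind: standard multi-head attention splits d_model evenly across the heads, so d_model must be divisible by num_heads. If you want to fail fast on bad hyperparameters, a small guard like the one below could be added inside __init__ (it is not part of the implementation above):

# Optional guard for __init__, assuming the usual head-splitting scheme
# where each head works with a slice of size d_model // num_heads:
if d_model % num_heads != 0:
    raise ValueError(
        f"d_model ({d_model}) must be divisible by num_heads ({num_heads})"
    )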
The call method defines the computation flow:
1. The input x is passed through the MultiHeadAttention layer. Since this is self-attention within the encoder, the query, key, and value inputs to the attention mechanism are all the same tensor x. Any necessary padding mask is passed along.
2. Dropout is applied to the attention output, controlled by the training argument, which ensures dropout is only active during model training.
3. A residual connection adds the original input to the attention output (x + attn_output), followed by layer normalization (self.layernorm1).
4. The result (out1) is passed through the PositionWiseFeedForwardNetwork.
5. Dropout is applied to the FFN output (self.dropout2).
6. A second residual connection adds out1 to the FFN output (out1 + ffn_output), followed by the second layer normalization (self.layernorm2).
7. The final output out2 has the same shape as the input x.
We also include a get_config method, which is good practice for custom Keras layers, allowing the layer to be easily saved and loaded.
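As a quick, minimal sketch of what this enables, the configuration dictionary can be used to rebuild an equivalent layer, and the class can be registered via custom_objects when loading a saved model that contains it (values and path below are illustrative):

# Rebuild a layer from its configuration
layer = TransformerEncoderLayer(d_model=512, num_heads=8, dff=2048, dropout_rate=0.1)
config = layer.get_config()                              # includes our custom hyperparameters
restored = TransformerEncoderLayer.from_config(config)   # equivalent, freshly built layer

# When loading a saved model that uses this layer:
# tf.keras.models.load_model(
#     "path/to/model",
#     custom_objects={"TransformerEncoderLayer": TransformerEncoderLayer}
# )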
The following diagram illustrates the data flow within the TransformerEncoderLayer.
Data flow within a single Transformer Encoder Layer, showing the Multi-Head Attention and Feed-Forward sub-layers, each followed by Dropout, a residual connection (Add), and Layer Normalization.
Let's create an instance of our TransformerEncoderLayer and pass some sample data through it to verify its operation and output shape.
# Define hyperparameters
batch_size = 64
input_seq_len = 50
d_model = 512 # Embedding dimension
num_heads = 8 # Number of attention heads
dff = 2048 # Hidden layer size in FFN
dropout_rate = 0.1
# Create a sample input tensor (e.g., sequence embeddings)
# Replace with actual data in a real scenario
sample_input = tf.random.uniform((batch_size, input_seq_len, d_model))
# Instantiate the encoder layer
# Note: Using the actual MHA and FFN implementations is needed for real results
# The placeholder versions used above will just print messages and pass data through.
# Assuming you have functional MHA and FFN classes available:
# encoder_layer = TransformerEncoderLayer(d_model, num_heads, dff, dropout_rate)
# For demonstration with placeholders:
print("Instantiating Encoder Layer with Placeholders:")
encoder_layer = TransformerEncoderLayer(d_model, num_heads, dff, dropout_rate, name="my_encoder_layer")
print("-" * 20)
# Pass the sample input through the layer
# Set training=False for inference mode (no dropout)
print("Running Encoder Layer Call (training=False):")
output_tensor = encoder_layer(sample_input, training=False)
print("-" * 20)
# Pass the sample input through the layer in training mode
print("Running Encoder Layer Call (training=True):")
output_tensor_train = encoder_layer(sample_input, training=True)
print("-" * 20)
# Check the output shape
print(f"Input shape: {sample_input.shape}")
print(f"Output shape (inference): {output_tensor.shape}")
print(f"Output shape (training): {output_tensor_train.shape}")
# Verify that the output shape exactly matches the input shape
assert sample_input.shape == output_tensor.shape
assert sample_input.shape == output_tensor_train.shape
print("\nEncoder layer created and tested successfully.")
# You can also inspect the layer's configuration
print("\nLayer Configuration:")
print(encoder_layer.get_config())
Running this code (assuming functional MultiHeadAttention and PositionWiseFeedForwardNetwork classes) will instantiate the encoder layer and process the sample input. The output shape should match the input shape (batch_size, input_seq_len, d_model), confirming that the layer processes the sequence while maintaining the dimensionality required for stacking multiple layers. You'll also see the placeholder messages if using the simplified versions shown earlier.
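If your input sequences include padding, you would also pass a padding mask so that attention ignores the padded positions. The snippet below is only a sketch: it assumes the MultiHeadAttention implementation from the earlier section follows the common convention of a mask broadcastable to (batch_size, 1, 1, seq_len) with 1.0 marking positions to be ignored, and it uses made-up token IDs with 0 as the padding ID.

# Hypothetical token IDs, 0 = padding (illustrative only)
token_ids = tf.constant([[7, 4, 12, 0, 0],
                         [3, 9, 0, 0, 0]])

# 1.0 where the token is padding, 0.0 elsewhere; add axes so the mask
# broadcasts over the attention logits: (batch_size, 1, 1, seq_len)
padding_mask = tf.cast(tf.math.equal(token_ids, 0), tf.float32)
padding_mask = padding_mask[:, tf.newaxis, tf.newaxis, :]

# Passed through to the attention sub-layer (embedded_inputs would have
# shape (batch_size, seq_len, d_model)):
# output = encoder_layer(embedded_inputs, training=False, mask=padding_mask)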
This hands-on implementation provides a concrete understanding of how the different components integrate within a Transformer Encoder Layer. Typically, a full Transformer Encoder consists of multiple instances of this layer stacked sequentially, where the output of one layer becomes the input to the next.
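A rough sketch of that stacking is shown below. Input embedding and positional encoding are omitted here, and the name num_layers is illustrative rather than something defined earlier in this section.

class TransformerEncoder(tf.keras.layers.Layer):
    """Stacks num_layers TransformerEncoderLayer instances sequentially."""
    def __init__(self, num_layers, d_model, num_heads, dff, dropout_rate=0.1, **kwargs):
        super().__init__(**kwargs)
        self.enc_layers = [
            TransformerEncoderLayer(d_model, num_heads, dff, dropout_rate)
            for _ in range(num_layers)
        ]

    def call(self, x, training, mask=None):
        # The output of each layer becomes the input to the next;
        # the shape stays (batch_size, input_seq_len, d_model) throughout.
        for enc_layer in self.enc_layers:
            x = enc_layer(x, training=training, mask=mask)
        return x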