A Transformer encoder layer combines multi-head self-attention for capturing contextual relationships, a position-wise feed-forward network for transforming representations, and the essential Add & Norm steps for stabilization and gradient flow. In this practical section, we assemble these components into a working encoder layer using Python and a deep learning framework such as PyTorch or TensorFlow.

This exercise assumes you have access to an implementation of Multi-Head Attention (perhaps built in the previous chapter's practical section) and understand basic class definitions and tensor operations within your chosen framework. We'll focus on structuring the EncoderLayer module itself.

## Core Components Recap

Before we write the code, let's quickly recall the sub-layers within a single encoder layer:

1. **Multi-Head Self-Attention:** Takes the input sequence embeddings and computes attention scores between all pairs of positions, producing an output where each position's representation is influenced by the others based on relevance.
2. **Add & Norm (1):** Adds the original input to the output of the self-attention sub-layer (a residual connection) and then applies layer normalization. This helps mitigate vanishing gradients and stabilizes activations.
3. **Position-wise Feed-Forward Network (FFN):** A simple network (usually two linear layers with a ReLU or GELU activation in between) applied independently to each position in the sequence. It further processes the representations.
4. **Add & Norm (2):** Adds the input of the FFN to its output (another residual connection) and applies layer normalization again.

Dropout is also typically applied after the multi-head attention output and after the FFN output to prevent overfitting during training.

## Defining the Position-wise Feed-Forward Network

This is a straightforward component. It consists of two linear transformations with a non-linear activation function in between. Often, the first linear layer expands the dimension and the second compresses it back to the original model dimension ($d_{model}$). A common expansion factor is 4.

Let's represent this (using PyTorch-like syntax):

```python
import torch
import torch.nn as nn


class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        """
        Initializes the Position-wise Feed-Forward Network.

        Args:
            d_model (int): Dimensionality of the input and output.
            d_ff (int): Dimensionality of the inner layer.
            dropout (float): Dropout probability.
        """
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.activation = nn.ReLU()  # Or nn.GELU()
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        """
        Forward pass through the FFN.

        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, seq_len, d_model).

        Returns:
            torch.Tensor: Output tensor of shape (batch_size, seq_len, d_model).
        """
        x = self.linear_1(x)
        x = self.activation(x)
        x = self.dropout(x)  # Dropout often applied after activation
        x = self.linear_2(x)
        return x
```

This PositionWiseFeedForward module takes a tensor of shape (batch_size, seq_len, d_model) and returns a tensor of the same shape, having applied the transformations independently at each sequence position.
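As a quick sanity check, a minimal sketch like the one below passes a random batch through the module and confirms the shape is preserved. The dimensions d_model=512 and d_ff=2048 match the original Transformer paper but are purely illustrative here:

```python
# Minimal smoke test for PositionWiseFeedForward (illustrative dimensions).
ffn = PositionWiseFeedForward(d_model=512, d_ff=2048, dropout=0.1)

x = torch.randn(8, 20, 512)  # (batch_size=8, seq_len=20, d_model=512)
out = ffn(x)

print(out.shape)  # torch.Size([8, 20, 512]) -- same shape as the input
```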
""" x = self.linear_1(x) x = self.activation(x) x = self.dropout(x) # Dropout often applied after activation x = self.linear_2(x) return x This PositionWiseFeedForward module takes a tensor of shape (batch_size, seq_len, d_model) and returns a tensor of the same shape, having applied the transformations independently at each sequence position.Assembling the Encoder LayerNow we combine the Multi-Head Self-Attention (which we'll assume is available as a module named MultiHeadAttention), the PositionWiseFeedForward network defined above, Layer Normalization, residual connections, and dropout into a single EncoderLayer.class EncoderLayer(nn.Module): def __init__(self, d_model, num_heads, d_ff, dropout=0.1): """ Initializes a single Transformer Encoder Layer. Args: d_model (int): The dimensionality of the input/output features (embeddings). num_heads (int): The number of attention heads. d_ff (int): The inner dimension of the feed-forward network. dropout (float): The dropout probability. """ super().__init__() self.self_attn = MultiHeadAttention(d_model, num_heads) # Assumed implementation self.feed_forward = PositionWiseFeedForward(d_model, d_ff, dropout) self.norm1 = nn.LayerNorm(d_model) self.norm2 = nn.LayerNorm(d_model) self.dropout1 = nn.Dropout(dropout) self.dropout2 = nn.Dropout(dropout) def forward(self, x, mask): """ Forward pass through the Encoder Layer. Args: x (torch.Tensor): Input tensor of shape (batch_size, seq_len, d_model). mask (torch.Tensor): Attention mask (optional, for padding). Shape typically (batch_size, 1, 1, seq_len) or similar depending on MultiHeadAttention implementation. Returns: torch.Tensor: Output tensor of shape (batch_size, seq_len, d_model). """ # 1. Multi-Head Self-Attention + Add & Norm attn_output = self.self_attn(query=x, key=x, value=x, mask=mask) # Residual connection 1: Add input 'x' to attention output # Apply dropout to the attention output before adding x = self.norm1(x + self.dropout1(attn_output)) # 2. Position-wise Feed-Forward Network + Add & Norm ff_output = self.feed_forward(x) # Residual connection 2: Add input to FFN ('x') to FFN output # Apply dropout to the FFN output before adding x = self.norm2(x + self.dropout2(ff_output)) return x In this EncoderLayer, the input x first goes through the multi-head self-attention mechanism. The output of the attention is then passed through dropout (dropout1), added back to the original input x (the first residual connection), and the sum is normalized (norm1). This normalized output then serves as the input to the position-wise feed-forward network. 
## Visualizing the Encoder Layer Flow

The following diagram illustrates the data flow within the EncoderLayer we just defined.

```dot
digraph EncoderLayer {
    rankdir=TB;
    node [shape=box, style="filled", fillcolor="#e9ecef", fontname="sans-serif"];
    edge [fontname="sans-serif"];

    subgraph cluster_0 {
        style=filled;
        fillcolor="#f8f9fa";
        label = "Encoder Layer";
        rankdir=TB;

        Input [label="Input (x)", shape=ellipse, fillcolor="#a5d8ff"];
        MHA [label="Multi-Head\nSelf-Attention", fillcolor="#bac8ff"];
        Dropout1 [label="Dropout", fillcolor="#ffc9c9"];
        Add1 [label="+", shape=circle, fillcolor="#b2f2bb", width=0.3, height=0.3, fixedsize=true];
        Norm1 [label="Layer Normalization", fillcolor="#ffec99"];
        FFN [label="Position-wise\nFeed-Forward", fillcolor="#bac8ff"];
        Dropout2 [label="Dropout", fillcolor="#ffc9c9"];
        Add2 [label="+", shape=circle, fillcolor="#b2f2bb", width=0.3, height=0.3, fixedsize=true];
        Norm2 [label="Layer Normalization", fillcolor="#ffec99"];
        Output [label="Output", shape=ellipse, fillcolor="#a5d8ff"];

        // Connections
        Input -> MHA [label=" Q, K, V"];
        Input -> Add1 [headport="w", minlen=2];
        MHA -> Dropout1;
        Dropout1 -> Add1;
        Add1 -> Norm1;
        Norm1 -> FFN;
        Norm1 -> Add2 [headport="w", minlen=2];
        FFN -> Dropout2;
        Dropout2 -> Add2;
        Add2 -> Norm2;
        Norm2 -> Output;
    }
}
```

Data flow within a single Transformer Encoder Layer, showing the Multi-Head Attention and Feed-Forward sub-layers, each followed by Dropout, a residual connection (Add), and Layer Normalization.

This structure, repeated multiple times (six layers in the original Transformer paper; later encoder models such as BERT use 12 or more), forms the complete Encoder stack. Each layer takes the output of the previous layer as its input, allowing the model to build increasingly complex representations of the input sequence. A minimal sketch of such a stack appears at the end of this section.

You now have a practical implementation of a core Transformer component. In the next chapter, we'll look at how to train these models effectively.
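As referenced above, here is one way the repetition into a full Encoder stack might look. This is an illustrative sketch, not part of the chapter's code: the Encoder class name, the use of nn.ModuleList, and the final LayerNorm are all assumptions.

```python
# Illustrative sketch (assumed, not from the chapter): stacking N encoder layers.
class Encoder(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList(
            [EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)]
        )
        self.norm = nn.LayerNorm(d_model)  # optional final normalization

    def forward(self, x, mask):
        # Each layer consumes the previous layer's output.
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)


# Example usage with the assumed hyperparameters from earlier:
# encoder = Encoder(num_layers=6, d_model=512, num_heads=8, d_ff=2048)
# out = encoder(x, mask)   # x: (batch_size, seq_len, d_model)
```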