After convolution layers extract features from the input, they produce feature maps. These maps, while informative, often retain a high spatial resolution (width and height). Processing these large maps in subsequent layers can be computationally expensive and may make the network overly sensitive to the precise location of features within the input. Pooling layers offer a way to address this by systematically reducing the spatial dimensions of the feature maps.
Pooling, also known as subsampling or downsampling, summarizes information within local regions of a feature map. It helps create a representation that is more compact and slightly more robust to small shifts or distortions in the input image. Think of it as creating a lower-resolution summary that retains the essential characteristics detected by the preceding convolutional layer.
The pooling operation slides a window (often called a pooling filter or kernel) across each input feature map independently. For each position of the window, it aggregates the values within that window into a single output value. The window then slides according to a specified stride, similar to convolution.
Unlike convolutional layers, pooling layers typically have no learnable parameters (weights or biases). The aggregation function (like maximum or average) is fixed. This makes them simpler and computationally cheaper.
The main parameters defining a pooling operation are:
- Window size: the spatial extent of the region summarized at each position (e.g., 2x2).
- Stride: how far the window moves between positions (e.g., 2). A stride equal to the window size gives non-overlapping regions, which is the most common configuration.
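Together, these parameters determine the output spatial size by the same arithmetic as a strided convolution without padding: out = floor((in - window) / stride) + 1. A minimal Python check (the helper name pooled_size is just for illustration):

```python
def pooled_size(in_size: int, window: int, stride: int) -> int:
    """Output spatial size of pooling with no padding."""
    return (in_size - window) // stride + 1

print(pooled_size(4, 2, 2))  # 2, matching the 4x4 -> 2x2 examples below
```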
Two primary types of pooling operations are frequently used in CNNs:
Max pooling selects the maximum value from the feature map region covered by the pooling window.
Consider a 4x4 input feature map and a 2x2 max pooling operation with a stride of 2:
Input Feature Map:
[[ 1, 3, 2, 4],
[ 5, 6, 7, 8],
[ 9, 0, 1, 2],
[ 3, 4, 5, 6]]
Pooling Window (2x2), Stride (2)
Top-Left Window: max(1, 3, 5, 6) = 6
Top-Right Window: max(2, 4, 7, 8) = 8
Bottom-Left Window: max(9, 0, 3, 4) = 9
Bottom-Right Window: max(1, 2, 5, 6) = 6
Output Feature Map (2x2):
[[ 6, 8 ],
[ 9, 6 ]]
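To make the mechanics concrete, here is a minimal NumPy sketch of 2x2 max pooling with stride 2 (the function name max_pool_2x2 is illustrative, not a library API). Running it on the input above reproduces the output map:

```python
import numpy as np

def max_pool_2x2(x: np.ndarray) -> np.ndarray:
    """2x2 max pooling with stride 2 on a single 2D feature map."""
    h, w = x.shape
    out = np.empty((h // 2, w // 2), dtype=x.dtype)
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            # Take the maximum of each non-overlapping 2x2 region.
            out[i // 2, j // 2] = x[i:i + 2, j:j + 2].max()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [9, 0, 1, 2],
              [3, 4, 5, 6]])
print(max_pool_2x2(x))
# [[6 8]
#  [9 6]]
```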
Average pooling computes the average value of all elements within the pooling window.
Using the same 4x4 input map, a 2x2 average pooling operation with a stride of 2 would yield:
Input Feature Map:
[[ 1, 3, 2, 4],
[ 5, 6, 7, 8],
[ 9, 0, 1, 2],
[ 3, 4, 5, 6]]
Pooling Window (2x2), Stride (2)
Top-Left Window: avg(1, 3, 5, 6) = 15/4 = 3.75
Top-Right Window: avg(2, 4, 7, 8) = 21/4 = 5.25
Bottom-Left Window: avg(9, 0, 3, 4) = 16/4 = 4.00
Bottom-Right Window: avg(1, 2, 5, 6) = 14/4 = 3.50
Output Feature Map (2x2):
[[ 3.75, 5.25 ],
[ 4.00, 3.50 ]]
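The same sliding-window sketch works for average pooling by swapping the maximum for the mean (again, a NumPy illustration rather than a library API):

```python
import numpy as np

def avg_pool_2x2(x: np.ndarray) -> np.ndarray:
    """2x2 average pooling with stride 2 on a single 2D feature map."""
    h, w = x.shape
    out = np.empty((h // 2, w // 2), dtype=float)
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            # Average each non-overlapping 2x2 region.
            out[i // 2, j // 2] = x[i:i + 2, j:j + 2].mean()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [9, 0, 1, 2],
              [3, 4, 5, 6]], dtype=float)
print(avg_pool_2x2(x))
# [[3.75 5.25]
#  [4.   3.5 ]]
```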
A variation is global pooling (either Global Max Pooling or Global Average Pooling). Instead of using a small window, it applies the pooling operation across the entire spatial dimension (width and height) of each feature map. If an input feature map has dimensions H×W×C, global pooling produces an output of size 1×1×C, effectively reducing each channel's map to a single number. This is often used near the end of a network to drastically reduce dimensionality before feeding into a final classification layer.
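In code, global pooling reduces to an aggregation over the spatial axes. A short NumPy sketch, assuming a feature map laid out as H×W×C (the array names are illustrative):

```python
import numpy as np

feature_maps = np.random.rand(7, 7, 64)   # H x W x C
gap = feature_maps.mean(axis=(0, 1))      # global average pooling: one value per channel
gmp = feature_maps.max(axis=(0, 1))       # global max pooling variant
print(gap.shape, gmp.shape)               # (64,) (64,)
```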
Figure: a 2x2 max pooling operation with stride 2 applied to a 4x4 input feature map. Each value in the output map corresponds to the maximum value in a 2x2 non-overlapping region of the input.
Pooling layers are typically inserted between successive convolutional layers (usually after the activation function applied to the convolution output). A common pattern in CNNs is:
CONV -> ReLU -> POOL -> CONV -> ReLU -> POOL -> ... -> Flatten -> Fully Connected -> Output
This structure allows the network to first learn hierarchical features (CONV + ReLU) and then periodically reduce the spatial resolution while retaining important information (POOL). This reduction helps manage computational load and build some tolerance to variations in feature positioning.
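As a sketch of this pattern, here is how such a stack might look in PyTorch (layer sizes are illustrative, assuming 28x28 single-channel inputs and 10 output classes):

```python
import torch.nn as nn

# Illustrative CONV -> ReLU -> POOL stack; shapes assume 28x28 grayscale inputs.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 28x28 -> 28x28
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),        # 28x28 -> 14x14, no learnable params
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 14x14 -> 14x14
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),        # 14x14 -> 7x7
    nn.Flatten(),                                 # 32 * 7 * 7 = 1568 features
    nn.Linear(32 * 7 * 7, 10),                    # final classification layer
)
```

Note that the pooling layers contribute no parameters to the model; each one only halves the spatial resolution between the convolutional stages.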
In summary, pooling is a fundamental operation in CNNs that reduces the spatial dimensions of feature maps, decreases computation, and introduces a degree of translation invariance, complementing the feature extraction role of convolutional layers. Max pooling is the most frequently used variant due to its effectiveness in preserving strong feature activations.