Convolutional Neural Networks, or CNNs, represent a specialized class of neural networks highly effective for processing data with a grid-like topology. While they can be applied to various data types, such as time series (1D grid) or volumetric data (3D grid), their most prominent success has been in the domain of computer vision, analyzing 2D grids of pixels in images. Unlike fully connected networks where every input unit is connected to every output unit in a layer, CNNs use a shared-weight architecture through convolutions, making them more efficient and scalable for high-dimensional inputs like images.
At the center of a CNN lies the convolution operation. Imagine a small window, called a filter or kernel, sliding across the input data (e.g., an image). This filter is typically a small matrix of learnable weights. As it slides, it performs an element-wise multiplication with the part of the input it is currently over, and then sums up the results to produce a single value in an output feature map. This process is repeated across the entire input.
A 2x2 filter operates on the highlighted 2x2 region of the input. The highlighted output '2' is computed as (1×0)+(0×1)+(2×1)+(1×0). The filter then slides to other regions.
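To make the sliding-window arithmetic concrete, here is a minimal sketch of this computation in plain Julia (strictly speaking this is cross-correlation, which is what deep learning frameworks typically compute under the name convolution). The function naive_conv and the input values are our own illustration, with the input chosen so that the top-left window reproduces the computation above:
# Naive sliding-window convolution, for illustration only
function naive_conv(input::AbstractMatrix, kernel::AbstractMatrix)
    kh, kw = size(kernel)
    out = zeros(eltype(input), size(input, 1) - kh + 1, size(input, 2) - kw + 1)
    for i in axes(out, 1), j in axes(out, 2)
        # Element-wise multiply the current window with the kernel, then sum
        window = @view input[i:i+kh-1, j:j+kw-1]
        out[i, j] = sum(window .* kernel)
    end
    return out
end
A = [1 0 2;
     2 1 0;
     0 1 1]
K = [0 1;
     1 0]
naive_conv(A, K)  # top-left output: (1×0)+(0×1)+(2×1)+(1×0) = 2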
Important parameters for a convolutional layer include:
- Filter size: the height and width of the sliding window (e.g., 3x3 or 5x5).
- Stride: how far the filter moves at each step. A stride of 1 shifts the filter one position at a time; larger strides produce smaller output feature maps.
- Padding: extra values (typically zeros) added around the border of the input so the filter can be applied at the edges. SamePad() in Flux aims to keep the output dimensions the same as the input (given a stride of 1).
In Flux.jl, you define a 2D convolutional layer using Conv((filter_height, filter_width), input_channels => output_channels, activation_function; stride=1, pad=0). For example, a layer with 16 filters of size 3x3, taking a 3-channel (RGB) image as input, and using ReLU activation would be:
using Flux
# 16 filters, each 3x3, operating on an input with 3 channels
# Output will have 16 channels
conv_layer = Conv((3, 3), 3 => 16, relu; stride=1, pad=SamePad())
The input_channels => output_channels pair specifies the depth of the input and the number of filters (which determines the depth of the output). SamePad() is a utility from Flux that calculates the padding needed to keep spatial dimensions the same when the stride is 1.
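As a quick sanity check, you can pass a dummy batch through conv_layer and inspect the output shape (the batch of four random images here is purely illustrative):
x = rand(Float32, 28, 28, 3, 4)  # 4 dummy RGB images in (W, H, C, N) order
size(conv_layer(x))              # (28, 28, 16, 4): SamePad keeps 28x28; 16 filters give 16 channels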
Each filter in a convolutional layer learns to detect specific patterns. In early layers of a network processing images, filters might learn to detect simple features like edges, corners, or color blobs. As data passes through subsequent convolutional layers, filters in deeper layers can combine these simpler features to detect more complex patterns, such as textures, parts of objects (a wheel, an eye), or even entire objects. This hierarchical feature learning is a powerful aspect of CNNs. The output of a convolutional layer, after applying all its filters, is a set of 2D arrays called feature maps, where each map corresponds to a specific learned feature.
After a convolution and activation, it's common to use a pooling layer. Pooling layers reduce the spatial dimensions (width and height) of the feature maps, which has several benefits:
- It reduces the number of parameters and computations in subsequent layers.
- It makes the learned features more robust to small shifts and distortions in the input.
- It summarizes the presence of features within regions of the input, rather than their exact positions.
The two most common types of pooling are max pooling, which keeps the maximum value in each window, and average pooling, which keeps the mean. Flux provides MaxPool and MeanPool for these:
# 2x2 max pooling layer with a stride of 2
max_pool_layer = MaxPool((2, 2); stride=2)
# 2x2 average pooling layer with a stride of 2
avg_pool_layer = MeanPool((2, 2); stride=2)
Typically, pooling windows are 2x2 with a stride of 2, which halves the height and width of the feature maps.
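A quick shape check with the layers defined above (the random input tensor is illustrative):
x = rand(Float32, 28, 28, 6, 1)  # (W, H, C, N)
size(max_pool_layer(x))          # (14, 14, 6, 1): width and height halved
size(avg_pool_layer(x))          # (14, 14, 6, 1)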
You construct CNNs by stacking these layers (convolutional, activation, pooling) sequentially. Flux.jl's Chain makes this straightforward. A common pattern for a block in a CNN is Conv -> Activation Function -> Pool.
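For example, one such block might look like this (the filter count and input size are illustrative):
block = Chain(
    Conv((3, 3), 3 => 16, relu; pad=SamePad()),  # convolution + activation
    MaxPool((2, 2))                              # downsampling
)
size(block(rand(Float32, 32, 32, 3, 1)))  # (16, 16, 16, 1)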
Diagram illustrating a common CNN structure. Convolutional and pooling layers extract features, which are then flattened and passed to dense layers for classification.
After several convolutional and pooling layers, the resulting high-level feature maps are typically flattened into a 1D vector. This vector then serves as input to one or more standard Dense (fully connected) layers, which perform the final classification or regression task. The Flux.flatten function is used for this purpose.
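As a sketch, Flux.flatten collapses all but the last (batch) dimension:
fmaps = rand(Float32, 7, 7, 16, 32)  # dummy feature maps for a batch of 32
size(Flux.flatten(fmaps))            # (784, 32), since 7 * 7 * 16 = 784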
Flux.jl's 2D convolutional layers (Conv) expect input data to be in a specific 4D format: (Width, Height, Channels, BatchSize).
For example, a batch of 64 grayscale images of size 28x28 pixels would have the shape (28, 28, 1, 64). A batch of 32 RGB color images of size 128x128 would be (128, 128, 3, 32). Ensuring your data is correctly shaped is an important step before feeding it into a CNN. The MLUtils.jl package, discussed earlier in this chapter, provides utilities for batching data, and you'll often reshape your data arrays to match this WHCN format.
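For example, if your images arrive as flat vectors, a single reshape puts them into WHCN form (the sizes here are illustrative):
flat = rand(Float32, 28 * 28, 64)    # 64 flattened 28x28 grayscale images
imgs = reshape(flat, 28, 28, 1, 64)  # (W, H, C, N), ready for a Conv layer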
Let's define a CNN model in Flux.jl suitable for a task like classifying 28x28 grayscale handwritten digits from the MNIST dataset.
using Flux
# Define image and network parameters
img_width, img_height = 28, 28
input_channels = 1 # Grayscale
num_classes = 10
# Calculate the size of the flattened features after conv/pool layers
# First MaxPool((2,2)) halves dimensions: 28x28 -> 14x14
# Second MaxPool((2,2)) halves dimensions again: 14x14 -> 7x7
# Number of channels after second Conv layer is 16
final_conv_w = img_width ÷ 4
final_conv_h = img_height ÷ 4
channels_after_conv = 16
flattened_size = final_conv_w * final_conv_h * channels_after_conv # 7 * 7 * 16 = 784
# Define the CNN model
model = Chain(
# First convolutional block
Conv((5, 5), input_channels => 6, relu; pad=SamePad()), # Input: (28, 28, 1, N) -> Output: (28, 28, 6, N)
MaxPool((2, 2)), # Output: (14, 14, 6, N)
# Second convolutional block
Conv((5, 5), 6 => channels_after_conv, relu; pad=SamePad()), # Output: (14, 14, 16, N)
MaxPool((2, 2)), # Output: (7, 7, 16, N)
# Flatten the output of the conv/pool layers
Flux.flatten, # Output: (784, N)
# Fully connected layers
Dense(flattened_size => 120, relu),
Dense(120 => 84, relu),
Dense(84 => num_classes) # Output layer for 10 classes
)
# To test with dummy data (e.g., one 28x28 grayscale image)
dummy_image_batch = rand(Float32, img_width, img_height, input_channels, 1)
output = model(dummy_image_batch)
println("Output shape: ", size(output)) # Should be (10, 1)
In this example, pad=SamePad() ensures the convolutional layers don't alter spatial dimensions, making it easier to track sizes before the MaxPool layers, which perform explicit downsampling. Flux.flatten reshapes the 4D tensor from the pooling layers into a 2D matrix of shape (features, batch_size) for the Dense layers. The final Dense layer has num_classes units. The activation (such as softmax) for multi-class classification is often combined with the loss function (e.g., Flux.logitcrossentropy) for better numerical stability.
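As a sketch of how this pairing works on the dummy batch from above (the label value is made up for illustration):
y = Flux.onehotbatch([3], 0:9)  # one-hot encode a dummy digit label
loss = Flux.logitcrossentropy(model(dummy_image_batch), y)
probs = softmax(model(dummy_image_batch))  # apply softmax only when you need probabilities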
This structure, inspired by LeNet, is a common pattern. Understanding these components prepares you for building CNNs for image-related tasks, including the hands-on exercise later in this chapter. CNNs' ability to learn hierarchical features and their efficiency with grid-like data make them indispensable in deep learning.