Convolutional Neural Networks, or CNNs, represent a specialized class of neural networks highly effective for processing data with a grid-like topology. While they can be applied to various data types, such as time series (1D grid) or volumetric data (3D grid), their most prominent success has been in the domain of computer vision, analyzing 2D grids of pixels in images. Unlike fully connected networks where every input unit is connected to every output unit in a layer, CNNs use a shared-weight architecture through convolutions, making them more efficient and scalable for high-dimensional inputs like images.
At the center of a CNN lies the convolution operation. Imagine a small window, called a filter or kernel, sliding across the input data (e.g., an image). This filter is typically a small matrix of learnable weights. As it slides, it performs an element-wise multiplication with the part of the input it is currently over, and then sums up the results to produce a single value in an output feature map. This process is repeated across the entire input.
A 2x2 filter operates on the highlighted 2x2 region of the input. The highlighted output '2' is computed as (1×0)+(0×1)+(2×1)+(1×0). The filter then slides to other regions.
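To make the sliding-window arithmetic concrete, here is a minimal sketch of this computation in plain Julia (strictly speaking this is cross-correlation, which is what deep learning frameworks typically compute under the name convolution). The function naive_conv and the input values are our own illustration, with the input chosen so that the top-left window reproduces the computation above:
# Naive sliding-window convolution, for illustration only
function naive_conv(input::AbstractMatrix, kernel::AbstractMatrix)
    kh, kw = size(kernel)
    out = zeros(eltype(input), size(input, 1) - kh + 1, size(input, 2) - kw + 1)
    for i in axes(out, 1), j in axes(out, 2)
        # Element-wise multiply the current window with the kernel, then sum
        window = @view input[i:i+kh-1, j:j+kw-1]
        out[i, j] = sum(window .* kernel)
    end
    return out
end
A = [1 0 2;
     2 1 0;
     0 1 1]
K = [0 1;
     1 0]
naive_conv(A, K)  # top-left output: (1×0)+(0×1)+(2×1)+(1×0) = 2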
Important parameters for a convolutional layer include:
- Filter size: the height and width of the sliding window (e.g., 3x3 or 5x5).
- Stride: how far the filter moves at each step. A stride of 1 shifts the filter one position at a time; larger strides produce smaller output feature maps.
- Padding: extra values (typically zeros) added around the border of the input so the filter can be applied at the edges. SamePad() in Flux aims to keep the output dimensions the same as the input (given a stride of 1).
In Flux.jl, you define a 2D convolutional layer using Conv((filter_height, filter_width), input_channels => output_channels, activation_function; stride=1, pad=0). For example, a layer with 16 filters of size 3x3, taking a 3-channel (RGB) image as input, and using ReLU activation would be:
using Flux
# 16 filters, each 3x3, operating on an input with 3 channels
# Output will have 16 channels
conv_layer = Conv((3, 3), 3 => 16, relu; stride=1, pad=SamePad())
The input_channels => output_channels pair specifies the depth of the input and the number of filters (which determines the depth of the output). SamePad() is a utility from Flux that calculates the padding needed to keep spatial dimensions the same when the stride is 1.
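As a quick sanity check, you can pass a dummy batch through conv_layer and inspect the output shape (the batch of four random images here is purely illustrative):
x = rand(Float32, 28, 28, 3, 4)  # 4 dummy RGB images in (W, H, C, N) order
size(conv_layer(x))              # (28, 28, 16, 4): SamePad keeps 28x28; 16 filters give 16 channels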
Each filter in a convolutional layer learns to detect specific patterns. In early layers of a network processing images, filters might learn to detect simple features like edges, corners, or color blobs. As data passes through subsequent convolutional layers, filters in deeper layers can combine these simpler features to detect more complex patterns, such as textures, parts of objects (a wheel, an eye), or even entire objects. This hierarchical feature learning is a powerful aspect of CNNs. The output of a convolutional layer, after applying all its filters, is a set of 2D arrays called feature maps, where each map corresponds to a specific learned feature.
After a convolution and activation, it's common to use a pooling layer. Pooling layers reduce the spatial dimensions (width and height) of the feature maps, which has several benefits:
- It reduces the number of parameters and computations in subsequent layers.
- It makes the learned features more robust to small shifts and distortions in the input.
- It summarizes the presence of features within regions of the input, rather than their exact positions.
The two most common types of pooling are max pooling, which keeps the maximum value in each window, and average pooling, which keeps the mean. Flux provides MaxPool and MeanPool for these:
# 2x2 max pooling layer with a stride of 2
max_pool_layer = MaxPool((2, 2); stride=2)
# 2x2 average pooling layer with a stride of 2
avg_pool_layer = MeanPool((2, 2); stride=2)
Typically, pooling windows are 2x2 with a stride of 2, which halves the height and width of the feature maps.
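A quick shape check with the layers defined above (the random input tensor is illustrative):
x = rand(Float32, 28, 28, 6, 1)  # (W, H, C, N)
size(max_pool_layer(x))          # (14, 14, 6, 1): width and height halved
size(avg_pool_layer(x))          # (14, 14, 6, 1)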
You construct CNNs by stacking these layers (convolutional, activation, pooling) sequentially. Flux.jl's Chain makes this straightforward. A common pattern for a block in a CNN is Conv -> Activation Function -> Pool.
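For example, one such block might look like this (the filter count and input size are illustrative):
block = Chain(
    Conv((3, 3), 3 => 16, relu; pad=SamePad()),  # convolution + activation
    MaxPool((2, 2))                              # downsampling
)
size(block(rand(Float32, 32, 32, 3, 1)))  # (16, 16, 16, 1)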
Diagram illustrating a common CNN structure. Convolutional and pooling layers extract features, which are then flattened and passed to dense layers for classification.
After several convolutional and pooling layers, the resulting high-level feature maps are typically flattened into a 1D vector. This vector then serves as input to one or more standard Dense (fully connected) layers, which perform the final classification or regression task. The Flux.flatten function is used for this purpose.
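As a sketch, Flux.flatten collapses all but the last (batch) dimension:
fmaps = rand(Float32, 7, 7, 16, 32)  # dummy feature maps for a batch of 32
size(Flux.flatten(fmaps))            # (784, 32), since 7 * 7 * 16 = 784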
Flux.jl's 2D convolutional layers (Conv) expect input data to be in a specific 4D format: (Width, Height, Channels, BatchSize).
For example, a batch of 64 grayscale images of size 28x28 pixels would have the shape (28, 28, 1, 64). A batch of 32 RGB color images of size 128x128 would be (128, 128, 3, 32). Ensuring your data is correctly shaped is an important step before feeding it into a CNN. The MLUtils.jl package, discussed earlier in this chapter, provides utilities for batching data, and you'll often reshape your data arrays to match this WHCN format.
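For example, if your images arrive as flat vectors, a single reshape puts them into WHCN form (the sizes here are illustrative):
flat = rand(Float32, 28 * 28, 64)    # 64 flattened 28x28 grayscale images
imgs = reshape(flat, 28, 28, 1, 64)  # (W, H, C, N), ready for a Conv layer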
Let's define a CNN model in Flux.jl suitable for a task like classifying 28x28 grayscale handwritten digits from the MNIST dataset.
using Flux
# Define image and network parameters
img_width, img_height = 28, 28
input_channels = 1 # Grayscale
num_classes = 10
# Calculate the size of the flattened features after conv/pool layers
# First MaxPool((2,2)) halves dimensions: 28x28 -> 14x14
# Second MaxPool((2,2)) halves dimensions again: 14x14 -> 7x7
# Number of channels after second Conv layer is 16
final_conv_w = img_width ÷ 4
final_conv_h = img_height ÷ 4
channels_after_conv = 16
flattened_size = final_conv_w * final_conv_h * channels_after_conv # 7 * 7 * 16 = 784
# Define the CNN model
model = Chain(
# First convolutional block
Conv((5, 5), input_channels => 6, relu; pad=SamePad()), # Input: (28, 28, 1, N) -> Output: (28, 28, 6, N)
MaxPool((2, 2)), # Output: (14, 14, 6, N)
# Second convolutional block
Conv((5, 5), 6 => channels_after_conv, relu; pad=SamePad()), # Output: (14, 14, 16, N)
MaxPool((2, 2)), # Output: (7, 7, 16, N)
# Flatten the output of the conv/pool layers
Flux.flatten, # Output: (784, N)
# Fully connected layers
Dense(flattened_size => 120, relu),
Dense(120 => 84, relu),
Dense(84 => num_classes) # Output layer for 10 classes
)
# To test with dummy data (e.g., one 28x28 grayscale image)
dummy_image_batch = rand(Float32, img_width, img_height, input_channels, 1)
output = model(dummy_image_batch)
println("Output shape: ", size(output)) # Should be (10, 1)
In this example, pad=SamePad() ensures the convolutional layers don't alter spatial dimensions, making it easier to track sizes before the MaxPool layers, which perform explicit downsampling. Flux.flatten reshapes the 4D tensor from the pooling layers into a 2D matrix of shape (features, batch_size) for the Dense layers. The final Dense layer has num_classes units. The activation (such as softmax) for multi-class classification is often combined with the loss function (e.g., Flux.logitcrossentropy) for better numerical stability.
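As a sketch of how this pairing works on the dummy batch from above (the label value is made up for illustration):
y = Flux.onehotbatch([3], 0:9)  # one-hot encode a dummy digit label
loss = Flux.logitcrossentropy(model(dummy_image_batch), y)
probs = softmax(model(dummy_image_batch))  # apply softmax only when you need probabilities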
This structure, inspired by LeNet, is a common pattern. Understanding these components prepares you for building CNNs for image-related tasks, including the hands-on exercise later in this chapter. CNNs' ability to learn hierarchical features and their efficiency with grid-like data make them indispensable in deep learning.