Having learned about the individual components like Conv2D and MaxPooling2D layers, let's now assemble them into a functional, albeit simple, Convolutional Neural Network architecture. The power of CNNs comes from stacking these layers strategically to create a hierarchy of feature detectors.
A common pattern for CNN architectures, especially for classification tasks, involves two main parts:

1. The convolutional base: a stack of Conv2D and MaxPooling2D layers responsible for extracting features from the input image. It processes the spatial information and learns representations at different levels of abstraction.
2. The classifier head: a Flatten layer and one or more Dense layers placed on top of the convolutional base. It takes the extracted features and performs the final classification task.

Let's break down how to build this structure using the Keras Sequential API.
The convolutional base usually starts with a Conv2D layer that receives the input image. Remember to specify the input shape for the model, either with an explicit Input layer (as in the code below) or via the input_shape argument of the first layer. This shape typically includes height, width, and the number of color channels (e.g., (32, 32, 3) for a 32x32 RGB image).
We then alternate Conv2D and MaxPooling2D layers. A common practice is to increase the number of filters (the first argument in Conv2D) in deeper convolutional layers. This allows the network to learn more complex patterns as the spatial dimensions get smaller due to pooling.
- Conv2D layers: These layers apply convolutional filters to detect local patterns. Using activation functions like ReLU (activation='relu') introduces non-linearity.
- MaxPooling2D layers: These layers downsample the feature maps, reducing dimensionality and making the learned features more robust to variations in object position. A typical pool_size=(2, 2) halves the height and width of the feature map.

Here's how you might start building the convolutional base in Keras:
import keras
from keras import layers
# Assuming input images are 32x32 RGB
input_shape = (32, 32, 3)
# Start building the convolutional base
model = keras.Sequential(name="simple_cnn_base")
model.add(layers.Input(shape=input_shape)) # Use Input layer for explicit shape definition
# First Convolutional Block
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu'))
model.add(layers.MaxPooling2D(pool_size=(2, 2)))
# Second Convolutional Block
model.add(layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu'))
model.add(layers.MaxPooling2D(pool_size=(2, 2)))
# (Add more blocks if needed)
print("Output shape after convolutional base:", model.output_shape)
# Example Output shape after convolutional base: (None, 6, 6, 64)
# Note: The exact shape depends on input_shape, padding, strides, and number of blocks.
The output of the convolutional base is a set of feature maps (e.g., shape (None, 6, 6, 64)), representing high-level features extracted from the input. The None dimension indicates the batch size, which can vary.
A simplified flow diagram of a convolutional base with two blocks.
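To make the batch dimension concrete, you can push a small batch of dummy data through the base defined above. This is only an illustrative sketch: the random array stands in for real images, and the printed shape assumes the two-block architecture built so far.

import numpy as np

# A dummy batch of 4 random "images" matching the expected input shape
dummy_images = np.random.rand(4, 32, 32, 3).astype("float32")

# Forward pass through the convolutional base defined above
feature_maps = model.predict(dummy_images, verbose=0)

print("Feature map shape for the dummy batch:", feature_maps.shape)
# Expected: (4, 6, 6, 64) -- the None dimension has become the actual batch size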
For each image, the output from the convolutional base is a 3D tensor (height, width, channels). However, standard Dense layers expect 1D vector inputs. This is where the Flatten layer comes in. It simply reshapes the multi-dimensional feature maps into a single long vector, discarding spatial structure but preserving the learned feature information.

You add the Flatten layer directly after the last pooling or convolutional layer of the base:
# Continuing the previous model definition...
model.add(layers.Flatten())
print("Output shape after Flatten:", model.output_shape)
# Example Output shape after Flatten: (None, 2304) (since 6 * 6 * 64 = 2304)
Now that the features are flattened into a 1D vector, we can add one or more Dense layers to perform the classification.

- Intermediate Dense layer(s): a Dense layer with a non-linear activation like ReLU. This layer learns combinations of the features extracted by the convolutional base.
- Output Dense layer: the final Dense layer must have a number of units equal to the number of classes in your classification problem. Its activation function depends on the nature of the classification:
  - softmax: for multi-class classification (each input belongs to exactly one class).
  - sigmoid: for binary classification or multi-label classification (each input can belong to multiple classes).

Let's complete our simple CNN architecture for a hypothetical 10-class classification problem (like MNIST or CIFAR-10):
import keras
from keras import layers
# --- Define the full model ---
num_classes = 10
input_shape = (32, 32, 3) # Example for CIFAR-10 like data
model = keras.Sequential(
[
layers.Input(shape=input_shape),
# Convolutional Base
layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
layers.MaxPooling2D(pool_size=(2, 2)),
layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
layers.MaxPooling2D(pool_size=(2, 2)),
# Can add more Conv/Pool layers here
# Transition to Classifier
layers.Flatten(),
# Classifier Head
layers.Dropout(0.5), # Dropout added for regularization (covered later)
layers.Dense(128, activation="relu"), # Intermediate Dense layer
layers.Dense(num_classes, activation="softmax"), # Output layer
],
name="simple_cnn_classifier",
)
# Display the model's architecture
model.summary()
Running model.summary() will produce output similar to this (exact numbers depend on the layers chosen):
Model: "simple_cnn_classifier"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ conv2d (Conv2D) │ (None, 30, 30, 32) │ 896 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ max_pooling2d (MaxPooling2D) │ (None, 15, 15, 32) │ 0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ conv2d_1 (Conv2D) │ (None, 13, 13, 64) │ 18,496 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ max_pooling2d_1 (MaxPooling2D) │ (None, 6, 6, 64) │ 0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ flatten (Flatten) │ (None, 2304) │ 0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout (Dropout) │ (None, 2304) │ 0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense (Dense) │ (None, 128) │ 295,040 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_1 (Dense) │ (None, 10) │ 1,290 │
└─────────────────────────────────┴───────────────────────────┴────────────┘
Total params: 315,722 (1.20 MB)
Trainable params: 315,722 (1.20 MB)
Non-trainable params: 0 (0.00 B)
This summary clearly shows the sequence of layers, the output shape at each stage, and the number of trainable parameters. Notice how the spatial dimensions (height and width) decrease through the convolutional base, while the number of channels (filters) often increases. After flattening, the data flows through standard dense layers for classification.
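As a quick sanity check before moving on to real data, you can run a batch of random values through the assembled model. The sketch below assumes the model defined above; the random inputs are placeholders for preprocessed images, and the untrained predictions are meaningless, but the shapes confirm the architecture is wired up correctly.

import numpy as np

# A random batch standing in for 8 preprocessed 32x32 RGB images
dummy_batch = np.random.rand(8, 32, 32, 3).astype("float32")

# Forward pass; the model is untrained, so these are arbitrary probabilities
predictions = model.predict(dummy_batch, verbose=0)

print(predictions.shape)     # (8, 10): one row per image, one probability per class
print(predictions[0].sum())  # ~1.0, because the softmax output sums to one per image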
This structure forms the basis for many successful CNNs used in image recognition. While simple, it incorporates the fundamental ideas of hierarchical feature extraction using convolutions and pooling, followed by classification using dense layers. In the following sections, we'll explore how to prepare image data and train such a network.