Having learned about the individual components like Conv2D and MaxPooling2D layers, let's now assemble them into a functional, albeit simple, Convolutional Neural Network architecture. The power of CNNs comes from stacking these layers strategically to create a hierarchy of feature detectors.
A common pattern for CNN architectures, especially for classification tasks, involves two main parts:

1. The convolutional base: a stack of Conv2D and MaxPooling2D layers responsible for extracting features from the input image. It processes the spatial information and learns representations at different levels of abstraction.
2. The classifier head: a Flatten layer and one or more Dense layers placed on top of the convolutional base. It takes the extracted features and performs the final classification task.

Let's break down how to build this structure using the Keras Sequential API.
The convolutional base usually starts with a Conv2D layer that receives the input image. Remember to specify the input shape for the model, either with an explicit Input layer (as in the code below) or via the input_shape argument of the first layer. This shape typically includes height, width, and the number of color channels (e.g., (32, 32, 3) for a 32x32 RGB image).
We then alternate Conv2D and MaxPooling2D layers. A common practice is to increase the number of filters (the first argument in Conv2D) in deeper convolutional layers. This allows the network to learn more complex patterns as the spatial dimensions get smaller due to pooling.
- Conv2D layers: These layers apply convolutional filters to detect local patterns. Using activation functions like ReLU (activation='relu') introduces non-linearity.
- MaxPooling2D layers: These layers downsample the feature maps, reducing dimensionality and making the learned features more robust to variations in object position. A typical pool_size=(2, 2) halves the height and width of the feature map.

Here's how you might start building the convolutional base in Keras:
import keras
from keras import layers
# Assuming input images are 32x32 RGB
input_shape = (32, 32, 3)
# Start building the convolutional base
model = keras.Sequential(name="simple_cnn_base")
model.add(layers.Input(shape=input_shape)) # Use Input layer for explicit shape definition
# First Convolutional Block
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu'))
model.add(layers.MaxPooling2D(pool_size=(2, 2)))
# Second Convolutional Block
model.add(layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu'))
model.add(layers.MaxPooling2D(pool_size=(2, 2)))
# (Add more blocks if needed)
print("Output shape after convolutional base:", model.output_shape)
# Example Output shape after convolutional base: (None, 6, 6, 64)
# Note: The exact shape depends on input_shape, padding, strides, and number of blocks.
The output of the convolutional base is a set of feature maps (e.g., shape (None, 6, 6, 64)), representing high-level features extracted from the input. The None dimension indicates the batch size, which can vary.
A simplified flow diagram of a convolutional base with two blocks.
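To make the batch dimension concrete, you can push a small batch of dummy data through the base defined above. This is only an illustrative sketch: the random array stands in for real images, and the printed shape assumes the two-block architecture built so far.

import numpy as np

# A dummy batch of 4 random "images" matching the expected input shape
dummy_images = np.random.rand(4, 32, 32, 3).astype("float32")

# Forward pass through the convolutional base defined above
feature_maps = model.predict(dummy_images, verbose=0)

print("Feature map shape for the dummy batch:", feature_maps.shape)
# Expected: (4, 6, 6, 64) -- the None dimension has become the actual batch size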
For each image, the output from the convolutional base is a 3D tensor (height, width, channels). However, standard Dense layers expect 1D vector inputs. This is where the Flatten layer comes in. It simply reshapes the multi-dimensional feature maps into a single long vector, discarding spatial structure but preserving the learned feature information.

You add the Flatten layer directly after the last pooling or convolutional layer of the base:
# Continuing the previous model definition...
model.add(layers.Flatten())
print("Output shape after Flatten:", model.output_shape)
# Example Output shape after Flatten: (None, 2304) (since 6 * 6 * 64 = 2304)
Now that the features are flattened into a 1D vector, we can add one or more Dense layers to perform the classification.

- Intermediate Dense layer(s): a Dense layer with a non-linear activation like ReLU. This layer learns combinations of the features extracted by the convolutional base.
- Output Dense layer: the final Dense layer must have a number of units equal to the number of classes in your classification problem. Its activation function depends on the nature of the classification:
  - softmax: for multi-class classification (each input belongs to exactly one class).
  - sigmoid: for binary classification or multi-label classification (each input can belong to multiple classes).

Let's complete our simple CNN architecture for a hypothetical 10-class classification problem (like MNIST or CIFAR-10):
import keras
from keras import layers
# --- Define the full model ---
num_classes = 10
input_shape = (32, 32, 3) # Example for CIFAR-10 like data
model = keras.Sequential(
[
layers.Input(shape=input_shape),
# Convolutional Base
layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
layers.MaxPooling2D(pool_size=(2, 2)),
layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
layers.MaxPooling2D(pool_size=(2, 2)),
# Can add more Conv/Pool layers here
# Transition to Classifier
layers.Flatten(),
# Classifier Head
layers.Dropout(0.5), # Dropout added for regularization (covered later)
layers.Dense(128, activation="relu"), # Intermediate Dense layer
layers.Dense(num_classes, activation="softmax"), # Output layer
],
name="simple_cnn_classifier",
)
# Display the model's architecture
model.summary()
Running model.summary() will produce output similar to this (exact numbers depend on the layers chosen):
Model: "simple_cnn_classifier"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ conv2d (Conv2D) │ (None, 30, 30, 32) │ 896 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ max_pooling2d (MaxPooling2D) │ (None, 15, 15, 32) │ 0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ conv2d_1 (Conv2D) │ (None, 13, 13, 64) │ 18,496 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ max_pooling2d_1 (MaxPooling2D) │ (None, 6, 6, 64) │ 0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ flatten (Flatten) │ (None, 2304) │ 0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout (Dropout) │ (None, 2304) │ 0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense (Dense) │ (None, 128) │ 295,040 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_1 (Dense) │ (None, 10) │ 1,290 │
└─────────────────────────────────┴───────────────────────────┴────────────┘
Total params: 315,722 (1.20 MB)
Trainable params: 315,722 (1.20 MB)
Non-trainable params: 0 (0.00 B)
This summary clearly shows the sequence of layers, the output shape at each stage, and the number of trainable parameters. Notice how the spatial dimensions (height and width) decrease through the convolutional base, while the number of channels (filters) often increases. After flattening, the data flows through standard dense layers for classification.
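As a quick sanity check before moving on to real data, you can run a batch of random values through the assembled model. The sketch below assumes the model defined above; the random inputs are placeholders for preprocessed images, and the untrained predictions are meaningless, but the shapes confirm the architecture is wired up correctly.

import numpy as np

# A random batch standing in for 8 preprocessed 32x32 RGB images
dummy_batch = np.random.rand(8, 32, 32, 3).astype("float32")

# Forward pass; the model is untrained, so these are arbitrary probabilities
predictions = model.predict(dummy_batch, verbose=0)

print(predictions.shape)     # (8, 10): one row per image, one probability per class
print(predictions[0].sum())  # ~1.0, because the softmax output sums to one per image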
This structure forms the basis for many successful CNNs used in image recognition. While simple, it incorporates the fundamental ideas of hierarchical feature extraction using convolutions and pooling, followed by classification using dense layers. In the following sections, we'll explore how to prepare image data and train such a network.