This practical exercise solidifies your understanding of the encoder, the decoder, the bottleneck, and the reconstruction loss. We'll use the popular MNIST dataset of grayscale handwritten-digit images. The goal is to train an autoencoder that compresses these images into a lower-dimensional representation and then reconstructs them.

Setting the Stage: Environment and Data

Before we begin, make sure your deep learning environment is ready. This example assumes a PyTorch setup. You'll need torch and torchvision for data loading and model building, numpy for numerical operations, and matplotlib for visualizing the results.

1. Importing Libraries

First, let's import the necessary libraries.

    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torchvision import datasets, transforms
    import numpy as np
    import matplotlib.pyplot as plt

2. Loading and Preparing the MNIST Dataset

The MNIST dataset is conveniently available through torchvision.datasets. Each image is 28x28 pixels. For this basic autoencoder, we'll flatten each 28x28 image into a vector of 784 pixel values. We also need to normalize the pixel values to a small fixed range, typically [0, 1] or [-1, 1], which helps with training stability.
PyTorch's transforms.ToTensor() scales pixels to [0, 1] automatically. The transforms.Normalize((0.5,), (0.5,)) step that follows it then maps them to [-1, 1], which matches the Tanh output layer of the model defined later in this section.

    # Convert to tensors, normalize to [-1, 1], and flatten each image
    transform = transforms.Compose([
        transforms.ToTensor(),                    # uint8 [0, 255] -> float [0, 1]
        transforms.Normalize((0.5,), (0.5,)),     # [0, 1] -> [-1, 1], to match a Tanh output layer
        transforms.Lambda(lambda x: x.view(-1)),  # Flatten the 1x28x28 image to 784 values
    ])

    # Load the MNIST dataset
    train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
    test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

    # Create data loaders
    batch_size = 256
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    # Get a sample to check shape (optional)
    sample_data, _ = next(iter(train_loader))
    print(f"Sample x_train batch shape: {sample_data.shape}")
    print(f"Flattened image size: {sample_data.shape[1]}")

You'll notice we use only the image data and ignore the labels (_). Autoencoders are trained in an unsupervised manner; their goal is to reconstruct the input, not to predict a label. The flattened image size should be 784.

Designing Our Basic Autoencoder

An autoencoder consists of two main parts: an encoder and a decoder. The encoder maps the input data to a lower-dimensional representation (the bottleneck), and the decoder attempts to reconstruct the original input from this representation.

1. Defining the Architecture

Let's define a simple architecture using PyTorch's nn.Module. We'll use nn.Linear layers.

- Input Layer: This matches the shape of our flattened MNIST images (784 features).
- Encoder: A sequence of nn.Linear layers that progressively reduce the dimensionality. For example, 784 -> 128 -> 64.
- Bottleneck Layer: This is the smallest layer in our network, representing the compressed latent space.
Let's choose a dimensionality of 32 for this example. This is significantly smaller than the input dimensionality of 784, making it an undercomplete autoencoder.
- Decoder: A sequence of nn.Linear layers that progressively increase the dimensionality, mirroring the encoder. For example, 32 -> 64 -> 128 -> 784.
- Output Layer: This layer has the same number of units as the input (784) and uses an activation function suited to the normalized pixel range (e.g., nn.Sigmoid for [0, 1], or nn.Tanh for [-1, 1] if you normalized to that range, as we did).

Here's how we can define it:

    class Autoencoder(nn.Module):
        def __init__(self, latent_dim=32):
            super(Autoencoder, self).__init__()
            self.latent_dim = latent_dim

            # Encoder: 784 -> 128 -> 64 -> latent_dim
            self.encoder = nn.Sequential(
                nn.Linear(784, 128),
                nn.ReLU(),
                nn.Linear(128, 64),
                nn.ReLU(),
                nn.Linear(64, latent_dim),
                nn.ReLU()  # The bottleneck layer
            )

            # Decoder: latent_dim -> 64 -> 128 -> 784
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 64),
                nn.ReLU(),
                nn.Linear(64, 128),
                nn.ReLU(),
                nn.Linear(128, 784),
                nn.Tanh()  # Use Tanh if input was normalized to [-1, 1], else use Sigmoid for [0, 1]
            )

        def forward(self, x):
            encoded = self.encoder(x)
            decoded = self.decoder(encoded)
            return decoded

    latent_dim = 32
    autoencoder = Autoencoder(latent_dim)

    # Move model to GPU if available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    autoencoder.to(device)

    print(autoencoder)

Below is a diagram illustrating the general structure of our autoencoder:

    digraph G {
        rankdir=TB;
        node [shape=box, style="filled,rounded", fontname="sans-serif", fillcolor="#a5d8ff"];
        edge [fontname="sans-serif"];

        Input [label="Input Data\n(784 features)"];
        Encoder_Layers [label="Encoder\n(Linear 128 units, ReLU)\n(Linear 64 units, ReLU)"];
        Bottleneck [label="Bottleneck Layer\n(Latent Representation, 32 features, ReLU)", fillcolor="#ffd8a8"];
        Decoder_Layers [label="Decoder\n(Linear 64 units, ReLU)\n(Linear 128 units, ReLU)"];
        Output [label="Reconstructed Data\n(784 features, Tanh)"];

        Input -> Encoder_Layers;
        Encoder_Layers -> Bottleneck [label="Compression"];
        Bottleneck -> Decoder_Layers;
        Decoder_Layers -> Output [label="Reconstruction"];

        {rank=same; Encoder_Layers; Decoder_Layers;}
    }

The flow of data through the autoencoder, from input, through compression in the encoder and bottleneck, to reconstruction by the decoder.

2. Defining Loss Function and Optimizer

Before training, we need to define the loss function and the optimizer. As discussed in the chapter, Mean Squared Error (MSE) is a common choice for reconstruction loss when dealing with continuous data like our normalized pixel values.

$$MSE = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{x}_i)^2$$

Here, $x_i$ is the $i$-th value of the original input, $\hat{x}_i$ is the corresponding reconstructed value, and $N$ is the number of values being compared. We'll use the Adam optimizer, which is a popular and effective choice for many deep learning tasks.

    criterion = nn.MSELoss()
    optimizer = optim.Adam(autoencoder.parameters(), lr=1e-3)

Training the Autoencoder

Now, we train the autoencoder. The distinctive aspect here is that the input data serves as both the input and the target output.
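
Because the target is the input itself, the loss for a batch is just the MSE between that batch and its reconstruction. The following is a minimal sketch, not part of the tutorial's pipeline: the tensors x and x_hat are made-up toy values standing in for a batch and its reconstruction. It checks that nn.MSELoss with its default mean reduction gives the same number as the MSE formula written out by hand.

```python
import torch
import torch.nn as nn

# Hypothetical toy batch: 2 "images" of 3 values each, already in [-1, 1]
x = torch.tensor([[1.0, 0.0, -1.0],
                  [0.5, -0.5, 0.0]])

# Hypothetical reconstructions that are slightly off
x_hat = torch.tensor([[0.8, 0.1, -1.0],
                      [0.5, -0.3, 0.2]])

criterion = nn.MSELoss()  # default reduction='mean' averages over every element

# The input x doubles as the target
builtin_loss = criterion(x_hat, x).item()

# Same quantity from the formula: mean of the squared differences
manual_loss = ((x - x_hat) ** 2).mean().item()

print(f"nn.MSELoss: {builtin_loss:.6f}")
print(f"manual:     {manual_loss:.6f}")
```

Note that the mean is taken over all elements in the batch (here N = 6), not per image.
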
The network learns to reconstruct what it's given. After each epoch we also evaluate on the MNIST test split, which serves as our validation set here.

    num_epochs = 50
    train_losses = []
    val_losses = []

    for epoch in range(num_epochs):
        # Training phase
        autoencoder.train()
        running_train_loss = 0.0
        for data, _ in train_loader:
            data = data.to(device)

            optimizer.zero_grad()
            outputs = autoencoder(data)
            loss = criterion(outputs, data)  # The input is also the target
            loss.backward()
            optimizer.step()

            # Weight the batch-mean loss by batch size so the epoch average is exact
            running_train_loss += loss.item() * data.size(0)

        epoch_train_loss = running_train_loss / len(train_loader.dataset)
        train_losses.append(epoch_train_loss)

        # Validation phase (no gradient tracking, no weight updates)
        autoencoder.eval()
        running_val_loss = 0.0
        with torch.no_grad():
            for data, _ in test_loader:
                data = data.to(device)
                outputs = autoencoder(data)
                loss = criterion(outputs, data)
                running_val_loss += loss.item() * data.size(0)

        epoch_val_loss = running_val_loss / len(test_loader.dataset)
        val_losses.append(epoch_val_loss)

        print(f'Epoch [{epoch+1}/{num_epochs}], '
              f'Train Loss: {epoch_train_loss:.4f}, '
              f'Validation Loss: {epoch_val_loss:.4f}')

We can plot the training and validation loss to see how our model learned:

    plt.figure(figsize=(10, 5))
    plt.plot(train_losses, label='Training Loss')
    plt.plot(val_losses, label='Validation Loss')
    plt.title('Model Loss During Training')
    plt.xlabel('Epoch')
    plt.ylabel('Loss (MSE)')
    plt.legend()
    plt.grid(True)
    plt.show()

Visualizing the Reconstructions

The true test of our autoencoder is how well it can reconstruct the input images.
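
A per-image reconstruction error can put a number next to the visual check. The sketch below is an illustration under stated assumptions: originals and reconstructions are random stand-in tensors (not MNIST data or real model outputs), and the last reconstruction is deliberately corrupted so there is something to find. It uses F.mse_loss with reduction='none' to keep one error per pixel, then averages over each image.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins: 4 flattened images in [-1, 1] and noisy "reconstructions"
originals = torch.rand(4, 784) * 2 - 1
reconstructions = originals + 0.1 * torch.randn(4, 784)
reconstructions[3] = -originals[3]  # deliberately bad reconstruction

# reduction='none' keeps the squared error of every pixel: shape (4, 784)
per_pixel = F.mse_loss(reconstructions, originals, reduction='none')

# Average over the pixel dimension to get one error per image: shape (4,)
per_image = per_pixel.mean(dim=1)

worst = per_image.argmax().item()
print("Per-image MSE:", per_image)
print("Worst-reconstructed image index:", worst)
```

With a trained model, the same two lines (per_pixel and per_image) applied to a test batch and its reconstructions would point to the digits the autoencoder struggles with most.
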
Let's use the trained autoencoder to reconstruct images from our test set and display a few of them alongside the originals.

    # Reconstruct images from the test set
    autoencoder.eval()  # Set model to evaluation mode
    with torch.no_grad():
        data_iter = iter(test_loader)
        data, _ = next(data_iter)  # Get a batch of test data
        data = data.to(device)
        decoded_imgs = autoencoder(data).cpu().numpy()  # Get reconstructions and move to CPU

    # Display original and reconstructed images
    n = 10  # Number of digits to display
    plt.figure(figsize=(20, 4))
    for i in range(n):
        # Display original
        ax = plt.subplot(2, n, i + 1)
        # Undo normalization for display: data was normalized to [-1, 1], so scale back to [0, 1]
        original_img = (data[i].cpu().numpy().reshape(28, 28) + 1) / 2
        plt.imshow(original_img, cmap='gray')
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)
        if i == 0:
            ax.set_title("Original")

        # Display reconstruction
        ax = plt.subplot(2, n, i + 1 + n)
        # Undo normalization for display: decoded_imgs are in [-1, 1], so scale back to [0, 1]
        reconstructed_img = (decoded_imgs[i].reshape(28, 28) + 1) / 2
        plt.imshow(reconstructed_img, cmap='gray')
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)
        if i == 0:
            ax.set_title("Reconstructed")
    plt.show()

You should see that the reconstructed digits, while perhaps a bit blurrier than the originals, are generally recognizable. This indicates that our autoencoder has learned a meaningful, compressed representation in its 32-dimensional bottleneck layer and can use it to generate a reasonable approximation of the original 784-dimensional input.

What We've Accomplished

In this hands-on session, you've built and trained a basic autoencoder using PyTorch.
You've seen how:

- The encoder maps high-dimensional input to a lower-dimensional bottleneck.
- The decoder attempts to reconstruct the original input from this compressed representation.
- The network is trained by minimizing a reconstruction loss (MSE in our case), where the input itself is the target.

This simple autoencoder demonstrates the fundamental principles. The latent representation learned by the bottleneck is the foundation for feature extraction, which we will explore in much more detail in the upcoming chapters. For instance, you can get the latent representation by passing input data through autoencoder.encoder(data). We'll soon see how different types of autoencoders and more sophisticated architectures can learn even more powerful and useful features.
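
As a rough sketch of that last step, the snippet below pulls latent vectors out of an encoder. To stay self-contained it makes two assumptions: encoder is an untrained stand-in with the same layer sizes as the tutorial's autoencoder.encoder, and batch is random data shaped like flattened MNIST images. With the trained model you would call autoencoder.encoder(data) on a real batch instead.

```python
import torch
import torch.nn as nn

# Stand-in encoder with the same layer sizes as the tutorial's encoder.
# It is untrained here; in practice you would use autoencoder.encoder.
encoder = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
)

# Hypothetical batch of 5 flattened "images" with values in [-1, 1]
batch = torch.rand(5, 784) * 2 - 1

encoder.eval()
with torch.no_grad():        # No gradients needed for feature extraction
    latent = encoder(batch)  # One 32-dimensional vector per image

print(latent.shape)
```

Because the bottleneck ends in a ReLU, every latent value is non-negative. These 32-dimensional vectors are what a downstream model would consume as features.
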