In this practical, we apply L1 and L2 regularization to improve neural network generalization. We build a simple neural network, train it on data designed to encourage overfitting, and then apply L1 and L2 regularization to observe their impact firsthand. PyTorch is used for the implementation, but the concepts apply equally to other frameworks.

## Setting Up the Scenario

Imagine we have a binary classification problem. We'll generate some synthetic data where the decision boundary isn't perfectly linear, making it easy for a flexible model to overfit the training noise.

First, let's import the necessary libraries and generate some data:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np

# Generate synthetic data
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)

# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
X_val_tensor = torch.tensor(X_val, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val, dtype=torch.float32).unsqueeze(1)
```

## Defining a Basic Neural Network

We'll use a simple feed-forward network with two hidden layers. This architecture is complex enough to potentially overfit our synthetic data.

```python
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.layer1 = nn.Linear(2, 128)
        self.relu1 = nn.ReLU()
        self.layer2 = nn.Linear(128, 64)
        self.relu2 = nn.ReLU()
        self.output_layer = nn.Linear(64, 1)  # Output layer for binary classification

    def forward(self, x):
        x = self.relu1(self.layer1(x))
        x = self.relu2(self.layer2(x))
        x = self.output_layer(x)  # No sigmoid here, using BCEWithLogitsLoss
        return x
```
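To get a feel for why this network can overfit, it helps to compare its capacity to the dataset size. The following is a small sketch (not part of the main workflow; `net` and `n_params` are just illustrative names) that counts the learnable parameters, roughly 8,700 of them, against only 350 training examples:

```python
# Quick capacity check: count learnable parameters in SimpleNet
net = SimpleNet()
n_params = sum(p.numel() for p in net.parameters() if p.requires_grad)
print(f"Learnable parameters: {n_params}")              # 2*128+128 + 128*64+64 + 64*1+1 = 8705
print(f"Training samples:     {X_train_tensor.shape[0]}")  # 350
```

With far more parameters than training points, the network can easily memorize the noise in the training set, which is exactly the behavior we want regularization to counteract.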
## Training Loop Helper Function

Let's define a function to handle the training process. This will make it easier to reuse the training logic for different regularization settings.

```python
def train_model(model, optimizer, criterion, X_train, y_train, X_val, y_val,
                epochs=500, l1_lambda=0.0):
    train_losses = []
    val_losses = []
    val_accuracies = []

    for epoch in range(epochs):
        model.train()  # Set model to training mode

        # Forward pass
        outputs = model(X_train)
        loss = criterion(outputs, y_train)

        # --- L1 Regularization (if applicable) ---
        if l1_lambda > 0:
            l1_penalty = 0
            for param in model.parameters():
                # Check if the parameter requires gradients (i.e., it's learnable)
                if param.requires_grad:
                    l1_penalty += torch.norm(param, 1)  # Calculate L1 norm
            loss = loss + l1_lambda * l1_penalty
        # --- End L1 Regularization ---

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # --- Validation ---
        model.eval()  # Set model to evaluation mode
        with torch.no_grad():
            val_outputs = model(X_val)
            val_loss = criterion(val_outputs, y_val)

            # Calculate accuracy
            predicted = torch.sigmoid(val_outputs) >= 0.5
            correct = (predicted == y_val.byte()).sum().item()  # Ensure comparison is correct type
            val_accuracy = correct / y_val.size(0)

        train_losses.append(loss.item())
        val_losses.append(val_loss.item())
        val_accuracies.append(val_accuracy)

        if (epoch + 1) % 100 == 0:
            print(f'Epoch [{epoch+1}/{epochs}], Train Loss: {loss.item():.4f}, '
                  f'Val Loss: {val_loss.item():.4f}, Val Acc: {val_accuracy:.4f}')

    return train_losses, val_losses, val_accuracies
```

Note the section specifically for adding the L1 penalty. We manually iterate through the model's parameters, calculate the L1 norm of each (`torch.norm(param, 1)`), sum them up, multiply by the L1 strength (`l1_lambda`), and add the result to the original loss before backpropagation.

## Baseline Model: No Regularization

Let's train the model without any regularization first to establish a baseline. We'll use the Adam optimizer and binary cross-entropy loss (with logits, as our model doesn't have a final sigmoid).

```python
# Instantiate model, criterion, and optimizer (no regularization)
model_base = SimpleNet()
criterion = nn.BCEWithLogitsLoss()
optimizer_base = optim.Adam(model_base.parameters(), lr=0.001)

print("Training Baseline Model (No Regularization)...")
base_train_loss, base_val_loss, base_val_acc = train_model(
    model_base, optimizer_base, criterion,
    X_train_tensor, y_train_tensor, X_val_tensor, y_val_tensor,
    epochs=500
)
```

Typically, you'd observe the training loss decreasing steadily while the validation loss decreases initially but then starts to increase, indicating overfitting. The validation accuracy might plateau or even decrease.

## Applying L2 Regularization (Weight Decay)

Adding L2 regularization is often straightforward because many optimizers include a built-in parameter for it, commonly called `weight_decay`. This parameter corresponds to the $\lambda$ in the L2 penalty term $\frac{\lambda}{2} ||w||_2^2$.

```python
# Instantiate model, criterion, and optimizer with L2
model_l2 = SimpleNet()
# Note: Re-instantiate criterion if it has state, although BCEWithLogitsLoss is stateless
criterion_l2 = nn.BCEWithLogitsLoss()
l2_lambda = 0.01  # Regularization strength
optimizer_l2 = optim.Adam(model_l2.parameters(), lr=0.001, weight_decay=l2_lambda)

print("\nTraining Model with L2 Regularization...")
l2_train_loss, l2_val_loss, l2_val_acc = train_model(
    model_l2, optimizer_l2, criterion_l2,
    X_train_tensor, y_train_tensor, X_val_tensor, y_val_tensor,
    epochs=500
)
```

When training with L2 regularization, you should notice that the gap between the training loss and validation loss is smaller compared to the baseline. The validation loss might reach a lower minimum value, and the validation accuracy might improve or be more stable. L2 generally prevents weights from growing too large, leading to a smoother decision boundary.
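For comparison with the manual L1 approach in `train_model`, the same L2 effect can be obtained by adding the squared L2 norm of the parameters to the loss yourself: adding $\frac{\lambda}{2} ||w||_2^2$ contributes the gradient $\lambda w$, which is what `weight_decay=l2_lambda` does in `optim.Adam` (note that `optim.AdamW` instead decouples the decay from the adaptive update). Below is a minimal sketch of a single training step, reusing `criterion` and the tensors defined earlier; `model_manual` and `optimizer_manual` are illustrative names, not part of the experiments above.

```python
# Manual L2 penalty added to the loss, mirroring the L1 branch in train_model.
l2_lambda = 0.01

model_manual = SimpleNet()
optimizer_manual = optim.Adam(model_manual.parameters(), lr=0.001)  # no weight_decay here

outputs = model_manual(X_train_tensor)
loss = criterion(outputs, y_train_tensor)

# Sum of squared weights over all learnable parameters
l2_penalty = sum((param ** 2).sum()
                 for param in model_manual.parameters() if param.requires_grad)
loss = loss + (l2_lambda / 2) * l2_penalty  # equivalent gradient to weight_decay=l2_lambda

optimizer_manual.zero_grad()
loss.backward()
optimizer_manual.step()
```

In practice, using the optimizer's `weight_decay` argument is preferred for L2 because it is less code and avoids recomputing the penalty inside the training loop.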
## Applying L1 Regularization

As seen in our `train_model` function, implementing L1 requires manually adding the penalty to the loss. Let's train a model using this approach.

```python
# Instantiate model, criterion, and optimizer for L1
model_l1 = SimpleNet()
criterion_l1 = nn.BCEWithLogitsLoss()
optimizer_l1 = optim.Adam(model_l1.parameters(), lr=0.001)  # No weight_decay here
l1_lambda = 0.001  # L1 regularization strength (often smaller than L2)

print("\nTraining Model with L1 Regularization...")
l1_train_loss, l1_val_loss, l1_val_acc = train_model(
    model_l1, optimizer_l1, criterion_l1,
    X_train_tensor, y_train_tensor, X_val_tensor, y_val_tensor,
    epochs=500,
    l1_lambda=l1_lambda  # Pass the L1 lambda to the training function
)
```

With L1 regularization, we also expect to see reduced overfitting, similar to L2. However, L1 has the characteristic effect of potentially driving some weights to exactly zero. This doesn't always manifest dramatically in dense networks but can contribute to simpler models. The optimal `l1_lambda` might differ significantly from the optimal `l2_lambda`.

## Comparing the Results

Visualizing the validation loss curves for all three models is the best way to see the impact of regularization.

*Figure: Validation Loss Comparison. Validation loss curves for baseline, L1-regularized (λ=0.001), and L2-regularized (λ=0.01) models over 500 epochs. Note how the baseline loss starts increasing (overfitting), while L1 and L2 maintain lower validation loss. (Illustrative data)*

You can similarly plot the validation accuracy. The regularized models will likely show higher or more stable validation accuracy compared to the baseline model, which might degrade after initially peaking.

*Figure: Validation Accuracy Comparison. Validation accuracy curves corresponding to the loss plot above. Regularized models achieve better and more sustained accuracy on unseen data. (Illustrative data)*
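The curves shown above are illustrative. To produce comparable plots from your own runs, a minimal sketch using matplotlib (imported earlier as `plt`) and the histories returned by `train_model` might look like this:

```python
# Plot validation loss for all three runs using the histories from train_model
epochs_axis = range(1, len(base_val_loss) + 1)
plt.figure(figsize=(8, 5))
plt.plot(epochs_axis, base_val_loss, label="Baseline")
plt.plot(epochs_axis, l2_val_loss, label="L2 (lambda=0.01)")
plt.plot(epochs_axis, l1_val_loss, label="L1 (lambda=0.001)")
plt.xlabel("Epoch")
plt.ylabel("Validation Loss (BCEWithLogitsLoss)")
plt.title("Validation Loss Comparison")
plt.legend()
plt.show()
```

Swapping in `base_val_acc`, `l2_val_acc`, and `l1_val_acc` gives the corresponding accuracy plot.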
## Tuning the Regularization Strength

The choice of $\lambda$ (the `weight_decay` or `l1_lambda` value) is important:

- **Too small a $\lambda$:** The regularization effect will be minimal, and the model might still overfit.
- **Too large a $\lambda$:** The penalty on the weights will dominate, potentially forcing weights to be too small (or zero for L1) and causing the model to underfit (high bias). It won't be able to learn the underlying patterns effectively.

Finding the right $\lambda$ typically involves hyperparameter tuning, often using techniques like grid search or random search on the validation set, which we will discuss later in the course. A small grid-search sketch is included at the end of this practical.

## Conclusion

This practical demonstrated how to implement L1 and L2 weight regularization in a typical PyTorch training workflow. We observed how adding these penalties to the loss function (directly for L1, or via the optimizer's `weight_decay` for L2) helps combat overfitting, leading to better generalization performance as evidenced by improved validation loss and accuracy. Remember that the effectiveness and the optimal strength ($\lambda$) depend on the specific model, dataset, and task. Experimentation is often necessary to find the best configuration.
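As a closing illustration, here is a minimal sketch of the grid search mentioned in the tuning section, sweeping the L2 strength with the `SimpleNet` model and `train_model` helper defined above. The candidate values are arbitrary examples, selection is based on the best validation loss reached, and in a full workflow you would tune `l1_lambda` the same way and keep a separate test set for the final evaluation.

```python
# Simple grid search over the L2 strength (candidate values are illustrative)
best_lambda, best_val_loss = None, float("inf")
for wd in [0.0, 1e-4, 1e-3, 1e-2, 1e-1]:
    model = SimpleNet()
    optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=wd)
    _, val_losses, _ = train_model(
        model, optimizer, nn.BCEWithLogitsLoss(),
        X_train_tensor, y_train_tensor, X_val_tensor, y_val_tensor,
        epochs=500
    )
    candidate_loss = min(val_losses)  # best validation loss reached during training
    print(f"weight_decay={wd}: best val loss {candidate_loss:.4f}")
    if candidate_loss < best_val_loss:
        best_lambda, best_val_loss = wd, candidate_loss

print(f"Selected weight_decay: {best_lambda} (val loss {best_val_loss:.4f})")
```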