Now that you have a solid grasp of how to construct training loops, evaluate models, and apply techniques like regularization, it's time to put all these pieces together. This hands-on practical session will guide you through training a neural network, evaluating its performance, and then fine-tuning it to achieve better results. We'll simulate a common workflow, emphasizing the iterative nature of model development.
For this exercise, we'll tackle a binary classification problem using a synthetic dataset. This allows us to focus on the training and tuning mechanics without getting bogged down in complex data loading or preprocessing.
First, let's set up our environment and generate some data. We'll need Flux, MLUtils for data handling, Plots (or your preferred plotting library) for visualization, and Random for reproducibility.
using Flux
using MLUtils: DataLoader, unsqueeze
using Random
using Printf
using Statistics: mean
# For reproducibility
Random.seed!(123)
# Generate a synthetic dataset for binary classification
function generate_data(n_samples=200)
    # Class 0: centered around (-1, -1)
    X1 = randn(Float32, 2, n_samples ÷ 2) .- 1.0f0
    Y1 = zeros(Int, n_samples ÷ 2)
    # Class 1: centered around (1, 1)
    X2 = randn(Float32, 2, n_samples ÷ 2) .+ 1.0f0
    Y2 = ones(Int, n_samples ÷ 2)
    X = hcat(X1, X2)
    Y = vcat(Y1, Y2)
    # Shuffle the data
    indices = shuffle(1:n_samples)
    X = X[:, indices]
    Y = Y[indices]
    # Reshape Y to 1xN, as Flux's binary cross-entropy expects
    return X, unsqueeze(Float32.(Y), 1)
end
X_train, Y_train = generate_data(400)
X_test, Y_test = generate_data(100)
# Create DataLoaders
batch_size = 32
train_loader = DataLoader((X_train, Y_train), batchsize=batch_size, shuffle=true)
test_loader = DataLoader((X_test, Y_test), batchsize=batch_size)
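Before training, it helps to confirm the layout the loader yields: Flux expects features as columns, so each full batch should be a 2x32 feature matrix paired with a 1x32 label row. A quick check:
# Peek at one batch to verify the (features, labels) layout
x_sample, y_sample = first(train_loader)
println("Batch feature size: ", size(x_sample))  # expected (2, 32)
println("Batch label size:   ", size(y_sample))  # expected (1, 32)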
Now, let's define a simple Multilayer Perceptron (MLP) for our classification task.
input_dim = 2
hidden_dim = 10
output_dim = 1 # Single output for binary classification with sigmoid
model_v1 = Chain(
    Dense(input_dim, hidden_dim, relu),
    Dense(hidden_dim, output_dim)  # sigmoid is applied inside logitbinarycrossentropy
)
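As an optional sanity check, the untrained model should map a handful of examples to one raw logit per column:
# A 2xN feature matrix maps to a 1xN matrix of logits
logits = model_v1(X_train[:, 1:5])
println(size(logits))  # (1, 5)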
# Loss function and optimizer
loss_fn(x, y) = Flux.logitbinarycrossentropy(model_v1(x), y)
optimizer = Adam(0.01) # Initial learning rate
# Parameters to train
params = Flux.params(model_v1)
Here, logitbinarycrossentropy is suitable as it applies a sigmoid internally and is numerically more stable than binarycrossentropy with a separate sigmoid layer.
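To see the stability difference concretely, here is a small illustration using only the two Flux loss functions; the exact output of the unfused version depends on your Flux version's internal clamping:
# With an extreme logit and target 1, the true loss is -log(sigmoid(-100)) ≈ 100.
# But sigmoid(-100f0) underflows to 0f0 in Float32, so the two-step version loses
# the answer (Inf, or a clamped inaccurate value, depending on the Flux version).
# The fused logit form stays accurate.
z = fill(-100f0, 1, 1)  # logits, 1x1 matrix
y = fill(1f0, 1, 1)     # targets, 1x1 matrix
println(Flux.binarycrossentropy(sigmoid.(z), y))  # underflow: inaccurate
println(Flux.logitbinarycrossentropy(z, y))       # ≈ 100.0, as expected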
Let's implement a basic training loop and train our model_v1.
function accuracy(model, data_loader)
    correct = 0
    total = 0
    for (x, y) in data_loader
        # Apply sigmoid to the model's logits to get probabilities
        y_hat_prob = sigmoid.(model(x))
        # Threshold probabilities into binary predictions (0 or 1)
        y_hat = ifelse.(y_hat_prob .> 0.5f0, 1.0f0, 0.0f0)
        correct += sum(y_hat .== y)
        total += length(y)
    end
    return correct / total
end
epochs = 20
history_v1 = Dict("loss" => Float64[], "accuracy" => Float64[])
println("Training model_v1...")
for epoch in 1:epochs
    epoch_loss = 0.0
    for (x_batch, y_batch) in train_loader
        # Calculate loss and gradients
        batch_loss, grads = Flux.withgradient(params) do
            loss_fn(x_batch, y_batch)
        end
        # Update parameters
        Flux.update!(optimizer, params, grads)
        epoch_loss += batch_loss * size(x_batch, 2)  # weight by batch size
    end
    avg_epoch_loss = epoch_loss / size(X_train, 2)
    # Calculate accuracy on the training set for monitoring
    train_acc = accuracy(model_v1, train_loader)
    push!(history_v1["loss"], avg_epoch_loss)
    push!(history_v1["accuracy"], train_acc)
    @printf "Epoch %2d: Loss = %.4f, Train Accuracy = %.2f%%\n" epoch avg_epoch_loss (train_acc * 100)
end
# Evaluate on test set
test_acc_v1 = accuracy(model_v1, test_loader)
println("Final Test Accuracy (model_v1): $(test_acc_v1 * 100)%")
After running this, you'll likely see a decent accuracy, but perhaps there's room for improvement. Let's assume our model_v1 achieved around 90-95% test accuracy. The training loss should decrease, and accuracy should increase over epochs.
One common issue is overfitting, where the model performs well on training data but poorly on unseen test data. Regularization techniques help combat this. Let's add Dropout to our model.
model_v2 = Chain(
    Dense(input_dim, hidden_dim, relu),
    Dropout(0.3),  # add dropout after the first hidden layer
    Dense(hidden_dim, output_dim)
)
loss_fn_v2(x, y) = Flux.logitbinarycrossentropy(model_v2(x), y)
optimizer_v2 = Adam(0.01) # Reset optimizer or use a new one
params_v2 = Flux.params(model_v2)
history_v2 = Dict("loss" => Float64[], "accuracy" => Float64[])
println("\nTraining model_v2 with Dropout...")
for epoch in 1:epochs  # same number of epochs for comparison
    epoch_loss = 0.0
    # Important: set the model to training mode so Dropout is active
    Flux.trainmode!(model_v2)
    for (x_batch, y_batch) in train_loader
        batch_loss, grads = Flux.withgradient(params_v2) do
            loss_fn_v2(x_batch, y_batch)
        end
        Flux.update!(optimizer_v2, params_v2, grads)
        epoch_loss += batch_loss * size(x_batch, 2)
    end
    avg_epoch_loss = epoch_loss / size(X_train, 2)
    # Important: set the model to test mode for evaluation
    Flux.testmode!(model_v2)
    train_acc = accuracy(model_v2, train_loader)
    push!(history_v2["loss"], avg_epoch_loss)
    push!(history_v2["accuracy"], train_acc)
    @printf "Epoch %2d: Loss = %.4f, Train Accuracy = %.2f%%\n" epoch avg_epoch_loss (train_acc * 100)
end
Flux.testmode!(model_v2) # Ensure test mode for final evaluation
test_acc_v2 = accuracy(model_v2, test_loader)
println("Final Test Accuracy (model_v2 with Dropout): $(test_acc_v2 * 100)%")
When using layers like Dropout or BatchNorm, it's important to switch the model between training mode (Flux.trainmode!) and test mode (Flux.testmode!). Dropout is only active during training. For our simple dataset, dropout might not show a dramatic improvement or could even slightly degrade performance if the model wasn't overfitting much to begin with. However, on more complex datasets, it's a valuable tool.
The learning rate is one of the most significant hyperparameters. A rate that's too high can cause the optimizer to overshoot the minimum, while one that's too low can lead to very slow convergence or getting stuck in suboptimal local minima.
Let's try a different learning rate with our model_v1 architecture (the non-dropout version, for a clearer comparison of the learning rate effect alone).
# Re-initialize model_v1 or create a new instance if you want to keep the old one
model_v3 = Chain(
    Dense(input_dim, hidden_dim, relu),
    Dense(hidden_dim, output_dim)
)
loss_fn_v3(x, y) = Flux.logitbinarycrossentropy(model_v3(x), y)
# Try a smaller learning rate
optimizer_v3 = Adam(0.001)
params_v3 = Flux.params(model_v3)
history_v3 = Dict("loss" => Float64[], "accuracy" => Float64[])
println("\nTraining model_v3 with learning rate 0.001...")
for epoch in 1:epochs  # use the same number of epochs
    epoch_loss = 0.0
    for (x_batch, y_batch) in train_loader
        batch_loss, grads = Flux.withgradient(params_v3) do
            loss_fn_v3(x_batch, y_batch)
        end
        Flux.update!(optimizer_v3, params_v3, grads)
        epoch_loss += batch_loss * size(x_batch, 2)
    end
    avg_epoch_loss = epoch_loss / size(X_train, 2)
    train_acc = accuracy(model_v3, train_loader)
    push!(history_v3["loss"], avg_epoch_loss)
    push!(history_v3["accuracy"], train_acc)
    @printf "Epoch %2d: Loss = %.4f, Train Accuracy = %.2f%%\n" epoch avg_epoch_loss (train_acc * 100)
end
test_acc_v3 = accuracy(model_v3, test_loader)
println("Final Test Accuracy (model_v3 with LR 0.001): $(test_acc_v3 * 100)%")
Compare test_acc_v3 with test_acc_v1. Did the smaller learning rate help, hinder, or make little difference? Sometimes a smaller learning rate requires more epochs to converge. This process of trying different values is the essence of hyperparameter tuning. More systematic approaches include grid search, random search, or Bayesian optimization, which are beyond this initial hands-on but build upon this trial-and-error foundation.
Callbacks can simplify your training loop and add powerful functionality, like logging metrics, saving models, or implementing early stopping. Flux doesn't have a built-in callback system as extensive as some Python frameworks, but you can easily implement similar logic yourself; for instance, Flux.train! accepts a cb keyword argument that is invoked during training.
Let's demonstrate a simple custom logging action within our manual loop. For more complex scenarios, you might use that cb hook or libraries that extend Flux with callback functionality.
Here's how you might integrate a simple logging action:
# ... (model, loss, optimizer, params defined as before) ...
# Example: training model_v1 again (continuing from its current weights),
# with an explicit callback-like action each epoch
println("\nTraining model_v1 with a simple callback-like action...")
for epoch in 1:epochs
    # Callback action at the start of an epoch
    # println("Starting epoch $epoch...")
    Flux.train!(loss_fn, params, train_loader, optimizer)  # Flux.train! for brevity
    # Callback action at the end of an epoch
    current_loss = mean(loss_fn(x, y) for (x, y) in train_loader)  # approximate
    current_acc = accuracy(model_v1, train_loader)
    @printf "Epoch %2d: Loss = %.4f, Train Accuracy = %.2f%%\n" epoch current_loss (current_acc * 100)
    # Example: early stopping (very basic)
    # if current_loss < 0.05 break end
end
Flux.train! simplifies the batch-iteration part of the training loop. More sophisticated callbacks, like those for saving the best model or adjusting the learning rate dynamically (learning rate scheduling), can be integrated into this loop structure.
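As a concrete example, here is a sketch combining both ideas: halving the learning rate when accuracy plateaus and keeping a copy of the best model seen so far. It assumes the implicit-params API used above, where the Adam learning rate lives in the optimiser's eta field; the patience of 3 epochs and the halving factor are arbitrary illustrative choices, and on this toy setup we monitor the test set where a real project would use a separate validation set:
# Sketch: manual LR scheduling plus best-model tracking inside the loop
function train_with_schedule!(model, loss, ps, opt; epochs=20, patience=3)
    best_acc, best_model, stall = 0.0, deepcopy(model), 0
    for epoch in 1:epochs
        Flux.train!(loss, ps, train_loader, opt)
        acc = accuracy(model, test_loader)  # ideally a held-out validation set
        if acc > best_acc
            best_acc, best_model, stall = acc, deepcopy(model), 0
        else
            stall += 1
        end
        if stall >= patience  # plateau: halve the learning rate
            opt.eta /= 2
            stall = 0
            @printf "Epoch %2d: reducing learning rate to %.5f\n" epoch opt.eta
        end
    end
    return best_model, best_acc
end

# best_model_v1, best_acc_v1 = train_with_schedule!(model_v1, loss_fn, params, optimizer)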
Visualizing metrics like loss and accuracy over epochs is invaluable for understanding model behavior. Let's plot the training loss for one of our models; a placeholder for the resulting chart appears below. If you have Plots.jl and a backend (like GR or PlotlyJS) installed, you can adapt the code that follows.
[Figure: training loss curve for a run of model_v1. A decreasing trend indicates learning.]
To generate this plot with your actual history_v1["loss"] data using Plots.jl:
# Assuming Plots.jl is installed and you have a backend
# using Plots
# plotly()  # or your preferred backend, e.g. gr()
# plot(1:epochs, history_v1["loss"], label="Model v1 Training Loss",
#      xlabel="Epoch", ylabel="Loss", title="Training Loss Over Epochs",
#      linewidth=2, marker=:circle)
# You could similarly overlay accuracy:
# plot!(1:epochs, history_v1["accuracy"], label="Model v1 Training Accuracy")
Observing these plots helps diagnose issues. A flat loss curve might indicate a learning rate that's too small or a problem with gradients. A loss that increases could mean the learning rate is too high. If training accuracy is high but test accuracy is low, it signals overfitting.
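That last signal can also be checked numerically; the 5-point gap threshold below is an arbitrary illustration, not a standard value:
# A quick numeric check for the overfitting signal described above
gap = accuracy(model_v1, train_loader) - accuracy(model_v1, test_loader)
if gap > 0.05
    @printf "Possible overfitting: train-test accuracy gap of %.1f points\n" (gap * 100)
end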
"Fine-tuning" broadly refers to the process of making adjustments to a model or training process to improve performance. This can range from:
RMSProp
instead of Adam
).The steps we took by adding dropout and changing the learning rate are forms of fine-tuning. Each change should ideally be evaluated systematically. Typically, you'd change one thing at a time to understand its impact.
In this session, you've:
- Built and trained a baseline classifier with a manual training loop.
- Evaluated its accuracy on held-out test data.
- Applied Dropout as a regularization technique and observed its effect, remembering trainmode! and testmode!.
- Experimented with the learning rate, one of the most influential hyperparameters.
This iterative cycle of training, evaluating, and refining is central to applied deep learning. Each dataset and problem will present unique challenges, but the foundational techniques for training and fine-tuning explored here provide a strong starting point for building effective models with Julia and Flux.jl. Remember that patience and systematic experimentation are your best allies in this process.