Effectively managing and feeding data to your neural networks is fundamental for successful model training, especially when working with large datasets. As you begin to construct more complex architectures like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), the mechanisms for iterating, batching, and transforming your data become increasingly important. Julia's MLUtils.jl package provides a suite of tools designed to streamline these data handling tasks, making your deep learning workflows more efficient and manageable.
This section will guide you through using MLUtils.jl to prepare your datasets for training. We'll cover how to iterate over observations, create mini-batches, shuffle data for better generalization, split datasets, and apply on-the-fly transformations. These utilities are designed to integrate smoothly with Flux.jl and your GPU-accelerated training pipelines.
Training deep learning models typically involves processing data in chunks, known as mini-batches, rather than one sample at a time or the entire dataset at once. This approach offers a balance between computational efficiency and the stability of the gradient updates during optimization. MLUtils.jl simplifies the creation and management of these mini-batches, along with other common data preparation steps.
The main functionalities provided by MLUtils.jl that are particularly useful for deep learning include iterating over individual observations (eachobs), building shuffled mini-batches (DataLoader, or eachobs with a batchsize), splitting datasets into training and validation portions (splitobs), and applying lazy per-observation transformations (mapobs). Let's look at how these features are implemented.
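Before running the examples below, make sure the package is installed in your active environment; this is a standard Pkg step and is shown only for completeness.
using Pkg
Pkg.add("MLUtils")  # one-time installation into the active environment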
eachobs
At the heart of MLUtils.jl is the concept of an "observation." An observation is typically a single data point, often consisting of features and a corresponding label. The eachobs function provides a general way to iterate over observations in your dataset.
If your data is stored as a tuple of arrays (e.g., (features, labels)), eachobs will yield tuples where each element is a slice or view of the original arrays corresponding to a single observation.
using MLUtils
# Sample data: 10 features, 5 samples
X = rand(Float32, 10, 5)
# Corresponding labels for the 5 samples
Y = [1, 0, 1, 1, 0]
# Iterate over each observation
for (x_obs, y_obs) in eachobs((X, Y))
# x_obs will be a vector of 10 features for one sample
# y_obs will be a single label
println("Features: ", size(x_obs), ", Label: ", y_obs)
end
This simple iteration is the foundation upon which more complex operations like batching are built. MLUtils.jl also defines numobs to get the number of observations in a data container and getobs to retrieve a specific observation by index.
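As a quick illustration of these two helpers on the (X, Y) tuple defined above (a small sketch; the printed values depend on the random data):
println("Number of observations: ", numobs((X, Y)))  # 5, taken from the last dimension / length
x3, y3 = getobs((X, Y), 3)  # third observation: a 10-element feature vector and its label
println("Third observation: ", size(x3), ", label: ", y3)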
DataLoader and eachobs
For training neural networks, you'll almost always want to process data in mini-batches. MLUtils.jl offers two closely related ways to achieve this: the DataLoader type, and the eachobs function called with a batchsize keyword argument.
DataLoader: This constructs an iterator that yields batches of data. It's flexible and allows you to specify the batch size, whether to shuffle the data, and how to handle the last batch if it's smaller than the specified size.
using MLUtils
# Sample data: 2 features, 100 samples
features = rand(Float32, 2, 100)
labels = rand(Int, 100) # 100 integer labels
dataset = (features, labels)
# Create a DataLoader
# batchsize: Number of samples per batch
# shuffle: If true, shuffles the data at the beginning of each epoch (iteration over the full dataset)
# partial: If true (the default), the last, smaller batch is included. If false, it's dropped, as in this example.
loader = DataLoader(dataset, batchsize=32, shuffle=true, partial=false)
# Iterate through batches
for epoch in 1:3 # Example: 3 epochs
println("Epoch: ", epoch)
for (x_batch, y_batch) in loader
# x_batch will be a 2x32 matrix (features for 32 samples)
# y_batch will be a vector of 32 labels
# In this example, with 100 samples and batchsize 32,
# and partial=false, we'd get 3 batches of 32. 100 = 3*32 + 4. The last 4 are dropped.
println("Batch sizes: ", size(x_batch), ", ", size(y_batch))
# Here you would typically perform a training step:
# 1. Move batch to GPU (if applicable)
# 2. Forward pass: model(x_batch)
# 3. Calculate loss: loss_function(predictions, y_batch)
# 4. Backward pass (calculate gradients)
# 5. Update model parameters
end
end
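Because DataLoader is an ordinary Julia iterator, you can also inspect it with Base functions before training. A small sketch using the loader defined above:
println("Batches per epoch: ", length(loader))  # 3 here, since partial=false drops the last 4 samples
x_first, y_first = first(loader)  # materialize a single batch to check shapes
println("First batch: ", size(x_first), ", ", size(y_first))  # (2, 32) and (32,)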
eachobs with batchsize: For simpler use cases, eachobs accepts the same keyword arguments as DataLoader, so passing batchsize (along with options like shuffle) makes it yield mini-batches directly, without constructing a DataLoader object.
using MLUtils
features = rand(Float32, 2, 100)
labels = rand(Int, 100)
dataset = (features, labels)
# Using eachobs with a batchsize to iterate over mini-batches directly
for (x_batch, y_batch) in eachobs(dataset, batchsize=16, shuffle=true)
# x_batch will be a 2x16 matrix
# y_batch will be a vector of 16 labels
# Process the batch...
# println("Batch sizes using eachobs: ", size(x_batch), ", ", size(y_batch))
end
Using shuffle=true is generally recommended for training data. It helps prevent the model from learning patterns based on the order of data presentation and can lead to better generalization. For validation or test data, shuffling is usually turned off (shuffle=false) to ensure consistent evaluation.
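If you need the shuffling to be reproducible (for debugging or for comparing runs), DataLoader accepts an rng keyword argument; a small sketch, assuming a reasonably recent MLUtils.jl version:
using Random
rng = MersenneTwister(42)  # fixed seed, so the shuffle order is reproducible across runs
repro_loader = DataLoader(dataset, batchsize=16, shuffle=true, rng=rng)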
splitobs
A common requirement is to divide your dataset into training, validation, and sometimes test sets. MLUtils.jl provides the splitobs function for this.
using MLUtils
# 1000 samples, 20 features each
X_all = rand(Float32, 20, 1000)
Y_all = rand(Float32, 1, 1000) # Regression task, 1 output
full_dataset = (X_all, Y_all)
# Split into training (80%) and validation (20%)
# shuffle=true is good practice before splitting
train_data, val_data = splitobs(full_dataset, at=0.8, shuffle=true)
# train_data and val_data are themselves tuples: (X_train, Y_train) and (X_val, Y_val)
X_train, Y_train = train_data
X_val, Y_val = val_data
println("Training samples: ", nobs(train_data)) # nobs(X_train) or nobs(Y_train)
println("Validation samples: ", nobs(val_data))
# Now you can create DataLoaders for each set
train_loader = DataLoader(train_data, batchsize=64, shuffle=true)
val_loader = DataLoader(val_data, batchsize=64, shuffle=false) # No shuffle for validation
The at argument specifies the proportion of data that goes into the first part of the split. You can also pass a tuple of proportions to at for multiple splits, for example at=(0.7, 0.15) to get 70% training, 15% validation, and the remaining 15% for testing.
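For instance, a three-way split could look like the following sketch (the variable names are illustrative):
train_set, val_set, test_set = splitobs(full_dataset, at=(0.7, 0.15), shuffle=true)
println(numobs(train_set), " / ", numobs(val_set), " / ", numobs(test_set))  # roughly 700 / 150 / 150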
mapobs
Often, you'll need to apply transformations to your data before it's batched and fed to the model. This could be normalization, data augmentation (like flipping or rotating images), or converting data types. MLUtils.jl's mapobs function allows you to apply a function to each observation lazily.
using MLUtils
# Imagine raw_images is an array of 100 small grayscale images (e.g., 28x28)
raw_images = [rand(Float32, 28, 28) for _ in 1:100]
labels = rand(0:9, 100) # Labels for digits 0-9
# A simple augmentation: flip an image horizontally
function augment_image(image::Matrix{Float32})
# Add a channel dimension (Flux.jl CNNs expect WHC or WHCN)
img_with_channel = reshape(image, size(image)..., 1)
if rand() > 0.5
return reverse(img_with_channel, dims=2) # Flip horizontally
else
return img_with_channel
end
end
# Apply the augmentation to each image in raw_images
# mapobs will create a new data source where each observation is the result of augment_image
augmented_dataset = mapobs(augment_image, raw_images)
# We can then create a DataLoader for the (augmented_images, labels)
# Note: For pairing augmented features with labels, you often pass a tuple to mapobs
# or apply mapobs only to the features part of your (features, labels) tuple.
# If labels also need processing or should be passed through:
full_dataset_to_transform = (raw_images, labels) # Assume raw_images is a vector of matrices
# Transformation function for (image, label) pairs
# Here, only image is transformed, label is passed through
function transform_observation(obs_tuple)
image, label = obs_tuple
augmented_img = augment_image(image) # Your augmentation from before
# Flux expects features in WHCN format (Width, Height, Channels, Batch)
# DataLoader will collate the N (batch) dimension.
return (augmented_img, label)
end
# Correctly applying mapobs to a dataset tuple
processed_dataset = mapobs(transform_observation, full_dataset_to_transform)
# DataLoader will now get augmented images
# collate=true stacks the per-observation arrays along a new last dimension, giving WHCN batches
loader = DataLoader(processed_dataset, batchsize=10, shuffle=true, collate=true)
for (img_batch, label_batch) in loader
# img_batch will be (28, 28, 1, 10)
# label_batch will be a vector of 10 labels
# println("Augmented image batch size: ", size(img_batch))
# println("Label batch size: ", size(label_batch))
end
mapobs is powerful because the transformations are applied only as data is requested, which can be memory efficient, especially for complex augmentations.
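To make that laziness concrete, here is a small sketch: wrapping a container with mapobs is essentially free, and the function only runs when an observation is actually fetched. The normalization function used here is just an illustrative choice.
using Statistics  # for mean and std
normalized = mapobs(img -> (img .- mean(img)) ./ std(img), raw_images)  # no work happens yet
first_img = getobs(normalized, 1)  # the transformation runs here, for this one observation only
println(size(first_img))  # (28, 28)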
The following diagram illustrates how these MLUtils.jl components fit into a typical data processing pipeline for deep learning:
Data processing pipeline using MLUtils.jl. Raw data is typically split, then transformations can be applied per observation using mapobs. DataLoader (or eachobs with a batchsize) then creates shuffled mini-batches, which are finally prepared (e.g., moved to the GPU) and fed to the neural network model. Shuffling can occur at the splitobs stage for the initial split, or within DataLoader for each epoch.
When training deep learning models, especially larger ones, using GPUs is common for accelerating computations. MLUtils.jl itself does not handle GPU memory transfers; its role is to provide the data batches. You would then typically move these batches to the GPU within your training loop using functions from Flux.jl (like Flux.gpu or |> gpu) or CUDA.jl (like CUDA.cu).
using Flux # For gpu function and model definition
using CUDA # For cu function if preferred, and to check GPU availability
# Assume:
# model = Chain(...)  # Your Flux model, still on the CPU at this point
# loader = DataLoader(train_data, batchsize=64, shuffle=true)
if CUDA.functional() # Check if a GPU is available and functional
model = model |> gpu
println("Training on GPU.")
for (x_batch, y_batch) in loader
x_batch_gpu = x_batch |> gpu
y_batch_gpu = y_batch |> gpu
# Training step with GPU data
# grads = Flux.gradient(m -> loss_function(m(x_batch_gpu), y_batch_gpu), model)
# Flux.update!(opt_state, model, grads[1])  # opt_state created beforehand via Flux.setup
end
else
println("CUDA GPU not available or not functional. Training on CPU.")
for (x_batch, y_batch) in loader
# Training step with CPU data
# ...
end
end
This pattern ensures that only the current batch of data resides on the GPU, which is important for managing limited GPU memory.
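Putting the pieces together, the following is a minimal, self-contained CPU sketch of the whole workflow (split, load, train, validate). It assumes Flux's explicit-gradient API (Flux 0.14 or later); the layer sizes, optimizer, and hyperparameters are illustrative choices, not recommendations.
using Flux, MLUtils
# Synthetic regression data: 20 features, 1000 samples, 1 target value each
X = rand(Float32, 20, 1000)
Y = rand(Float32, 1, 1000)
train_data, val_data = splitobs((X, Y), at=0.8, shuffle=true)
train_loader = DataLoader(train_data, batchsize=64, shuffle=true)
val_loader = DataLoader(val_data, batchsize=64, shuffle=false)
model = Chain(Dense(20 => 32, relu), Dense(32 => 1))
opt_state = Flux.setup(Adam(1f-3), model)
for epoch in 1:5
    for (x, y) in train_loader
        grads = Flux.gradient(m -> Flux.Losses.mse(m(x), y), model)
        Flux.update!(opt_state, model, grads[1])
    end
    # Simple validation pass (no gradient computation needed)
    val_loss = sum(Flux.Losses.mse(model(x), y) for (x, y) in val_loader) / length(val_loader)
    println("Epoch $epoch, mean validation batch loss: $val_loss")
end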
By providing these versatile tools for data iteration, batching, shuffling, splitting, and transformation, MLUtils.jl significantly simplifies the data preparation aspect of your deep learning projects in Julia. With your data efficiently managed, you can focus more on designing and training the neural network architectures themselves, which we will continue to explore.