Effectively managing and feeding data to your neural networks is fundamental for successful model training, especially when working with large datasets. As you begin to construct more complex architectures like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), the mechanisms for iterating, batching, and transforming your data become increasingly important. Julia's MLUtils.jl package provides a suite of tools designed to streamline these data handling tasks, making your deep learning workflows more efficient and manageable.
This section will guide you through using MLUtils.jl to prepare your datasets for training. We'll cover how to iterate over observations, create mini-batches, shuffle data for better generalization, split datasets, and apply on-the-fly transformations. These utilities are designed to integrate smoothly with Flux.jl and your GPU-accelerated training pipelines.
Training deep learning models typically involves processing data in chunks, known as mini-batches, rather than one sample at a time or the entire dataset at once. This approach offers a balance between computational efficiency and the stability of the gradient updates during optimization. MLUtils.jl simplifies the creation and management of these mini-batches, along with other common data preparation steps.
The main functionalities provided by MLUtils.jl that are particularly useful for deep learning include iterating over individual observations (eachobs), building shuffled mini-batches (DataLoader, or eachobs with a batchsize), splitting datasets into training and validation portions (splitobs), and applying lazy per-observation transformations (mapobs). Let's look at how these features are implemented.
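Before running the examples below, make sure the package is installed in your active environment; this is a standard Pkg step and is shown only for completeness.
using Pkg
Pkg.add("MLUtils")  # one-time installation into the active environment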
eachobs
At the heart of MLUtils.jl is the concept of an "observation." An observation is typically a single data point, often consisting of features and a corresponding label. The eachobs function provides a general way to iterate over observations in your dataset.
If your data is stored as a tuple of arrays (e.g., (features, labels)), eachobs will yield tuples where each element is a slice or view of the original arrays corresponding to a single observation.
using MLUtils
# Sample data: 10 features, 5 samples
X = rand(Float32, 10, 5)
# Corresponding labels for the 5 samples
Y = [1, 0, 1, 1, 0]
# Iterate over each observation
for (x_obs, y_obs) in eachobs((X, Y))
# x_obs will be a vector of 10 features for one sample
# y_obs will be a single label
println("Features: ", size(x_obs), ", Label: ", y_obs)
end
This simple iteration is the foundation upon which more complex operations like batching are built. MLUtils.jl also defines numobs to get the number of observations in a data container and getobs to retrieve a specific observation by index.
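As a quick illustration of these two helpers on the (X, Y) tuple defined above (a small sketch; the printed values depend on the random data):
println("Number of observations: ", numobs((X, Y)))  # 5, taken from the last dimension / length
x3, y3 = getobs((X, Y), 3)  # third observation: a 10-element feature vector and its label
println("Third observation: ", size(x3), ", label: ", y3)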
DataLoader and eachobs
For training neural networks, you'll almost always want to process data in mini-batches. MLUtils.jl offers two closely related ways to achieve this: the DataLoader type, and the eachobs function called with a batchsize keyword argument.
DataLoader: This constructs an iterator that yields batches of data. It's flexible and allows you to specify the batch size, whether to shuffle the data, and how to handle the last batch if it's smaller than the specified size.
using MLUtils
# Sample data: 2 features, 100 samples
features = rand(Float32, 2, 100)
labels = rand(Int, 100) # 100 integer labels
dataset = (features, labels)
# Create a DataLoader
# batchsize: Number of samples per batch
# shuffle: If true, shuffles the data at the beginning of each epoch (iteration over the full dataset)
# partial: If true (the default), the last, smaller batch is included. If false, it's dropped, as in this example.
loader = DataLoader(dataset, batchsize=32, shuffle=true, partial=false)
# Iterate through batches
for epoch in 1:3 # Example: 3 epochs
println("Epoch: ", epoch)
for (x_batch, y_batch) in loader
# x_batch will be a 2x32 matrix (features for 32 samples)
# y_batch will be a vector of 32 labels
# In this example, with 100 samples and batchsize 32,
# and partial=false, we'd get 3 batches of 32. 100 = 3*32 + 4. The last 4 are dropped.
println("Batch sizes: ", size(x_batch), ", ", size(y_batch))
# Here you would typically perform a training step:
# 1. Move batch to GPU (if applicable)
# 2. Forward pass: model(x_batch)
# 3. Calculate loss: loss_function(predictions, y_batch)
# 4. Backward pass (calculate gradients)
# 5. Update model parameters
end
end
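Because DataLoader is an ordinary Julia iterator, you can also inspect it with Base functions before training. A small sketch using the loader defined above:
println("Batches per epoch: ", length(loader))  # 3 here, since partial=false drops the last 4 samples
x_first, y_first = first(loader)  # materialize a single batch to check shapes
println("First batch: ", size(x_first), ", ", size(y_first))  # (2, 32) and (32,)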
eachobs with batchsize: For simpler use cases, eachobs accepts the same keyword arguments as DataLoader, so passing batchsize (along with options like shuffle) makes it yield mini-batches directly, without constructing a DataLoader object.
using MLUtils
features = rand(Float32, 2, 100)
labels = rand(Int, 100)
dataset = (features, labels)
# Using eachobs with a batchsize to iterate over mini-batches directly
for (x_batch, y_batch) in eachobs(dataset, batchsize=16, shuffle=true)
# x_batch will be a 2x16 matrix
# y_batch will be a vector of 16 labels
# Process the batch...
# println("Batch sizes using eachobs: ", size(x_batch), ", ", size(y_batch))
end
Using shuffle=true is generally recommended for training data. It helps prevent the model from learning patterns based on the order of data presentation and can lead to better generalization. For validation or test data, shuffling is usually turned off (shuffle=false) to ensure consistent evaluation.
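If you need the shuffling to be reproducible (for debugging or for comparing runs), DataLoader accepts an rng keyword argument; a small sketch, assuming a reasonably recent MLUtils.jl version:
using Random
rng = MersenneTwister(42)  # fixed seed, so the shuffle order is reproducible across runs
repro_loader = DataLoader(dataset, batchsize=16, shuffle=true, rng=rng)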
splitobs
A common requirement is to divide your dataset into training, validation, and sometimes test sets. MLUtils.jl provides the splitobs function for this.
using MLUtils
# 1000 samples, 20 features each
X_all = rand(Float32, 20, 1000)
Y_all = rand(Float32, 1, 1000) # Regression task, 1 output
full_dataset = (X_all, Y_all)
# Split into training (80%) and validation (20%)
# shuffle=true is good practice before splitting
train_data, val_data = splitobs(full_dataset, at=0.8, shuffle=true)
# train_data and val_data are themselves tuples: (X_train, Y_train) and (X_val, Y_val)
X_train, Y_train = train_data
X_val, Y_val = val_data
println("Training samples: ", nobs(train_data)) # nobs(X_train) or nobs(Y_train)
println("Validation samples: ", nobs(val_data))
# Now you can create DataLoaders for each set
train_loader = DataLoader(train_data, batchsize=64, shuffle=true)
val_loader = DataLoader(val_data, batchsize=64, shuffle=false) # No shuffle for validation
The at argument specifies the proportion of data that goes into the first part of the split. You can also pass a tuple of proportions to at for multiple splits, for example at=(0.7, 0.15) to get 70% training, 15% validation, and the remaining 15% for testing.
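For instance, a three-way split could look like the following sketch (the variable names are illustrative):
train_set, val_set, test_set = splitobs(full_dataset, at=(0.7, 0.15), shuffle=true)
println(numobs(train_set), " / ", numobs(val_set), " / ", numobs(test_set))  # roughly 700 / 150 / 150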
mapobs
Often, you'll need to apply transformations to your data before it's batched and fed to the model. This could be normalization, data augmentation (like flipping or rotating images), or converting data types. MLUtils.jl's mapobs function allows you to apply a function to each observation lazily.
using MLUtils
# Imagine raw_images is an array of 100 small grayscale images (e.g., 28x28)
raw_images = [rand(Float32, 28, 28) for _ in 1:100]
labels = rand(0:9, 100) # Labels for digits 0-9
# A simple augmentation: flip an image horizontally
function augment_image(image::Matrix{Float32})
# Add a channel dimension (Flux.jl CNNs expect WHC or WHCN)
img_with_channel = reshape(image, size(image)..., 1)
if rand() > 0.5
return reverse(img_with_channel, dims=2) # Flip horizontally
else
return img_with_channel
end
end
# Apply the augmentation to each image in raw_images
# mapobs will create a new data source where each observation is the result of augment_image
augmented_dataset = mapobs(augment_image, raw_images)
# We can then create a DataLoader for the (augmented_images, labels)
# Note: For pairing augmented features with labels, you often pass a tuple to mapobs
# or apply mapobs only to the features part of your (features, labels) tuple.
# If labels also need processing or should be passed through:
full_dataset_to_transform = (raw_images, labels) # Assume raw_images is a vector of matrices
# Transformation function for (image, label) pairs
# Here, only image is transformed, label is passed through
function transform_observation(obs_tuple)
image, label = obs_tuple
augmented_img = augment_image(image) # Your augmentation from before
# Flux expects features in WHCN format (Width, Height, Channels, Batch)
# DataLoader will collate the N (batch) dimension.
return (augmented_img, label)
end
# Correctly applying mapobs to a dataset tuple
processed_dataset = mapobs(transform_observation, full_dataset_to_transform)
# DataLoader will now get augmented images
# collate=true stacks the per-observation arrays along a new last dimension, giving WHCN batches
loader = DataLoader(processed_dataset, batchsize=10, shuffle=true, collate=true)
for (img_batch, label_batch) in loader
# img_batch will be (28, 28, 1, 10)
# label_batch will be a vector of 10 labels
# println("Augmented image batch size: ", size(img_batch))
# println("Label batch size: ", size(label_batch))
end
mapobs is powerful because the transformations are applied only as data is requested, which can be memory efficient, especially for complex augmentations.
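To make that laziness concrete, here is a small sketch: wrapping a container with mapobs is essentially free, and the function only runs when an observation is actually fetched. The normalization function used here is just an illustrative choice.
using Statistics  # for mean and std
normalized = mapobs(img -> (img .- mean(img)) ./ std(img), raw_images)  # no work happens yet
first_img = getobs(normalized, 1)  # the transformation runs here, for this one observation only
println(size(first_img))  # (28, 28)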
The following diagram illustrates how these MLUtils.jl components fit into a typical data processing pipeline for deep learning:
Data processing pipeline using MLUtils.jl. Raw data is typically split, then transformations can be applied per observation using mapobs. DataLoader (or eachobs with a batchsize) then creates shuffled mini-batches, which are finally prepared (e.g., moved to the GPU) and fed to the neural network model. Shuffling can occur at the splitobs stage for the initial split, or within DataLoader for each epoch.
When training deep learning models, especially larger ones, using GPUs is common for accelerating computations. MLUtils.jl itself does not handle GPU memory transfers; its role is to provide the data batches. You would then typically move these batches to the GPU within your training loop using functions from Flux.jl (like Flux.gpu or |> gpu) or CUDA.jl (like CUDA.cu).
using Flux # For gpu function and model definition
using CUDA # For cu function if preferred, and to check GPU availability
# Assume:
# model = Chain(...)  # Your Flux model, still on the CPU at this point
# loader = DataLoader(train_data, batchsize=64, shuffle=true)
if CUDA.functional() # Check if a GPU is available and functional
model = model |> gpu
println("Training on GPU.")
for (x_batch, y_batch) in loader
x_batch_gpu = x_batch |> gpu
y_batch_gpu = y_batch |> gpu
# Training step with GPU data
# grads = Flux.gradient(m -> loss_function(m(x_batch_gpu), y_batch_gpu), model)
# Flux.update!(opt_state, model, grads[1])  # opt_state created beforehand via Flux.setup
end
else
println("CUDA GPU not available or not functional. Training on CPU.")
for (x_batch, y_batch) in loader
# Training step with CPU data
# ...
end
end
This pattern ensures that only the current batch of data resides on the GPU, which is important for managing limited GPU memory.
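Putting the pieces together, the following is a minimal, self-contained CPU sketch of the whole workflow (split, load, train, validate). It assumes Flux's explicit-gradient API (Flux 0.14 or later); the layer sizes, optimizer, and hyperparameters are illustrative choices, not recommendations.
using Flux, MLUtils
# Synthetic regression data: 20 features, 1000 samples, 1 target value each
X = rand(Float32, 20, 1000)
Y = rand(Float32, 1, 1000)
train_data, val_data = splitobs((X, Y), at=0.8, shuffle=true)
train_loader = DataLoader(train_data, batchsize=64, shuffle=true)
val_loader = DataLoader(val_data, batchsize=64, shuffle=false)
model = Chain(Dense(20 => 32, relu), Dense(32 => 1))
opt_state = Flux.setup(Adam(1f-3), model)
for epoch in 1:5
    for (x, y) in train_loader
        grads = Flux.gradient(m -> Flux.Losses.mse(m(x), y), model)
        Flux.update!(opt_state, model, grads[1])
    end
    # Simple validation pass (no gradient computation needed)
    val_loss = sum(Flux.Losses.mse(model(x), y) for (x, y) in val_loader) / length(val_loader)
    println("Epoch $epoch, mean validation batch loss: $val_loss")
end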
By providing these versatile tools for data iteration, batching, shuffling, splitting, and transformation, MLUtils.jl significantly simplifies the data preparation aspect of your deep learning projects in Julia. With your data efficiently managed, you can focus more on designing and training the neural network architectures themselves, which we will continue to explore.