Before you can construct sophisticated neural network architectures like Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), or Recurrent Neural Networks (RNNs), you must first ensure your data is in a suitable state. The old adage "garbage in, garbage out" holds particularly true in deep learning. High-quality, well-prepared data is a prerequisite for training effective models. This section covers essential techniques for data preparation and preprocessing using Julia's powerful data manipulation and scientific computing ecosystem. We'll look at cleaning data, transforming features, and structuring datasets so they're ready for Flux.jl.
Data preparation involves taking your raw dataset and transforming it into a clean, structured format that neural networks can digest. This typically includes several steps: handling missing values, scaling numerical features, encoding categorical variables, and arranging everything into numerical arrays or DataFrames. Julia, with packages like DataFrames.jl, CSV.jl, Statistics.jl, and MLUtils.jl, provides an efficient and expressive environment for these tasks. Its performance characteristics are particularly beneficial when dealing with large datasets common in deep learning.
Let's look at a typical workflow for preparing data.
A general pipeline for data preparation in a deep learning project.
Neural networks are sensitive to the way input data is presented. Here are some common preprocessing operations you'll encounter.
Real-world datasets often come with missing values. How you handle them can significantly impact model performance.
Identification: In DataFrames.jl, missing values are represented by missing. You can identify them with functions like ismissing() or get a per-column summary using describe(df, :nmissing).
using DataFrames, Statistics
# Sample DataFrame with missing values
data = DataFrame(A = [1, 2, missing, 4, 5], B = [missing, 0.2, 0.3, 0.4, 0.5])
println(describe(data, :nmissing))
Strategies:
Deletion: Remove rows containing missing values (using dropmissing(df)). If a column has too many missing values and isn't critical, you might drop the column.
Imputation: Fill in missing values with a statistic such as the mean or median, or with a fixed constant, as in the code below.
# Impute missing values in column A with the mean
mean_A = mean(skipmissing(data.A))
data.A = coalesce.(data.A, mean_A)
# Impute missing values in column B with a specific value (e.g., 0.0)
replace!(data.B, missing => 0.0)
println(data)
The coalesce function is handy here: it returns its first non-missing argument.
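The deletion strategy looks like this on a fresh copy of the data (a small sketch; data2 is a hypothetical example, not part of the dataset above):
# Deletion: keep only the rows with no missing values
data2 = DataFrame(A = [1, 2, missing, 4], B = [missing, 0.2, 0.3, 0.4])
complete_rows = dropmissing(data2)
println(complete_rows)  # 2x2 DataFrame: only rows 2 and 4 survive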
Neural networks, especially those trained with gradient-based optimizers, often perform better and converge faster when input features are on a similar scale. Large differences in feature ranges can lead to an unstable training process.
Min-Max Scaling (Normalization): Rescales features to a fixed range, typically [0, 1] or [-1, 1]. The formula for scaling to [0, 1] is:
$X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$, where $X_{\min}$ and $X_{\max}$ are the minimum and maximum values of the feature, respectively.
# Sample numeric feature
feature = [10.0, 20.0, 15.0, 30.0, 25.0]
function min_max_scale(X)
X_min = minimum(X)
X_max = maximum(X)
return (X .- X_min) ./ (X_max - X_min)
end
scaled_feature = min_max_scale(feature)
# scaled_feature will be [0.0, 0.5, 0.25, 1.0, 0.75]
println("Original: ", feature)
println("Min-Max Scaled: ", scaled_feature)
Standardization (Z-score Normalization): Rescales features to have a mean (μ) of 0 and a standard deviation (σ) of 1. The formula is:
$X_{\text{standardized}} = \frac{X - \mu}{\sigma}$. This method is less affected by outliers than min-max scaling.
using Statistics
# Sample numeric feature
feature = [10.0, 20.0, 15.0, 30.0, 25.0]
function standardize(X)
mu = mean(X)
sigma = std(X)
return (X .- mu) ./ sigma
end
standardized_feature = standardize(feature)
println("Standardized: ", standardized_feature)
# Check mean and std (should be close to 0 and 1)
println("Mean of standardized: ", mean(standardized_feature))
println("Std of standardized: ", std(standardized_feature))
The choice between min-max scaling and standardization depends on the data and the neural network architecture. For images, pixel values are often scaled to [0, 1]. For other types of data, standardization is a common default.
Comparison of a feature's values before and after Min-Max scaling. The scaled values are mapped to the [0,1] range on the secondary y-axis.
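As a concrete illustration of the image case mentioned above (a minimal sketch; raw_pixels is an arbitrary synthetic 8-bit image, not data from this section):
# Scale 8-bit pixel values (0-255) into the [0, 1] range as Float32
raw_pixels = rand(UInt8, 28, 28)
pixels01 = Float32.(raw_pixels) ./ 255f0
println(extrema(pixels01))  # both values lie within [0, 1]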
Neural networks require numerical inputs. Categorical features (like "color": "red", "blue", "green" or "city": "New York", "London") must be converted into a numerical format.
Integer Encoding (Label Encoding): Assigns a unique integer to each category. For example, "red" -> 0, "blue" -> 1, "green" -> 2. This is simple but can imply an ordinal relationship (e.g., green > blue > red) where none exists. It's suitable for ordinal data (e.g., "low", "medium", "high").
categories = ["cat", "dog", "bird", "cat", "dog"]
unique_cats = unique(categories)
cat_to_int = Dict(c => i for (i, c) in enumerate(unique_cats))
encoded_integers = [cat_to_int[c] for c in categories]
# Example: if unique_cats = ["cat", "dog", "bird"]
# cat_to_int = Dict("cat"=>1, "dog"=>2, "bird"=>3)
# encoded_integers = [1, 2, 3, 1, 2]
println("Integer Encoded: ", encoded_integers)
One-Hot Encoding: Creates a new binary (0 or 1) feature for each unique category. For a given sample, the feature corresponding to its category is 1, and all others are 0.
Example:
"red" -> [1, 0, 0]
"blue" -> [0, 1, 0]
"green" -> [0, 0, 1]
This avoids imposing an artificial order but can lead to high-dimensional feature spaces if there are many unique categories. Flux.jl provides Flux.onehot and Flux.onehotbatch for this.
using Flux
# Using the same categories as above
# Assume unique_cats = ["cat", "dog", "bird"]
# Flux.onehotbatch(data, labels):
# 'data' is the vector of categories, 'labels' is the list of unique categories.
# The labels are used in the order you pass them; onehotbatch does not reorder them.
one_hot_encoded = Flux.onehotbatch(categories, unique_cats)
# This produces a Flux.OneHotMatrix
# To see it as a regular matrix:
# Matrix(one_hot_encoded)
# For categories = ["cat", "dog", "bird", "cat", "dog"]
# and unique_cats = ["cat", "dog", "bird"], the result is:
# 1 0 0 1 0 (cat)
# 0 1 0 0 1 (dog)
# 0 0 1 0 0 (bird)
# Let's define the labels explicitly for clarity
explicit_labels = ["cat", "dog", "bird"] # Order matters for interpretation
one_hot_encoded_explicit = Flux.onehotbatch(categories, explicit_labels)
println("One-Hot Encoded (Matrix representation):")
display(Matrix(one_hot_encoded_explicit)) # display prints matrices nicely
# Output with explicit_labels = ["cat", "dog", "bird"]:
# 1 0 0 1 0 (cat)
# 0 1 0 0 1 (dog)
# 0 0 1 0 0 (bird)
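Flux.onehot, mentioned above, does the same for a single value (a brief sketch reusing explicit_labels):
# Encode one value against the same label list
v = Flux.onehot("dog", explicit_labels)  # a OneHotVector
println(Int.(v))                         # [0, 1, 0]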
For features with very high cardinality (many unique values), such as user IDs or words in a vocabulary, neither integer nor one-hot encoding is ideal. In such cases, embedding layers are typically used, which we'll discuss later in this chapter.
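As a brief preview of that idea (a minimal sketch; the vocabulary and embedding sizes are illustrative assumptions), an embedding layer maps integer category indices to dense, learned vectors:
using Flux
vocab_size = 10_000                     # e.g., number of distinct user IDs or words
embedding_dim = 16                      # size of each learned vector
emb = Flux.Embedding(vocab_size => embedding_dim)
emb([3, 42, 7])                         # 16x3 matrix: one column per index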
Once features are cleaned and transformed, the entire dataset needs to be structured into a format that Flux.jl models can consume. This usually means converting your data into Julia Arrays of a specific numerical type, often Float32, for efficiency with deep learning libraries and GPUs.
Input Shapes: Different neural network layers expect inputs with specific dimensions. Dense layers take a (features, batch_size) matrix, while convolutional layers take a 4D array of shape (width, height, channels, batch_size), commonly abbreviated as WHCN.
Reshaping Data: You'll often use reshape to get your data into the correct dimensions. For example, if you have a collection of flattened image vectors, you'd reshape them into the WHCN format for a CNN.
# Suppose you have 100 images, each 28x28 grayscale
# And they are loaded as a vector of 100 matrices (28x28)
# images_vector = [rand(Float32, 28, 28) for _ in 1:100];
# For Flux, it's better to have a single 4D array: (W, H, C, N)
# W=28, H=28, C=1 (grayscale), N=100
# Example:
num_samples = 100
img_width = 28
img_height = 28
channels = 1 # Grayscale
# Flattened data (e.g., from a CSV where each row is an image)
# Each row has 28*28 = 784 pixels. 100 rows.
# This would be a 784x100 matrix if features are rows
# Or 100x784 if samples are rows. Let's assume samples are rows.
flat_data_as_rows = rand(Float32, num_samples, img_width * img_height)
# To use with Flux Dense layer, we want features x samples: (784, 100)
data_for_dense = permutedims(flat_data_as_rows, (2,1)) # Becomes 784x100
# To use with Flux Conv layer, we want WHCN: (28, 28, 1, 100)
# First, ensure data is (features, samples) i.e., 784x100
# Then reshape each column (sample) into WxHxC
data_for_cnn = reshape(data_for_dense, img_width, img_height, channels, num_samples)
println("Shape for Dense layer: ", size(data_for_dense))
println("Shape for CNN layer: ", size(data_for_cnn))
Data Type Conversion: Ensure your data is of type Float32 (or Float64 if precision is critical, but Float32 is standard for deep learning).
# If data_matrix is Array{Float64, 2}
# data_matrix_f32 = Float32.(data_matrix)
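For instance, a small runnable sketch with an arbitrary example matrix:
data_matrix = rand(Float64, 3, 4)       # pretend this came from your pipeline
data_matrix_f32 = Float32.(data_matrix)
println(eltype(data_matrix_f32))        # Float32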
A critical step before training is to split your dataset into at least two, preferably three, subsets: a training set used to fit the model's parameters, a validation set used to tune hyperparameters and monitor for overfitting, and a test set held out for the final, unbiased performance estimate. A common split is 60-80% for training, 10-20% for validation, and 10-20% for testing. MLUtils.jl provides splitobs for this.
using Flux, MLUtils
# Assuming `features` is your input data (e.g., a matrix)
# and `labels` is your target data (e.g., a vector or matrix)
# features = rand(Float32, 10, 1000) # 10 features, 1000 samples
# labels = rand(Float32, 1, 1000) # 1 output, 1000 samples
# For example:
num_total_samples = 1000
X_data = rand(Float32, 5, num_total_samples) # 5 features
Y_data = Flux.onehotbatch(rand(1:3, num_total_samples), 1:3) # 3 classes
# Split into training (70%) and test (30%)
(X_train, Y_train), (X_test, Y_test) = splitobs((X_data, Y_data), at=0.7, shuffle=true)
# Further split test set into validation and test (e.g., 15% val, 15% test from original)
# Original test set is 30% of total. We want to split it 50/50.
# So, 0.5 of the current X_test goes to validation.
(X_val, Y_val), (X_test_final, Y_test_final) = splitobs((X_test, Y_test), at=0.5, shuffle=false) # No shuffle needed usually here
println("Training samples: ", size(X_train, 2))
println("Validation samples: ", size(X_val, 2))
println("Test samples: ", size(X_test_final, 2))
When splitting, especially for classification tasks with imbalanced classes, it's good practice to use stratified sampling. This ensures that the proportion of each class is roughly the same across the training, validation, and test sets. MLUtils.splitobs can often handle this if your labels are provided appropriately, or you might need to use more specialized tools from MLJ.jl or implement custom logic for stratification.
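One way to implement such custom logic (a minimal sketch; the stratified_split function and the labels below are illustrative assumptions, not an MLUtils or MLJ API):
using Random
# Split indices so that each class keeps roughly the same proportion in both subsets
function stratified_split(y; at=0.7, rng=Random.default_rng())
    train_idx, test_idx = Int[], Int[]
    for c in unique(y)
        idx = shuffle(rng, findall(==(c), y))   # shuffled indices of class c
        n_train = round(Int, at * length(idx))
        append!(train_idx, idx[1:n_train])
        append!(test_idx, idx[n_train+1:end])
    end
    return shuffle(rng, train_idx), shuffle(rng, test_idx)
end
y_labels = rand(1:3, 1000)                      # hypothetical class labels
train_idx, test_idx = stratified_split(y_labels; at=0.7)
# Use the indices to slice your data column-wise, e.g. X_data[:, train_idx]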
Let's tie some of these concepts together with a small dataset. Imagine we have data about fruits:
| Weight (g) | Color | Texture | IsSweet |
|---|---|---|---|
| 150 | Red | Smooth | 1 |
| 120 | Green | Smooth | 0 |
| 160 | Red | Bumpy | 1 |
| missing | Yellow | Smooth | 1 |
Our goal is to predict IsSweet (1 for sweet, 0 for not sweet).
Load and Represent Data (using DataFrame for clarity)
using DataFrames, Statistics, Flux
df = DataFrame(
Weight = [150.0, 120.0, 160.0, missing],
Color = ["Red", "Green", "Red", "Yellow"],
Texture = ["Smooth", "Smooth", "Bumpy", "Smooth"],
IsSweet = [1, 0, 1, 1]
)
Handle Missing Values (Impute Weight)
mean_weight = mean(skipmissing(df.Weight))
df.Weight = coalesce.(df.Weight, mean_weight)
println("DataFrame after handling missing values:")
display(df)
Feature Scaling (Min-Max scale Weight)
# Using the function defined earlier
function min_max_scale_col(X_col)
X_min = minimum(X_col)
X_max = maximum(X_col)
return (X_col .- X_min) ./ (X_max - X_min)
end
df.Weight_scaled = min_max_scale_col(df.Weight)
Encode Categorical Features (Color, Texture) using One-Hot Encoding
unique_colors = unique(df.Color)
color_onehot = Flux.onehotbatch(df.Color, unique_colors)
unique_textures = unique(df.Texture)
texture_onehot = Flux.onehotbatch(df.Texture, unique_textures)
# Convert OneHotArrays to regular matrices for combining
color_matrix = Float32.(Matrix(color_onehot))
texture_matrix = Float32.(Matrix(texture_onehot))
Combine Features into a Single Matrix (Features x Samples)
# Scaled weight (1 feature)
weight_feature_matrix = reshape(Float32.(df.Weight_scaled), 1, nrow(df))
# Combine all feature matrices
# Features are rows, samples are columns
X_matrix = vcat(weight_feature_matrix, color_matrix, texture_matrix)
println("\nFinal feature matrix X (Float32):")
display(X_matrix)
println("Size of X: ", size(X_matrix))
# Prepare labels Y (1 x Samples)
Y_matrix = reshape(Float32.(df.IsSweet), 1, nrow(df))
println("\nLabel matrix Y (Float32):")
display(Y_matrix)
println("Size of Y: ", size(Y_matrix))
The X_matrix (features) and Y_matrix (labels) are now in a format suitable for input to a Flux model: Float32 arrays, with features arranged row-wise and samples column-wise in X_matrix.
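To confirm the shapes line up, here is a minimal sketch passing X_matrix through a small Flux model (the architecture is an illustrative assumption, not prescribed by this section):
# A tiny model whose input size matches the 6 feature rows of X_matrix
model = Chain(Dense(6 => 8, relu), Dense(8 => 1, sigmoid))
y_hat = model(X_matrix)
println(size(y_hat))  # (1, 4): one prediction per sample column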
Thorough data preparation and preprocessing are foundational. These steps ensure your model receives data in the most informative and computationally friendly format. With your data cleaned, transformed, and properly structured, you're now ready to consider how to efficiently load and iterate over it during training, which is the topic of the next section on MLUtils.jl.