Before you can construct sophisticated neural network architectures like Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), or Recurrent Neural Networks (RNNs), you must first ensure your data is in a suitable state. The old adage "garbage in, garbage out" holds particularly true in deep learning. High-quality, well-prepared data is a prerequisite for training effective models. This section covers essential techniques for data preparation and preprocessing using Julia's powerful data manipulation and scientific computing ecosystem. We'll look at cleaning data, transforming features, and structuring datasets so they're ready for Flux.jl.
Data preparation involves taking your raw dataset and transforming it into a clean, structured format that neural networks can digest. This typically includes several steps: handling missing values, scaling numerical features, encoding categorical variables, and arranging everything into numerical arrays or DataFrames. Julia, with packages like DataFrames.jl, CSV.jl, Statistics.jl, and MLUtils.jl, provides an efficient and expressive environment for these tasks. Its performance characteristics are particularly beneficial when dealing with large datasets common in deep learning.
Let's look at a typical workflow for preparing data.
A general pipeline for data preparation in a deep learning project.
Neural networks are sensitive to the way input data is presented. Here are some common preprocessing operations you'll encounter.
Real-world datasets often come with missing values. How you handle them can significantly impact model performance.
Identification: In DataFrames.jl, missing values are represented by missing. You can identify them with functions like ismissing() or get a per-column summary using describe(df, :nmissing).
using DataFrames, Statistics
# Sample DataFrame with missing values
data = DataFrame(A = [1, 2, missing, 4, 5], B = [missing, 0.2, 0.3, 0.4, 0.5])
println(describe(data, :nmissing))
Strategies:
Deletion: Remove rows containing missing values (using dropmissing(df)). If a column has too many missing values and isn't critical, you might drop the column.
Imputation: Fill in missing values with a statistic such as the mean or median, or with a fixed constant, as in the code below.
# Impute missing values in column A with the mean
mean_A = mean(skipmissing(data.A))
data.A = coalesce.(data.A, mean_A)
# Impute missing values in column B with a specific value (e.g., 0.0)
replace!(data.B, missing => 0.0)
println(data)
The coalesce function is handy here: it returns its first non-missing argument.
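The deletion strategy looks like this on a fresh copy of the data (a small sketch; data2 is a hypothetical example, not part of the dataset above):
# Deletion: keep only the rows with no missing values
data2 = DataFrame(A = [1, 2, missing, 4], B = [missing, 0.2, 0.3, 0.4])
complete_rows = dropmissing(data2)
println(complete_rows)  # 2x2 DataFrame: only rows 2 and 4 survive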
Neural networks, especially those trained with gradient-based optimizers, often perform better and converge faster when input features are on a similar scale. Large differences in feature ranges can lead to an unstable training process.
Min-Max Scaling (Normalization): Rescales features to a fixed range, typically [0, 1] or [-1, 1]. The formula for scaling to [0, 1] is:
$X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$, where $X_{\min}$ and $X_{\max}$ are the minimum and maximum values of the feature, respectively.
# Sample numeric feature
feature = [10.0, 20.0, 15.0, 30.0, 25.0]
function min_max_scale(X)
X_min = minimum(X)
X_max = maximum(X)
return (X .- X_min) ./ (X_max - X_min)
end
scaled_feature = min_max_scale(feature)
# scaled_feature will be [0.0, 0.5, 0.25, 1.0, 0.75]
println("Original: ", feature)
println("Min-Max Scaled: ", scaled_feature)
Standardization (Z-score Normalization): Rescales features to have a mean (μ) of 0 and a standard deviation (σ) of 1. The formula is:
$X_{\text{standardized}} = \frac{X - \mu}{\sigma}$. This method is less affected by outliers than min-max scaling.
using Statistics
# Sample numeric feature
feature = [10.0, 20.0, 15.0, 30.0, 25.0]
function standardize(X)
mu = mean(X)
sigma = std(X)
return (X .- mu) ./ sigma
end
standardized_feature = standardize(feature)
println("Standardized: ", standardized_feature)
# Check mean and std (should be close to 0 and 1)
println("Mean of standardized: ", mean(standardized_feature))
println("Std of standardized: ", std(standardized_feature))
The choice between min-max scaling and standardization depends on the data and the neural network architecture. For images, pixel values are often scaled to [0, 1]. For other types of data, standardization is a common default.
Comparison of a feature's values before and after Min-Max scaling. The scaled values are mapped to the [0,1] range on the secondary y-axis.
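As a concrete illustration of the image case mentioned above (a minimal sketch; raw_pixels is an arbitrary synthetic 8-bit image, not data from this section):
# Scale 8-bit pixel values (0-255) into the [0, 1] range as Float32
raw_pixels = rand(UInt8, 28, 28)
pixels01 = Float32.(raw_pixels) ./ 255f0
println(extrema(pixels01))  # both values lie within [0, 1]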
Neural networks require numerical inputs. Categorical features (like "color": "red", "blue", "green" or "city": "New York", "London") must be converted into a numerical format.
Integer Encoding (Label Encoding): Assigns a unique integer to each category. For example, "red" -> 0, "blue" -> 1, "green" -> 2. This is simple but can imply an ordinal relationship (e.g., green > blue > red) where none exists. It's suitable for ordinal data (e.g., "low", "medium", "high").
categories = ["cat", "dog", "bird", "cat", "dog"]
unique_cats = unique(categories)
cat_to_int = Dict(c => i for (i, c) in enumerate(unique_cats))
encoded_integers = [cat_to_int[c] for c in categories]
# Example: if unique_cats = ["cat", "dog", "bird"]
# cat_to_int = Dict("cat"=>1, "dog"=>2, "bird"=>3)
# encoded_integers = [1, 2, 3, 1, 2]
println("Integer Encoded: ", encoded_integers)
One-Hot Encoding: Creates a new binary (0 or 1) feature for each unique category. For a given sample, the feature corresponding to its category is 1, and all others are 0.
Example:
"red" -> [1, 0, 0]
"blue" -> [0, 1, 0]
"green" -> [0, 0, 1]
This avoids imposing an artificial order but can lead to high-dimensional feature spaces if there are many unique categories. Flux.jl provides Flux.onehot and Flux.onehotbatch for this.
using Flux
# Using the same categories as above
# Assume unique_cats = ["cat", "dog", "bird"]
# Flux.onehotbatch(data, labels):
# 'data' is the vector of categories, 'labels' is the list of unique categories.
# The labels are used in the order you pass them; onehotbatch does not reorder them.
one_hot_encoded = Flux.onehotbatch(categories, unique_cats)
# This produces a Flux.OneHotMatrix
# To see it as a regular matrix:
# Matrix(one_hot_encoded)
# For categories = ["cat", "dog", "bird", "cat", "dog"]
# and unique_cats = ["cat", "dog", "bird"], the result is:
# 1 0 0 1 0 (cat)
# 0 1 0 0 1 (dog)
# 0 0 1 0 0 (bird)
# Let's define the labels explicitly for clarity
explicit_labels = ["cat", "dog", "bird"] # Order matters for interpretation
one_hot_encoded_explicit = Flux.onehotbatch(categories, explicit_labels)
println("One-Hot Encoded (Matrix representation):")
display(Matrix(one_hot_encoded_explicit)) # display prints matrices nicely
# Output with explicit_labels = ["cat", "dog", "bird"]:
# 1 0 0 1 0 (cat)
# 0 1 0 0 1 (dog)
# 0 0 1 0 0 (bird)
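Flux.onehot, mentioned above, does the same for a single value (a brief sketch reusing explicit_labels):
# Encode one value against the same label list
v = Flux.onehot("dog", explicit_labels)  # a OneHotVector
println(Int.(v))                         # [0, 1, 0]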
For features with very high cardinality (many unique values), such as user IDs or words in a vocabulary, neither integer nor one-hot encoding is ideal. In such cases, embedding layers are typically used, which we'll discuss later in this chapter.
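As a brief preview of that idea (a minimal sketch; the vocabulary and embedding sizes are illustrative assumptions), an embedding layer maps integer category indices to dense, learned vectors:
using Flux
vocab_size = 10_000                     # e.g., number of distinct user IDs or words
embedding_dim = 16                      # size of each learned vector
emb = Flux.Embedding(vocab_size => embedding_dim)
emb([3, 42, 7])                         # 16x3 matrix: one column per index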
Once features are cleaned and transformed, the entire dataset needs to be structured into a format that Flux.jl models can consume. This usually means converting your data into Julia Arrays of a specific numerical type, often Float32, for efficiency with deep learning libraries and GPUs.
Input Shapes: Different neural network layers expect inputs with specific dimensions. Dense layers take a (features, batch_size) matrix, while convolutional layers take a 4D array of shape (width, height, channels, batch_size), commonly abbreviated as WHCN.
Reshaping Data: You'll often use reshape to get your data into the correct dimensions. For example, if you have a collection of flattened image vectors, you'd reshape them into the WHCN format for a CNN.
# Suppose you have 100 images, each 28x28 grayscale
# And they are loaded as a vector of 100 matrices (28x28)
# images_vector = [rand(Float32, 28, 28) for _ in 1:100];
# For Flux, it's better to have a single 4D array: (W, H, C, N)
# W=28, H=28, C=1 (grayscale), N=100
# Example:
num_samples = 100
img_width = 28
img_height = 28
channels = 1 # Grayscale
# Flattened data (e.g., from a CSV where each row is an image)
# Each row has 28*28 = 784 pixels. 100 rows.
# This would be a 784x100 matrix if features are rows
# Or 100x784 if samples are rows. Let's assume samples are rows.
flat_data_as_rows = rand(Float32, num_samples, img_width * img_height)
# To use with Flux Dense layer, we want features x samples: (784, 100)
data_for_dense = permutedims(flat_data_as_rows, (2,1)) # Becomes 784x100
# To use with Flux Conv layer, we want WHCN: (28, 28, 1, 100)
# First, ensure data is (features, samples) i.e., 784x100
# Then reshape each column (sample) into WxHxC
data_for_cnn = reshape(data_for_dense, img_width, img_height, channels, num_samples)
println("Shape for Dense layer: ", size(data_for_dense))
println("Shape for CNN layer: ", size(data_for_cnn))
Data Type Conversion: Ensure your data is of type Float32 (or Float64 if precision is critical, but Float32 is standard for deep learning).
# If data_matrix is Array{Float64, 2}
# data_matrix_f32 = Float32.(data_matrix)
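For instance, a small runnable sketch with an arbitrary example matrix:
data_matrix = rand(Float64, 3, 4)       # pretend this came from your pipeline
data_matrix_f32 = Float32.(data_matrix)
println(eltype(data_matrix_f32))        # Float32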
A critical step before training is to split your dataset into at least two, preferably three, subsets: a training set used to fit the model's parameters, a validation set used to tune hyperparameters and monitor for overfitting, and a test set held out for the final, unbiased performance estimate. A common split is 60-80% for training, 10-20% for validation, and 10-20% for testing. MLUtils.jl provides splitobs for this.
using Flux, MLUtils
# Assuming `features` is your input data (e.g., a matrix)
# and `labels` is your target data (e.g., a vector or matrix)
# features = rand(Float32, 10, 1000) # 10 features, 1000 samples
# labels = rand(Float32, 1, 1000) # 1 output, 1000 samples
# For example:
num_total_samples = 1000
X_data = rand(Float32, 5, num_total_samples) # 5 features
Y_data = Flux.onehotbatch(rand(1:3, num_total_samples), 1:3) # 3 classes
# Split into training (70%) and test (30%)
(X_train, Y_train), (X_test, Y_test) = splitobs((X_data, Y_data), at=0.7, shuffle=true)
# Further split test set into validation and test (e.g., 15% val, 15% test from original)
# Original test set is 30% of total. We want to split it 50/50.
# So, 0.5 of the current X_test goes to validation.
(X_val, Y_val), (X_test_final, Y_test_final) = splitobs((X_test, Y_test), at=0.5, shuffle=false) # No shuffle needed usually here
println("Training samples: ", size(X_train, 2))
println("Validation samples: ", size(X_val, 2))
println("Test samples: ", size(X_test_final, 2))
When splitting, especially for classification tasks with imbalanced classes, it's good practice to use stratified sampling. This ensures that the proportion of each class is roughly the same across the training, validation, and test sets. MLUtils.splitobs can often handle this if your labels are provided appropriately, or you might need to use more specialized tools from MLJ.jl or implement custom logic for stratification.
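One way to implement such custom logic (a minimal sketch; the stratified_split function and the labels below are illustrative assumptions, not an MLUtils or MLJ API):
using Random
# Split indices so that each class keeps roughly the same proportion in both subsets
function stratified_split(y; at=0.7, rng=Random.default_rng())
    train_idx, test_idx = Int[], Int[]
    for c in unique(y)
        idx = shuffle(rng, findall(==(c), y))   # shuffled indices of class c
        n_train = round(Int, at * length(idx))
        append!(train_idx, idx[1:n_train])
        append!(test_idx, idx[n_train+1:end])
    end
    return shuffle(rng, train_idx), shuffle(rng, test_idx)
end
y_labels = rand(1:3, 1000)                      # hypothetical class labels
train_idx, test_idx = stratified_split(y_labels; at=0.7)
# Use the indices to slice your data column-wise, e.g. X_data[:, train_idx]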
Let's tie some of these concepts together with a small dataset. Imagine we have data about fruits:
| Weight (g) | Color | Texture | IsSweet |
|---|---|---|---|
| 150 | Red | Smooth | 1 |
| 120 | Green | Smooth | 0 |
| 160 | Red | Bumpy | 1 |
| missing | Yellow | Smooth | 1 |
Our goal is to predict IsSweet (1 for sweet, 0 for not sweet).
Load and Represent Data (using DataFrame for clarity)
using DataFrames, Statistics, Flux
df = DataFrame(
Weight = [150.0, 120.0, 160.0, missing],
Color = ["Red", "Green", "Red", "Yellow"],
Texture = ["Smooth", "Smooth", "Bumpy", "Smooth"],
IsSweet = [1, 0, 1, 1]
)
Handle Missing Values (Impute Weight)
mean_weight = mean(skipmissing(df.Weight))
df.Weight = coalesce.(df.Weight, mean_weight)
println("DataFrame after handling missing values:")
display(df)
Feature Scaling (Min-Max scale Weight)
# Using the function defined earlier
function min_max_scale_col(X_col)
X_min = minimum(X_col)
X_max = maximum(X_col)
return (X_col .- X_min) ./ (X_max - X_min)
end
df.Weight_scaled = min_max_scale_col(df.Weight)
Encode Categorical Features (Color, Texture) using One-Hot Encoding
unique_colors = unique(df.Color)
color_onehot = Flux.onehotbatch(df.Color, unique_colors)
unique_textures = unique(df.Texture)
texture_onehot = Flux.onehotbatch(df.Texture, unique_textures)
# Convert OneHotArrays to regular matrices for combining
color_matrix = Float32.(Matrix(color_onehot))
texture_matrix = Float32.(Matrix(texture_onehot))
Combine Features into a Single Matrix (Features x Samples)
# Scaled weight (1 feature)
weight_feature_matrix = reshape(Float32.(df.Weight_scaled), 1, nrow(df))
# Combine all feature matrices
# Features are rows, samples are columns
X_matrix = vcat(weight_feature_matrix, color_matrix, texture_matrix)
println("\nFinal feature matrix X (Float32):")
display(X_matrix)
println("Size of X: ", size(X_matrix))
# Prepare labels Y (1 x Samples)
Y_matrix = reshape(Float32.(df.IsSweet), 1, nrow(df))
println("\nLabel matrix Y (Float32):")
display(Y_matrix)
println("Size of Y: ", size(Y_matrix))
The X_matrix (features) and Y_matrix (labels) are now in a format suitable for input to a Flux model: Float32 arrays, with features arranged row-wise and samples column-wise in X_matrix.
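To confirm the shapes line up, here is a minimal sketch passing X_matrix through a small Flux model (the architecture is an illustrative assumption, not prescribed by this section):
# A tiny model whose input size matches the 6 feature rows of X_matrix
model = Chain(Dense(6 => 8, relu), Dense(8 => 1, sigmoid))
y_hat = model(X_matrix)
println(size(y_hat))  # (1, 4): one prediction per sample column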
Thorough data preparation and preprocessing are foundational. These steps ensure your model receives data in the most informative and computationally friendly format. With your data cleaned, transformed, and properly structured, you're now ready to consider how to efficiently load and iterate over it during training, which is the topic of the next section on MLUtils.jl.