Multilayer Perceptrons, or MLPs, are a foundational type of neural network. They consist of one or more layers of neurons, where each neuron in a preceding layer is connected to every neuron in the subsequent layer. This dense connectivity is why they are often called "fully connected networks." While simple in concept, MLPs are powerful enough to learn complex, non-linear relationships in data, making them an excellent starting point for understanding neural network architectures. In Chapter 2, you were introduced to Flux.jl's Dense layers and the Chain constructor; these are the primary tools we'll use to build MLPs.
An MLP typically comprises three main types of layers: an input layer, one or more hidden layers, and an output layer.
A typical MLP architecture showing the flow of information from the input layer, through hidden layers, to the output layer. Each connection between layers is "dense," meaning all neurons from the previous layer connect to all neurons in the next.
Let's break down these components:
- Input layer: receives the raw feature values. Its size is set by the number of features in the data, and it has no trainable parameters of its own.
- Hidden layers: one or more fully connected layers between the input and output. Each applies a learned linear transformation followed by a non-linear activation function, which is what allows the network to model non-linear relationships.
- Output layer: produces the network's final prediction. Its size and activation depend on the task, for example a single neuron for regression or one neuron per class for classification.
Building an MLP in Flux.jl is straightforward using Dense layers and combining them into a Chain. A Dense layer, Dense(in::Integer, out::Integer, σ), creates a standard fully connected layer that transforms an input of size in to an output of size out, followed by an activation function σ.
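To see a Dense layer on its own before assembling a full network, here is a minimal sketch; the layer sizes are arbitrary and chosen only for illustration.
using Flux
# A single Dense layer in isolation: 3 inputs -> 2 outputs, with relu activation
layer = Dense(3, 2, relu)
x = rand(Float32, 3)      # one input vector with 3 features
println(layer(x))         # a 2-element output vector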
Let's construct a simple MLP. Suppose we have a dataset with 10 input features, and we want to build a network with two hidden layers: the first with 64 neurons and the second with 32 neurons. For a regression task, we'll have a single output neuron.
using Flux
# Define network dimensions
num_features = 10
num_hidden1 = 64
num_hidden2 = 32
num_outputs = 1 # For a regression task
# Construct the MLP
mlp_model = Chain(
Dense(num_features, num_hidden1, relu), # Input layer (10 features) to first hidden layer (64 neurons)
Dense(num_hidden1, num_hidden2, relu), # First hidden layer (64 neurons) to second hidden layer (32 neurons)
Dense(num_hidden2, num_outputs) # Second hidden layer (32 neurons) to output layer (1 neuron)
# No activation specified for the output layer here;
# for regression, this is common (identity activation).
)
# You can print the model to see its structure
println(mlp_model)
In this code:
- Chain(...) groups the layers sequentially. The output of one layer becomes the input to the next.
- Dense(num_features, num_hidden1, relu) defines our first layer. It takes num_features inputs, produces num_hidden1 outputs, and applies the relu activation function.
- The subsequent Dense layers follow the same pattern, connecting the previous layer's output to the next layer's input.
- The final Dense(num_hidden2, num_outputs) layer doesn't explicitly specify an activation function. By default, Flux's Dense layer uses an identity activation (x -> x) if none is provided, which is suitable for regression tasks. For classification, you might add sigmoid or softmax here, or more commonly, apply it as part of the loss function calculation or as a final step after the model.

To see how data flows through this model (a "forward pass"), we can create some dummy input data. Input data for Flux models is typically expected to have features as rows and observations (samples) as columns.
# Create a batch of 5 dummy data samples, each with 10 features
batch_size = 5
dummy_data = rand(Float32, num_features, batch_size) # Shape: (10, 5)
# Pass the data through the model
predictions = mlp_model(dummy_data)
println("Input data size: ", size(dummy_data))
println("Output predictions size: ", size(predictions)) # Expected: (1, 5)
The output predictions will be a matrix of size (1, 5), where each column is the regression output for the corresponding input sample.
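If you prefer a plain vector with one prediction per sample, you can flatten that single-row matrix; here is a small sketch continuing from the code above.
# Flatten the (1, 5) prediction matrix into a length-5 vector
pred_vector = vec(predictions)
println(pred_vector)      # 5-element Vector{Float32}, one entry per sample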
When you pass data through an MLP like mlp_model(dummy_data), each Dense layer performs two main operations:
- A linear (affine) transformation: the layer multiplies its input X by its weight matrix W and adds its bias vector b, producing Z = W*X + b.
- An element-wise activation: the activation function (such as relu) is applied to every entry of Z. So, the final output of the layer is A = σ(Z), where σ is the activation function.

The Chain ensures that the output A from one layer becomes the input X for the next layer in the sequence, until the final output layer is reached.
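To make these two operations concrete, here is a minimal sketch that recomputes the first layer's output by hand, assuming a recent Flux version in which a Dense layer stores its parameters in the weight and bias fields (continuing from the earlier code).
# Recompute the first Dense layer's forward pass manually
layer1 = mlp_model[1]                          # first layer of the Chain
Z = layer1.weight * dummy_data .+ layer1.bias  # affine transformation, size (64, 5)
A = relu.(Z)                                   # element-wise activation
println(A ≈ layer1(dummy_data))                # should print true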
When designing an MLP, several choices influence its performance:
Network Depth and Width: More hidden layers (depth) and more neurons per layer (width) give the network more capacity to model complex relationships, but they also increase the number of parameters, the training time, and the risk of overfitting. It is usually best to start with a modest architecture and grow it only if performance requires it.
Activation Functions:
- For hidden layers, relu is a very common choice. It's computationally efficient and helps mitigate the "vanishing gradient" problem that can occur in deeper networks with other activations like sigmoid or tanh. Alternatives like leakyrelu or elu can sometimes offer benefits.
- For regression outputs, the identity activation is typical; if the output must be positive, an activation like softplus might be used.
- For binary classification, sigmoid is used to squash the output to a range of (0,1), representing a probability.
- For multi-class classification over N classes, softmax is used to convert the outputs into a probability distribution over the N classes, where each output is in (0,1) and all outputs sum to 1. Flux provides Flux.sigmoid and Flux.softmax (see the classification sketch after this list).
Data Scaling: MLPs train more reliably when input features are on similar scales, for example standardized to zero mean and unit variance. Features with very different ranges can slow down or destabilize gradient-based optimization.
Strengths: MLPs can model a wide range of non-linear relationships and are a good default for tabular data, where features have no inherent spatial or sequential ordering. They are also simple to define and fast to experiment with.
Limitations: Because every neuron connects to every neuron in the next layer, the number of parameters grows quickly with input size. MLPs also do not exploit spatial or sequential structure in data such as images or text, and they require fixed-size inputs.
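To illustrate the classification case mentioned in the activation-function notes above, here is a minimal sketch of a hypothetical 3-class classifier whose raw outputs are turned into probabilities with Flux.softmax; the layer sizes are arbitrary and chosen only for demonstration.
using Flux
# Hypothetical 3-class classifier: 10 features in, one hidden layer, 3 raw output scores
classifier = Chain(
    Dense(10, 32, relu),
    Dense(32, 3)                        # no activation here; softmax is applied afterwards
)
x = rand(Float32, 10, 4)                # a batch of 4 samples
probs = Flux.softmax(classifier(x))     # per-class probabilities for each sample
println(sum(probs; dims=1))             # each column sums to approximately 1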
Multilayer Perceptrons are versatile feedforward neural networks that form a foundation of deep learning. You've now seen how to construct them in Julia using Flux.jl's Dense layers and Chain structure, and understand the design considerations involved.
While MLPs are powerful for certain types of problems, particularly those involving tabular data, this chapter will next introduce Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). These architectures are specifically designed to handle data with spatial and sequential structures, respectively, by incorporating specialized layers that exploit these properties. Understanding MLPs provides a solid foundation for grasping these more advanced architectures.