Neural networks derive much of their expressive power from non-linear activation functions. Without them, a multi-layered network would, in essence, collapse into a single linear transformation, severely limiting its ability to model complex relationships in data. Activation functions introduce these necessary non-linearities, typically applied after the linear transformation (weights and biases) within a neuron or layer. As we build models with Flux.jl, understanding and correctly applying these functions is a significant step towards effective network design.
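To see why this matters, the small check below (an illustrative addition, using the weight and bias fields of Flux's Dense layers) shows that two stacked linear layers collapse into a single affine map:

using Flux

# Two Dense layers with no activation function (i.e. identity)
stacked = Chain(Dense(4, 8), Dense(8, 3))

# Fold the two affine maps into one: W = W2*W1, b = W2*b1 + b2
W = stacked[2].weight * stacked[1].weight
b = stacked[2].weight * stacked[1].bias .+ stacked[2].bias

x = randn(Float32, 4)
println(stacked(x) ≈ W * x .+ b)  # true: the stack is just one affine transformation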
Flux.jl provides a suite of common activation functions that are readily available and optimized for performance. These functions are standard Julia functions and can be applied element-wise to arrays or incorporated directly into layer definitions. This flexibility allows for both standard network constructions and more experimental architectures.
Let's examine some of the most frequently used activation functions, their properties, and how to use them in Flux.
Each activation function has distinct characteristics that influence a neural network's learning dynamics and performance.
The sigmoid function, also known as the logistic function, maps any real-valued number into a range between 0 and 1. Its mathematical form is:

σ(x) = 1 / (1 + e^(−x))

Historically, sigmoid was a popular choice for hidden layers, but it has largely been superseded by functions like ReLU due to certain drawbacks.
Properties:

- Output lies in the range (0, 1), which makes it a natural choice for producing probabilities.
- It saturates for large positive or negative inputs, where its gradient approaches zero (the vanishing gradient problem).
- Its output is not zero-centered, which can slow down optimization.

In Flux.jl, you can use sigmoid:
using Flux
input_data = randn(Float32, 5) # Example input
output_data = sigmoid.(input_data) # Apply sigmoid element-wise
println(output_data)
# Within a layer
layer = Dense(10, 5, sigmoid) # A dense layer with 5 output neurons using sigmoid activation
The sigmoid activation function, mapping inputs to the (0, 1) range.
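As a quick sanity check (an illustrative addition, not part of the original example), Flux's sigmoid agrees with the formula above:

using Flux
x = 0.7f0
println(sigmoid(x) ≈ 1 / (1 + exp(-x)))  # true: matches σ(x) = 1 / (1 + e^(−x))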
The hyperbolic tangent function, or tanh, is another S-shaped function similar to sigmoid, but it maps inputs to the range (-1, 1). Its mathematical form is:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)) = 2σ(2x) − 1
Properties:

- Its output is zero-centered, which can be advantageous for optimization as gradients are less likely to be biased in one direction.
- It also suffers from the vanishing gradient problem for large positive or negative inputs, though its gradients are generally steeper than sigmoid's.

In Flux.jl, you use tanh:
using Flux
input_data = randn(Float32, 5)
output_data = tanh.(input_data)
println(output_data)
layer = Dense(10, 5, tanh) # Dense layer with tanh activation
The hyperbolic tangent (tanh) activation function, mapping inputs to the (-1, 1) range. Its zero-centered output is often beneficial.
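The identity tanh(x) = 2σ(2x) − 1 from above can also be verified numerically; this short check is an illustrative addition:

using Flux
x = randn(Float32, 4)
println(tanh.(x) ≈ 2 .* sigmoid.(2 .* x) .- 1)  # true up to floating-point error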
The Rectified Linear Unit, or ReLU, has become a standard activation function for hidden layers in many types of neural networks due to its simplicity and effectiveness. Its mathematical form is:

ReLU(x) = max(0, x)
Properties:

- Computationally cheap: it is just a threshold at zero.
- It does not saturate for positive inputs, which helps mitigate the vanishing gradient problem and often speeds up training.
- It outputs exactly zero for negative inputs; units that only ever receive negative inputs stop updating, the so-called "dying ReLU" problem.

In Flux.jl, relu is the function:
using Flux
input_data = randn(Float32, 5)
output_data = relu.(input_data)
println(output_data)
layer = Dense(10, 5, relu) # Dense layer with ReLU activation
The Rectified Linear Unit (ReLU) activation function. It outputs the input directly if positive, and zero otherwise.
To address the "dying ReLU" problem, several variants of ReLU have been proposed. One of the most common is Leaky ReLU.
Leaky ReLU: Allows a small, non-zero gradient when the unit is not active. Its mathematical form is:

LeakyReLU(x) = x if x > 0, and αx if x ≤ 0

where α is a small positive constant, typically around 0.01 to 0.2.
Properties:

- It keeps a small, non-zero gradient (slope α) for negative inputs, so units do not die completely.
- It retains the computational simplicity of ReLU.

Flux.jl provides leakyrelu. The α parameter can be specified (defaulting to 0.01).
using Flux
input_data = randn(Float32, 5)
output_alpha_default = leakyrelu.(input_data) # Default alpha = 0.01
output_alpha_custom = leakyrelu.(input_data, 0.2f0) # Custom alpha = 0.2
println("Leaky ReLU (alpha=0.01): ", output_alpha_default)
println("Leaky ReLU (alpha=0.2): ", output_alpha_custom)
layer = Dense(10, 5, x -> leakyrelu(x, 0.1f0)) # Using leakyrelu with custom alpha in a layer
Other variants include:

- Parametric ReLU (PReLU): makes α a learnable parameter. Flux does not provide a built-in prelu function that learns α as part of the Dense layer's activation argument, but custom layers can implement this, as sketched below.
- Exponential Linear Unit (ELU): available as elu(x, α=1.0f0). ELU can sometimes lead to faster learning and better generalization than ReLU.

Comparison of ReLU and Leaky ReLU. Leaky ReLU introduces a small slope for negative inputs.
The softmax function is typically used in the output layer of a neural network for multi-class classification problems. It converts a vector of raw scores (logits) into a probability distribution over K different classes. For a vector x of K logits, the softmax for the j-th element is:

softmax(x_j) = e^(x_j) / Σ_{k=1}^{K} e^(x_k)
Properties:

- Outputs are non-negative and sum to 1, so they can be interpreted as class probabilities.
- The exponentiation amplifies differences between logits: larger logits receive disproportionately more probability mass.
- It operates on a whole vector of scores, rather than element-wise like the previous functions.

In Flux.jl, softmax operates on an array. For a matrix of logits it normalizes along the first dimension by default (dims=1), treating each column as one sample; for a single instance (a vector), it computes the standard softmax.
using Flux
logits = randn(Float32, 3) # Example logits for 3 classes
probabilities = softmax(logits)
println("Logits: ", logits)
println("Probabilities: ", probabilities)
println("Sum of Probabilities: ", sum(probabilities)) # Should be close to 1.0
# In a model for 3-class classification:
model = Chain(
Dense(10, 3), # Output layer with 3 neurons (one per class)
softmax # Apply softmax to get probabilities
)
# Note: When using with cross-entropy loss, often the logits are passed directly to the loss function
# (e.g., `Flux.logitcrossentropy`), which internally applies softmax or a stable equivalent.
# However, for getting direct probability outputs from the model, softmax is applied.
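A small illustrative check of the batched behavior, where each column holds the logits for one sample:

using Flux
logits_batch = randn(Float32, 3, 4)   # 3 classes, batch of 4 samples (one per column)
probs_batch = softmax(logits_batch)   # same as softmax(logits_batch; dims=1)
println(sum(probs_batch; dims=1))     # each column sums to (approximately) 1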
It's important to note that when using loss functions like Flux.logitcrossentropy, you often pass the raw logits (the output of the Dense layer before softmax) directly to the loss function. This is because logitcrossentropy combines the softmax operation with the cross-entropy calculation for better numerical stability and efficiency. However, if you need the actual probability outputs from your model during inference, you would apply softmax explicitly.
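The check below illustrates this equivalence; it is a sketch added here for clarity and assumes one-hot encoded targets built with Flux.onehotbatch:

using Flux

logits  = randn(Float32, 3, 4)                  # raw scores: 3 classes, batch of 4
targets = Flux.onehotbatch([1, 3, 2, 1], 1:3)   # one-hot encoded class labels

loss_fused  = Flux.logitcrossentropy(logits, targets)      # softmax fused into the loss (numerically stable)
loss_manual = Flux.crossentropy(softmax(logits), targets)  # explicit softmax, then cross-entropy

println(loss_fused ≈ loss_manual)  # true up to floating-point error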
As seen in the examples, Flux.jl makes it straightforward to incorporate activation functions into your layers. The Dense layer, for instance, accepts an activation function as its third argument:
# A Dense layer with 10 input features, 20 output features, and ReLU activation
hidden_layer = Dense(10, 20, relu)
# An output layer for binary classification, 20 inputs, 1 output, sigmoid activation
output_layer_binary = Dense(20, 1, sigmoid)
If you omit the activation function in a Dense layer, it defaults to identity, which means no activation is applied (a linear layer):
linear_layer = Dense(5, 5) # Equivalent to Dense(5, 5, identity)
You can also apply activation functions directly to the output of a layer or any array using Julia's broadcasting syntax:
using Flux
# Example: Applying relu after a linear layer calculation
W = randn(Float32, 3, 5) # Weight matrix
b = randn(Float32, 3) # Bias vector
x = randn(Float32, 5) # Input vector
z = W * x .+ b # Linear transformation
h = relu.(z) # Apply relu element-wise
println(h)
This element-wise application is fundamental to how activation functions work on the outputs of neurons.
Choosing the right activation function can significantly impact your model's performance, and there are no universal rules. However, some general guidelines and common practices exist (a short sketch pulling them together appears after this list):

- Hidden layers: ReLU or one of its variants is a common default, as discussed above.
- Binary classification: use sigmoid in the output layer to get a probability output for the positive class.
- Multi-class classification: use softmax in the output layer to get a probability distribution over all classes.
- Regression: typically no activation (identity) is used in the output layer if the output can take any real value. If the output is constrained (e.g., always positive), an appropriate function like relu or softplus might be considered.
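To make these guidelines concrete, here is a brief sketch (with arbitrary, illustrative layer sizes) of typical output-layer choices for each task:

using Flux

# Binary classification: sigmoid output gives the probability of the positive class
binary_model = Chain(Dense(16, 8, relu), Dense(8, 1, sigmoid))

# Multi-class classification: softmax output gives a distribution over 4 classes
multiclass_model = Chain(Dense(16, 8, relu), Dense(8, 4), softmax)

# Regression: no output activation (identity), so the output can take any real value
regression_model = Chain(Dense(16, 8, relu), Dense(8, 1))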
As you build more complex networks in Flux.jl, you'll become more familiar with these functions and develop an intuition for which ones to choose. Remember that the flexibility of Julia and Flux even allows you to define custom activation functions if your application demands something unique. The next sections will cover loss functions and optimizers, which work in tandem with your network architecture and activation functions to train your models effectively.