Neural networks derive much of their expressive power from non-linear activation functions. Without them, a multi-layered network would, in essence, collapse into a single linear transformation, severely limiting its ability to model complex relationships in data. Activation functions introduce these necessary non-linearities, typically applied after the linear transformation (weights and biases) within a neuron or layer. As we build models with Flux.jl, understanding and correctly applying these functions is a significant step towards effective network design.
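To see why this matters, the small check below (an illustrative addition, using the weight and bias fields of Flux's Dense layers) shows that two stacked linear layers collapse into a single affine map:

using Flux

# Two Dense layers with no activation function (i.e. identity)
stacked = Chain(Dense(4, 8), Dense(8, 3))

# Fold the two affine maps into one: W = W2*W1, b = W2*b1 + b2
W = stacked[2].weight * stacked[1].weight
b = stacked[2].weight * stacked[1].bias .+ stacked[2].bias

x = randn(Float32, 4)
println(stacked(x) ≈ W * x .+ b)  # true: the stack is just one affine transformation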
Flux.jl provides a suite of common activation functions that are readily available and optimized for performance. These functions are standard Julia functions and can be applied element-wise to arrays or incorporated directly into layer definitions. This flexibility allows for both standard network constructions and more experimental architectures.
Let's examine some of the most frequently used activation functions, their properties, and how to use them in Flux.
Each activation function has distinct characteristics that influence a neural network's learning dynamics and performance.
The sigmoid function, also known as the logistic function, maps any real-valued number into a range between 0 and 1. Its mathematical form is:

σ(x) = 1 / (1 + e^(−x))

Historically, sigmoid was a popular choice for hidden layers, but it has largely been superseded by functions like ReLU due to certain drawbacks.
Properties:

- Output lies in the range (0, 1), which makes it a natural choice for producing probabilities.
- It saturates for large positive or negative inputs, where its gradient approaches zero (the vanishing gradient problem).
- Its output is not zero-centered, which can slow down optimization.

In Flux.jl, you can use sigmoid:
using Flux
input_data = randn(Float32, 5) # Example input
output_data = sigmoid.(input_data) # Apply sigmoid element-wise
println(output_data)
# Within a layer
layer = Dense(10, 5, sigmoid) # A dense layer with 5 output neurons using sigmoid activation
The sigmoid activation function, mapping inputs to the (0, 1) range.
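As a quick sanity check (an illustrative addition, not part of the original example), Flux's sigmoid agrees with the formula above:

using Flux
x = 0.7f0
println(sigmoid(x) ≈ 1 / (1 + exp(-x)))  # true: matches σ(x) = 1 / (1 + e^(−x))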
The hyperbolic tangent function, or tanh, is another S-shaped function similar to sigmoid, but it maps inputs to the range (-1, 1). Its mathematical form is:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)) = 2σ(2x) − 1
Properties:

- Its output is zero-centered, which can be advantageous for optimization as gradients are less likely to be biased in one direction.
- It also suffers from the vanishing gradient problem for large positive or negative inputs, though its gradients are generally steeper than sigmoid's.

In Flux.jl, you use tanh:
using Flux
input_data = randn(Float32, 5)
output_data = tanh.(input_data)
println(output_data)
layer = Dense(10, 5, tanh) # Dense layer with tanh activation
The hyperbolic tangent (tanh) activation function, mapping inputs to the (-1, 1) range. Its zero-centered output is often beneficial.
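The identity tanh(x) = 2σ(2x) − 1 from above can also be verified numerically; this short check is an illustrative addition:

using Flux
x = randn(Float32, 4)
println(tanh.(x) ≈ 2 .* sigmoid.(2 .* x) .- 1)  # true up to floating-point error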
The Rectified Linear Unit, or ReLU, has become a standard activation function for hidden layers in many types of neural networks due to its simplicity and effectiveness. Its mathematical form is:

ReLU(x) = max(0, x)
Properties:

- Computationally cheap: it is just a threshold at zero.
- It does not saturate for positive inputs, which helps mitigate the vanishing gradient problem and often speeds up training.
- It outputs exactly zero for negative inputs; units that only ever receive negative inputs stop updating, the so-called "dying ReLU" problem.

In Flux.jl, relu is the function:
using Flux
input_data = randn(Float32, 5)
output_data = relu.(input_data)
println(output_data)
layer = Dense(10, 5, relu) # Dense layer with ReLU activation
The Rectified Linear Unit (ReLU) activation function. It outputs the input directly if positive, and zero otherwise.
To address the "dying ReLU" problem, several variants of ReLU have been proposed. One of the most common is Leaky ReLU.
Leaky ReLU: Allows a small, non-zero gradient when the unit is not active. Its mathematical form is:

LeakyReLU(x) = x if x > 0, and αx if x ≤ 0

where α is a small positive constant, typically around 0.01 to 0.2.
Properties:

- It keeps a small, non-zero gradient (slope α) for negative inputs, so units do not die completely.
- It retains the computational simplicity of ReLU.

Flux.jl provides leakyrelu. The α parameter can be specified (defaulting to 0.01).
using Flux
input_data = randn(Float32, 5)
output_alpha_default = leakyrelu.(input_data) # Default alpha = 0.01
output_alpha_custom = leakyrelu.(input_data, 0.2f0) # Custom alpha = 0.2
println("Leaky ReLU (alpha=0.01): ", output_alpha_default)
println("Leaky ReLU (alpha=0.2): ", output_alpha_custom)
layer = Dense(10, 5, x -> leakyrelu(x, 0.1f0)) # Using leakyrelu with custom alpha in a layer
Other variants include:

- Parametric ReLU (PReLU): makes α a learnable parameter. Flux does not provide a built-in prelu function that learns α as part of the Dense layer's activation argument, but custom layers can implement this, as sketched below.
- Exponential Linear Unit (ELU): available as elu(x, α=1.0f0). ELU can sometimes lead to faster learning and better generalization than ReLU.

Comparison of ReLU and Leaky ReLU. Leaky ReLU introduces a small slope for negative inputs.
The softmax function is typically used in the output layer of a neural network for multi-class classification problems. It converts a vector of raw scores (logits) into a probability distribution over K different classes. For a vector x of K logits, the softmax for the j-th element is:

softmax(x_j) = e^(x_j) / Σ_{k=1}^{K} e^(x_k)
Properties:

- Outputs are non-negative and sum to 1, so they can be interpreted as class probabilities.
- The exponentiation amplifies differences between logits: larger logits receive disproportionately more probability mass.
- It operates on a whole vector of scores, rather than element-wise like the previous functions.

In Flux.jl, softmax operates on an array. For a matrix of logits it normalizes along the first dimension by default (dims=1), treating each column as one sample; for a single instance (a vector), it computes the standard softmax.
using Flux
logits = randn(Float32, 3) # Example logits for 3 classes
probabilities = softmax(logits)
println("Logits: ", logits)
println("Probabilities: ", probabilities)
println("Sum of Probabilities: ", sum(probabilities)) # Should be close to 1.0
# In a model for 3-class classification:
model = Chain(
Dense(10, 3), # Output layer with 3 neurons (one per class)
softmax # Apply softmax to get probabilities
)
# Note: When using with cross-entropy loss, often the logits are passed directly to the loss function
# (e.g., `Flux.logitcrossentropy`), which internally applies softmax or a stable equivalent.
# However, for getting direct probability outputs from the model, softmax is applied.
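A small illustrative check of the batched behavior, where each column holds the logits for one sample:

using Flux
logits_batch = randn(Float32, 3, 4)   # 3 classes, batch of 4 samples (one per column)
probs_batch = softmax(logits_batch)   # same as softmax(logits_batch; dims=1)
println(sum(probs_batch; dims=1))     # each column sums to (approximately) 1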
It's important to note that when using loss functions like Flux.logitcrossentropy, you often pass the raw logits (the output of the Dense layer before softmax) directly to the loss function. This is because logitcrossentropy combines the softmax operation with the cross-entropy calculation for better numerical stability and efficiency. However, if you need the actual probability outputs from your model during inference, you would apply softmax explicitly.
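The check below illustrates this equivalence; it is a sketch added here for clarity and assumes one-hot encoded targets built with Flux.onehotbatch:

using Flux

logits  = randn(Float32, 3, 4)                  # raw scores: 3 classes, batch of 4
targets = Flux.onehotbatch([1, 3, 2, 1], 1:3)   # one-hot encoded class labels

loss_fused  = Flux.logitcrossentropy(logits, targets)      # softmax fused into the loss (numerically stable)
loss_manual = Flux.crossentropy(softmax(logits), targets)  # explicit softmax, then cross-entropy

println(loss_fused ≈ loss_manual)  # true up to floating-point error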
As seen in the examples, Flux.jl makes it straightforward to incorporate activation functions into your layers. The Dense layer, for instance, accepts an activation function as its third argument:
# A Dense layer with 10 input features, 20 output features, and ReLU activation
hidden_layer = Dense(10, 20, relu)
# An output layer for binary classification, 20 inputs, 1 output, sigmoid activation
output_layer_binary = Dense(20, 1, sigmoid)
If you omit the activation function in a Dense layer, it defaults to identity, which means no activation is applied (a linear layer):
linear_layer = Dense(5, 5) # Equivalent to Dense(5, 5, identity)
You can also apply activation functions directly to the output of a layer or any array using Julia's broadcasting syntax:
using Flux
# Example: Applying relu after a linear layer calculation
W = randn(Float32, 3, 5) # Weight matrix
b = randn(Float32, 3) # Bias vector
x = randn(Float32, 5) # Input vector
z = W * x .+ b # Linear transformation
h = relu.(z) # Apply relu element-wise
println(h)
This element-wise application is fundamental to how activation functions work on the outputs of neurons.
Choosing the right activation function can significantly impact your model's performance, and there are no universal rules. However, some general guidelines and common practices exist (a short sketch pulling them together appears after this list):

- Hidden layers: ReLU or one of its variants is a common default, as discussed above.
- Binary classification: use sigmoid in the output layer to get a probability output for the positive class.
- Multi-class classification: use softmax in the output layer to get a probability distribution over all classes.
- Regression: typically no activation (identity) is used in the output layer if the output can take any real value. If the output is constrained (e.g., always positive), an appropriate function like relu or softplus might be considered.
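To make these guidelines concrete, here is a brief sketch (with arbitrary, illustrative layer sizes) of typical output-layer choices for each task:

using Flux

# Binary classification: sigmoid output gives the probability of the positive class
binary_model = Chain(Dense(16, 8, relu), Dense(8, 1, sigmoid))

# Multi-class classification: softmax output gives a distribution over 4 classes
multiclass_model = Chain(Dense(16, 8, relu), Dense(8, 4), softmax)

# Regression: no output activation (identity), so the output can take any real value
regression_model = Chain(Dense(16, 8, relu), Dense(8, 1))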
As you build more complex networks in Flux.jl, you'll become more familiar with these functions and develop an intuition for which ones to choose. Remember that the flexibility of Julia and Flux even allows you to define custom activation functions if your application demands something unique. The next sections will cover loss functions and optimizers, which work in tandem with your network architecture and activation functions to train your models effectively.