Neural networks gain much of their representational capability from the introduction of non-linearities between layers. If we simply stacked linear transformations (like nn.Linear layers) without any intervening functions, the entire network would collapse into a single, equivalent linear transformation. No matter how many layers deep, the network could only learn linear relationships between inputs and outputs.
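To see this collapse concretely, here is a minimal sketch (the layer sizes and variable names are illustrative) showing that two stacked nn.Linear layers with no activation in between can be reproduced exactly by a single nn.Linear layer whose weight and bias are composed from the two originals:

import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked linear layers with no activation between them
layer1 = nn.Linear(10, 20)
layer2 = nn.Linear(20, 5)

x = torch.randn(3, 10)
stacked_output = layer2(layer1(x))

# A single linear layer with composed parameters: W = W2 @ W1, b = W2 @ b1 + b2
combined = nn.Linear(10, 5)
with torch.no_grad():
    combined.weight.copy_(layer2.weight @ layer1.weight)
    combined.bias.copy_(layer2.weight @ layer1.bias + layer2.bias)

single_output = combined(x)

# The two computations agree up to floating-point precision
print(torch.allclose(stacked_output, single_output, atol=1e-5))  # True

Because the composition of linear maps is itself linear, depth alone adds no expressive power without a non-linearity in between.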
Activation functions are the components that introduce these essential non-linearities. Applied element-wise to the output of a layer (often called the pre-activation or logit), they transform the values before passing them to the next layer. PyTorch provides a wide variety of activation functions within the torch.nn module, typically used by instantiating them as layers within your model definition. Let's look at three of the most common ones: ReLU, Sigmoid, and Tanh.
The Rectified Linear Unit, or ReLU, is arguably the most popular activation function in modern deep learning, especially in convolutional neural networks. Its definition is remarkably simple: it outputs the input directly if it's positive, and outputs zero otherwise.
Mathematically, it's defined as:
$$\text{ReLU}(x) = \max(0, x)$$

In PyTorch, you can use nn.ReLU:
import torch
import torch.nn as nn
# Example usage
relu_activation = nn.ReLU()
input_tensor = torch.randn(4) # Example input tensor
output_tensor = relu_activation(input_tensor)
print(f"Input: {input_tensor}")
print(f"Output after ReLU: {output_tensor}")
# Example within a simple model
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(10, 20)
        self.activation = nn.ReLU()
        self.layer2 = nn.Linear(20, 5)

    def forward(self, x):
        x = self.layer1(x)
        x = self.activation(x)  # Apply ReLU
        x = self.layer2(x)
        return x

model = SimpleNet()
The ReLU function is zero for negative inputs and linear for positive inputs.
Advantages:
- Computationally very cheap: just a threshold at zero.
- Does not saturate for positive inputs, so gradients flow well through deep networks and training often converges faster than with Sigmoid or Tanh.
- Produces sparse activations, since negative pre-activations are mapped exactly to zero.
Disadvantages:
- Not zero-centered.
- Units can "die": if a neuron's pre-activation stays negative, its gradient is zero and its weights may stop updating (the "dying ReLU" problem).
The Sigmoid function, sometimes called the logistic function, squashes its input into a range between 0 and 1. It was historically popular, especially in the output layer of binary classification models where the output represents a probability.
Its mathematical form is:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

In PyTorch, use nn.Sigmoid:
import torch
import torch.nn as nn
# Example usage
sigmoid_activation = nn.Sigmoid()
input_tensor = torch.randn(4) # Example input tensor
output_tensor = sigmoid_activation(input_tensor)
print(f"Input: {input_tensor}")
print(f"Output after Sigmoid: {output_tensor}")
The Sigmoid function smoothly maps any real number to the range (0, 1).
Advantages:
- Output is bounded in (0, 1), so it can be interpreted as a probability.
- Smooth and differentiable everywhere.
Disadvantages:
- Saturates for large positive or negative inputs, where the gradient approaches zero (the vanishing gradient problem).
- Not zero-centered, which can slow down gradient-based optimization.
- Involves an exponential, so it is more expensive to compute than ReLU.
Due to the vanishing gradient problem, Sigmoid is less commonly used in hidden layers of deep networks today compared to ReLU, but it remains relevant for output layers in specific tasks like binary classification or multi-label classification.
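As a sketch of that output-layer use (the layer sizes and class name below are illustrative, not taken from the original text), a binary classifier can end in a single logit passed through nn.Sigmoid to produce a probability; in practice, nn.BCEWithLogitsLoss is often preferred over nn.Sigmoid followed by nn.BCELoss because it applies the sigmoid internally in a more numerically stable way.

import torch
import torch.nn as nn

# Illustrative binary classifier: one probability per example
class BinaryClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(10, 16)
        self.hidden_activation = nn.ReLU()
        self.layer2 = nn.Linear(16, 1)   # single logit
        self.output_activation = nn.Sigmoid()

    def forward(self, x):
        x = self.hidden_activation(self.layer1(x))
        logit = self.layer2(x)
        return self.output_activation(logit)  # probability in (0, 1)

model = BinaryClassifier()
probabilities = model(torch.randn(4, 10))
print(probabilities)  # four values, each between 0 and 1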
The hyperbolic tangent, or Tanh function, is mathematically related to Sigmoid but squashes its input into the range (-1, 1).
It's defined as:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\sigma(2x) - 1$$

In PyTorch, use nn.Tanh:
import torch
import torch.nn as nn
# Example usage
tanh_activation = nn.Tanh()
input_tensor = torch.randn(4) # Example input tensor
output_tensor = tanh_activation(input_tensor)
print(f"Input: {input_tensor}")
print(f"Output after Tanh: {output_tensor}")
The Tanh function smoothly maps any real number to the range (-1, 1).
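As a quick numerical check of the identity $\tanh(x) = 2\sigma(2x) - 1$ given above, here is a small verification sketch:

import torch

x = torch.linspace(-3.0, 3.0, steps=7)
lhs = torch.tanh(x)
rhs = 2 * torch.sigmoid(2 * x) - 1
print(torch.allclose(lhs, rhs, atol=1e-6))  # True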
Advantages:
- Zero-centered output in (-1, 1), which generally makes optimization easier than with Sigmoid.
- Stronger gradients than Sigmoid around zero.
Disadvantages:
- Still saturates for large-magnitude inputs, so it also suffers from vanishing gradients in deep networks.
- More expensive to compute than ReLU.
Tanh was often preferred over Sigmoid for hidden layers before the rise of ReLU, mainly because of its zero-centered output range. It's still commonly found in recurrent neural networks (RNNs) and LSTMs.
There's no single "best" activation function for all scenarios. However, some general guidelines are:
- Start with ReLU for hidden layers; it is a cheap, effective default, especially in feed-forward and convolutional networks.
- Use Sigmoid in the output layer for binary or multi-label classification, where each output should represent a probability.
- Consider Tanh when zero-centered activations matter, for example in recurrent architectures.
Experimentation is often necessary to find the optimal activation function for a specific architecture and dataset. In PyTorch, swapping activation functions is straightforward, usually involving changing just one line where the activation module is instantiated or called within your nn.Module's forward method.
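One convenient pattern (the class and argument names here are illustrative, not a required PyTorch convention) is to pass the activation module as a constructor argument, so trying a different non-linearity means changing a single argument:

import torch
import torch.nn as nn

class ConfigurableNet(nn.Module):
    def __init__(self, activation: nn.Module):
        super().__init__()
        self.layer1 = nn.Linear(10, 20)
        self.activation = activation   # any element-wise activation module
        self.layer2 = nn.Linear(20, 5)

    def forward(self, x):
        x = self.activation(self.layer1(x))
        return self.layer2(x)

# Swapping activations only changes the argument
relu_model = ConfigurableNet(nn.ReLU())
tanh_model = ConfigurableNet(nn.Tanh())
print(relu_model(torch.randn(2, 10)).shape)  # torch.Size([2, 5])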