Activation Functions: Adding Non-linearity
Section 1.36 - Why We Need Activations
Role in Neural Networks
- Add Non-linearity: Without activations, a stack of layers is just a composition of linear maps, which collapses to a single linear transformation.
- Control Output Range: Normalize or bound outputs to specific ranges.
- Enable Gradient Flow: Different activations affect how gradients propagate.
Let’s understand why non-linear activations are crucial:
- Linear Case (No activation):
# Layer 1
h1 = w11*x1 + w12*x2 + b1

# Layer 2
output = w21*h1 + b2

# Substituting h1 into output:
output = w21*(w11*x1 + w12*x2 + b1) + b2
       # = (w21*w11)*x1 + (w21*w12)*x2 + (w21*b1 + b2)
       # = W1*x1 + W2*x2 + B   -> still just a linear function!
No matter how many layers we stack, without activation functions, we just get another linear function. This means we can’t learn complex patterns!
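A quick numerical check makes the collapse concrete. The sketch below is a minimal NumPy illustration (the shapes and random weights are arbitrary assumptions): two stacked linear layers produce exactly the same outputs as a single merged linear layer.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                             # batch of 4 inputs, 3 features
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)    # layer 1 weights
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)    # layer 2 weights

two_layers = (x @ W1 + b1) @ W2 + b2                    # stacked linear layers
merged = x @ (W1 @ W2) + (b1 @ W2 + b2)                 # one equivalent linear layer
print(np.allclose(two_layers, merged))                  # True: same function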
- Non-linear Case (With ReLU):
# Layer 1
h1 = relu(w11*x1 + w12*x2 + b1)   # i.e. max(0, w11*x1 + w12*x2 + b1)

# Layer 2
output = w21*h1 + b2
Now the network can learn:
- Different behaviors for positive/negative inputs
- AND/OR-like logic gates
- Curved decision boundaries
- Complex feature interactions
Consider the XOR function:
# Truth table
# x1 x2 | output
# 0 0 | 0
# 0 1 | 1
# 1 0 | 1
# 1 1 | 0
# XOR implementation with ReLU
# Layer 1: create two decision boundaries
h1 = relu(x1 + x2 - 1.5)   # activates only when both inputs are 1
h2 = relu(x1 + x2 - 0.5)   # activates when at least one input is 1

# Layer 2: subtract the "both-1" case from the "at-least-one-1" case
output = 2*h2 - 6*h1       # weights chosen so the output is exactly 0 or 1
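The sketch below (plain NumPy, with an illustrative helper function) evaluates this two-layer network on all four rows of the truth table:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def xor_net(x1, x2):
    h1 = relu(x1 + x2 - 1.5)
    h2 = relu(x1 + x2 - 0.5)
    return 2*h2 - 6*h1

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))   # 0.0, 1.0, 1.0, 0.0 -- matches XOR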
This demonstrates how non-linear activations allow the network to:
- Create complex decision boundaries
- Learn hierarchical features
- Approximate any continuous function (Universal Approximation Theorem)
Section 1.37 - Common Activation Functions
1. ReLU (Rectified Linear Unit)
\[ f(x) = \max(0, x) \]
Properties:
- Easy gradient computation
- Sparse activation (outputs exact zeros)
- No upper bound
- Default choice for hidden layers
Use Cases:
- Hidden layers in most networks
- Deep networks (helps with vanishing gradient)
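For reference, a minimal NumPy sketch of ReLU and its gradient (the function names are illustrative, not from any particular library):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)       # 1 where the unit is active, 0 otherwise

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))                         # [0.  0.  0.  0.5 2. ]  (sparse: exact zeros)
print(relu_grad(x))                    # [0. 0. 0. 1. 1.]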
2. Sigmoid
\[ f(x) = \frac{1}{1 + e^{-x}} \]
Properties:
- Output range: (0, 1)
- Smooth gradient
- Can saturate (gradient ≈ 0 for large \(\lvert x \rvert\))
Use Cases:
- Binary classification output layer
- Gates in LSTM/GRU cells
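A small NumPy sketch makes the saturation visible; the gradient sigmoid(x) * (1 - sigmoid(x)) is near zero for large |x|:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
s = sigmoid(x)
print(s)             # ~[0.00005 0.269 0.5 0.731 0.99995]
print(s * (1 - s))   # gradient: ~0 at the extremes, 0.25 at x = 0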
3. Tanh (Hyperbolic Tangent)
\[ f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \]
Properties:
- Output range: (-1, 1)
- Zero-centered
- Stronger gradients than sigmoid
Use Cases:
- Hidden layers in shallow networks
- RNN/LSTM states
- Normalized outputs
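NumPy provides tanh directly; a minimal sketch showing the zero-centered outputs and the gradient 1 - tanh(x)^2:

import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(np.tanh(x))          # zero-centered outputs in (-1, 1)
print(1 - np.tanh(x)**2)   # gradient reaches 1.0 at x = 0 (vs 0.25 for sigmoid)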
4. Softmax
\[ f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} \]
Properties:
- Outputs sum to 1
- Emphasizes largest values
- Converts logits to probabilities
Use Cases:
- Multiclass classification output layer
- Attention mechanisms
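A minimal NumPy sketch; subtracting the maximum logit before exponentiating is a standard trick (assumed here, not mandated by the formula) to avoid overflow without changing the result:

import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # shift for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())                 # ~[0.659 0.242 0.099], sums to 1.0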
5. LeakyReLU
\[ f(x) = \max(\alpha x, x) \quad \text{where } \alpha \approx 0.01 \]
Properties:
- Like ReLU but allows small negative gradients
- Helps prevent the “dying ReLU” problem
Use Cases:
- Alternative to ReLU in deep networks
- When dead neurons are an issue
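A minimal NumPy sketch using alpha = 0.01 as in the formula above:

import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small negative slope keeps gradients alive

x = np.array([-5.0, -1.0, 0.0, 2.0])
print(leaky_relu(x))                       # [-0.05 -0.01  0.    2.  ]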
Section 1.38 - Choosing Activation Functions
Guidelines by Layer Type
Hidden Layers:
- Start with ReLU (default choice)
- Try LeakyReLU if you see dead neurons
- Use Tanh for shallow networks or RNNs
Output Layer:
| Task Type              | Activation    | Output Range           |
|------------------------|---------------|------------------------|
| Binary Classification  | Sigmoid       | (0, 1)                 |
| Multiclass             | Softmax       | [0, 1] (sums to 1)     |
| Regression             | Linear (none) | \((-\infty, \infty)\)  |
| Bounded Regression     | Tanh          | (-1, 1)                |
Implementation in Keras
from keras import layers

# As a string
layer = layers.Dense(64, activation='relu')

# As a function
from keras.activations import relu
layer = layers.Dense(64, activation=relu)

# As a separate layer
layer = layers.Dense(64)
activation = layers.Activation('relu')
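Putting the guidelines together, here is a minimal sketch of a multiclass classifier (the 20 input features and 10 output classes are illustrative assumptions, not from the text above):

import keras
from keras import layers

model = keras.Sequential([
    keras.Input(shape=(20,)),                 # 20 input features
    layers.Dense(64, activation='relu'),      # hidden layers: ReLU by default
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax'),   # output layer: softmax over 10 classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])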
Section 1.39 - Universal Approximation
Why Neural Networks Can Approximate Any Function
The Universal Approximation Theorem states that a neural network with:
- One hidden layer (wide enough)
- A suitable non-linear activation function (any continuous, non-polynomial activation works)
- A linear output layer
can approximate any continuous function on a compact domain to arbitrary precision.
Consider \[ f(x) = x^2 \]:
# Can be approximated by a ReLU network:
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Each ReLU "ramp" adds extra slope once x passes its kink point.
def approx_square(x):
    h1 = 0.0 * relu(x - 0.0)   # slope 0 near x = 0 (x**2 is flat there)
    h2 = 1.0 * relu(x - 0.5)   # add slope 1 past x = 0.5
    h3 = 2.0 * relu(x - 1.0)   # add slope 2 past x = 1.0
    return h1 + h2 + h3        # a coarse piecewise-linear approximation
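Evaluating the sketch on a few points shows it only roughly tracks \(x^2\); adding more ReLU pieces (more hidden units) makes the piecewise-linear fit finer:

xs = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
print(approx_square(xs))   # [0.   0.   0.5  2.   3.5 ]
print(xs**2)               # [0.   0.25 1.   2.25 4.  ]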
Key Points:
- More neurons = finer approximation
- Deeper networks = more efficient approximation
- Choice of activation affects:
- How many neurons are needed
- How easy it is to train
- The quality of approximation
Practical Implications:
- The theoretical capability does not necessarily mean it’s easy to train.
- ReLU often works best in practice because:
- Its linear pieces are easy to optimize
- It promotes sparse activation (many exact zeros)
- It provides strong gradient flow when active
Modern Variants
- GELU (Gaussian Error Linear Unit)
\[ f(x) = x \cdot \Phi(x) \]
where \(\Phi(x)\) is the cumulative distribution function of the standard normal distribution.
- Smooth transitions like sigmoid
- Used in transformers (BERT, GPT)
- SiLU/Swish (Sigmoid Linear Unit)
\[ f(x) = x \cdot \operatorname{sigmoid}(x) = \frac{x}{1 + e^{-x}} \]
- Similar to GELU but computationally simpler
- Strong performance in deep networks
- SELU (Scaled Exponential Linear Unit)
\[ f(x) = \lambda \begin{cases} x, & \text{if } x > 0, \\ \alpha \left(e^x - 1\right), & \text{if } x \leq 0, \end{cases} \]
where \(\lambda \approx 1.0507\) and \(\alpha \approx 1.6733\).
- Self-normalizing properties
- Good for deep fully-connected networks (a NumPy sketch of all three variants follows below)
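The following is a minimal NumPy sketch of GELU, SiLU, and SELU (it assumes SciPy is available for the exact normal CDF via erf; the SELU constants are the values quoted above):

import numpy as np
from scipy.special import erf

def gelu(x):
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))              # x * Phi(x)

def silu(x):
    return x / (1.0 + np.exp(-x))                               # x * sigmoid(x)

def selu(x, lam=1.0507, alpha=1.6733):
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-3.0, 3.0, 7)
print(gelu(x))
print(silu(x))
print(selu(x))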
While there are many activation functions, ReLU is the standard choice for most applications. Only switch if you have a specific reason or observe specific problems in your network.