Activation Functions: Adding Non-linearity

Course: Fundamentals
Understanding activation functions, their mathematical properties, and roles in neural networks.
Author: Remi Genet
Published: 2025-04-03

Activation Functions: The Non-linear Component

Section 1.36 - Why We Need Activations

Role in Neural Networks

  1. Add Non-linearity: Without activations, a stack of layers collapses into a single linear (affine) transformation.
  2. Control Output Range: Normalize or bound outputs to specific ranges.
  3. Enable Gradient Flow: Different activations affect how gradients propagate.
Impact of Non-linearity

Let’s understand why non-linear activations are crucial:

  1. Linear Case (No activation):
# Layer 1
h1 = w11*x1 + w12*x2 + b1
# Layer 2
output = w21*h1 + b2

# Substituting h1 into output:
output = w21*(w11*x1 + w12*x2 + b1) + b2
      = (w21*w11)*x1 + (w21*w12)*x2 + (w21*b1 + b2)
      = W1*x1 + W2*x2 + B  # Still just a linear function!

No matter how many layers we stack, without activation functions, we just get another linear function. This means we can’t learn complex patterns!
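
To see this concretely, here is a minimal NumPy sketch (weights and sizes are arbitrary) showing that two stacked linear layers collapse into a single one:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                 # batch of 4 samples, 3 features

# Two linear layers, no activation in between
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)
two_layers = (x @ W1 + b1) @ W2 + b2

# The equivalent single linear layer
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))   # True: stacking changed nothing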

  2. Non-linear Case (With ReLU):
# Layer 1
h1 = relu(w11*x1 + w12*x2 + b1)
     = max(0, w11*x1 + w12*x2 + b1)
# Layer 2
output = w21*h1 + b2

Now the network can learn:

  • Different behaviors for positive/negative inputs
  • AND/OR-like logic gates
  • Curved decision boundaries
  • Complex feature interactions
Visual Example: XOR Function

Consider an XOR function:

# Truth table
# x1  x2  | output
# 0   0   |   0
# 0   1   |   1
# 1   0   |   1
# 1   1   |   0

# XOR implementation with ReLU
# Layer 1: create two hidden units
h1 = relu(x1 + x2)          # "At least one input is 1": 0, 1, 1, 2
h2 = relu(x1 + x2 - 1)      # "Both inputs are 1":       0, 0, 0, 1

# Layer 2: combine to get XOR
output = h1 - 2*h2  # Subtract twice the "both-1" case: 0, 1, 1, 0

This demonstrates how non-linear activations allow the network to:

  1. Create complex decision boundaries
  2. Learn hierarchical features
  3. Approximate any continuous function (Universal Approximation Theorem)
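
As a quick sanity check, this construction can be evaluated directly on the truth table; a minimal NumPy sketch:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h1 = relu(x1 + x2)           # "at least one input is 1" piece
    h2 = relu(x1 + x2 - 1)       # "both inputs are 1" piece
    print(x1, x2, h1 - 2 * h2)   # 0, 1, 1, 0 -> XOR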

Section 1.37 - Common Activation Functions

1. ReLU (Rectified Linear Unit)

\[ f(x) = \max(0, x) \]

Properties:

  • Easy gradient computation
  • Sparse activation (outputs exact zeros)
  • No upper bound
  • Default choice for hidden layers

Use Cases:

  • Hidden layers in most networks
  • Deep networks (helps with vanishing gradient)
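
A minimal NumPy sketch of ReLU and its gradient (sample values chosen for illustration):

import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

relu_out = np.maximum(0.0, x)        # [0., 0., 0., 0.5, 2.] -> exact zeros (sparse)
relu_grad = (x > 0).astype(float)    # [0., 0., 0., 1., 1.]  -> gradient is 0 or 1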

2. Sigmoid

\[ f(x) = \frac{1}{1 + e^{-x}} \]

Properties:

  • Output range: (0, 1)
  • Smooth gradient
  • Can saturate (gradient ≈ 0 for large \(\lvert x \rvert\))

Use Cases:

  • Binary classification output layer
  • Gates in LSTM/GRU cells
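
The saturation behaviour is easy to check numerically; a short NumPy sketch (values chosen for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
s = sigmoid(x)
grad = s * (1.0 - s)    # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
# grad peaks at x = 0 (0.25) and is nearly 0 for large |x| (saturation)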

3. Tanh (Hyperbolic Tangent)

\[ f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \]

Properties:

  • Output range: (-1, 1)
  • Zero-centered
  • Stronger gradients than sigmoid

Use Cases:

  • Hidden layers in shallow networks
  • RNN/LSTM states
  • Normalized outputs
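
A quick NumPy check of the "stronger gradients" claim, comparing derivatives at the origin:

import numpy as np

x = 0.0
tanh_grad = 1.0 - np.tanh(x) ** 2     # tanh'(0) = 1.0
s = 1.0 / (1.0 + np.exp(-x))
sigmoid_grad = s * (1.0 - s)          # sigmoid'(0) = 0.25
# tanh is also zero-centered: tanh(0) = 0, whereas sigmoid(0) = 0.5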

4. Softmax

\[ f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} \]

Properties:

  • Outputs sum to 1
  • Emphasizes largest values
  • Converts logits to probabilities

Use Cases:

  • Multiclass classification output layer
  • Attention mechanisms
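
A minimal NumPy sketch turning logits into probabilities (with the usual max-shift for numerical stability):

import numpy as np

def softmax(logits):
    z = logits - np.max(logits)    # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

probs = softmax(np.array([2.0, 1.0, 0.1]))
# probs ≈ [0.659, 0.242, 0.099]: they sum to 1 and the largest logit dominates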

5. LeakyReLU

\[ f(x) = \max(\alpha x, x) \quad \text{where } \alpha \approx 0.01 \]

Properties:

  • Like ReLU but allows small negative gradients
  • Helps prevent the “dying ReLU” problem

Use Cases:

  • Alternative to ReLU in deep networks
  • When dead neurons are an issue
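
A short NumPy sketch contrasting ReLU and LeakyReLU on negative inputs (using α = 0.01 as above):

import numpy as np

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
alpha = 0.01

relu_out = np.maximum(0.0, x)           # [ 0.  ,  0.  , 0., 1., 3.]
leaky_out = np.maximum(alpha * x, x)    # [-0.03, -0.01, 0., 1., 3.]
# Negative inputs keep a small non-zero slope (alpha) instead of "dying"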

Section 1.38 - Choosing Activation Functions

Guidelines by Layer Type

Hidden Layers:

  1. Start with ReLU (default choice)
  2. Try LeakyReLU if you see dead neurons
  3. Use Tanh for shallow networks or RNNs

Output Layer:

Task Type               Activation      Output Range
Binary Classification   Sigmoid         (0, 1)
Multiclass              Softmax         [0, 1] (sum to 1)
Regression              Linear (none)   (-∞, ∞)
Bounded Regression      Tanh            (-1, 1)

Implementation in Keras

from keras import layers

# As string
layer = layers.Dense(64, activation='relu')

# As function
from keras.activations import relu
layer = layers.Dense(64, activation=relu)

# As a separate layer (applied after a linear Dense layer)
layer = layers.Dense(64)
activation = layers.Activation('relu')
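
For context, a small hypothetical model combining these styles (layer sizes and input shape are arbitrary):

from keras import layers, models

model = models.Sequential([
    layers.Input(shape=(16,)),
    layers.Dense(64, activation='relu'),   # activation passed as a string
    layers.Dense(64),                      # linear Dense layer...
    layers.Activation('tanh'),             # ...followed by a separate Activation layer
    layers.Dense(1),                       # linear output (e.g. for regression)
])
model.summary()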

Section 1.39 - Universal Approximation

Why Neural Networks Can Approximate Any Function

The Universal Approximation Theorem states that a neural network with:

  • One hidden layer (wide enough)
  • A continuous, non-polynomial activation function (e.g., sigmoid or ReLU)
  • A linear output layer

can approximate any continuous function on a compact domain to arbitrary precision.

Example: Approximating a Complex Function

Consider \[ f(x) = x^2 \]:

# Can be approximated by a small ReLU network (piecewise-linear fit on [0, 1.5]):
def approx_square(x):
    # Each ReLU kink adds an extra unit of slope, so the total slope
    # grows from 0 to 1 to 2, roughly tracking f'(x) = 2x
    h1 = relu(x - 0.0) * 0.0    # slope 0 on [0, 0.5]
    h2 = relu(x - 0.5) * 1.0    # adds slope 1 after x = 0.5
    h3 = relu(x - 1.0) * 1.0    # adds one more unit of slope after x = 1.0
    return h1 + h2 + h3         # combine the pieces
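
A quick numerical check of this sketch (assuming NumPy, with relu written out explicitly):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

xs = np.linspace(0.0, 1.5, 7)
approx = np.array([approx_square(x) for x in xs])
print(np.round(xs ** 2, 2))    # target values of x^2
print(np.round(approx, 2))     # coarse piecewise-linear approximation
# Adding more ReLU pieces (more neurons) tightens the fit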

Key Points:

  1. More neurons = finer approximation
  2. Deeper networks = more efficient approximation
  3. Choice of activation affects:
    • How many neurons are needed
    • How easy it is to train
    • The quality of approximation

Practical Implications:

  • The theoretical capability does not necessarily mean it’s easy to train.
  • ReLU often works best in practice because:
    • Its linear pieces are easy to optimize
    • It promotes sparse activation (many exact zeros)
    • It provides strong gradient flow when active

Modern Variants

  1. GELU (Gaussian Error Linear Unit)

\[ f(x) = x \cdot \Phi(x) \]

where \[\Phi(x)\] is the cumulative distribution function of the standard normal distribution.

  • Smooth transitions, similar to sigmoid
  • Used in transformers (BERT, GPT)

  2. SiLU/Swish (Sigmoid Linear Unit)

\[ f(x) = x \cdot \operatorname{sigmoid}(x) = \frac{x}{1 + e^{-x}} \]

  • Similar to GELU but computationally simpler
  • Strong performance in deep networks
  3. SELU (Scaled Exponential Linear Unit)

\[ f(x) = \lambda \begin{cases} x, & \text{if } x > 0, \\ \alpha \left(e^x - 1\right), & \text{if } x \leq 0, \end{cases} \]

where \[\lambda \approx 1.0507\] and \[\alpha \approx 1.6733\].

  • Self-normalizing properties
  • Well suited to deep fully-connected networks
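
All three are available by name in Keras; a brief sketch (assuming the built-in string identifiers 'gelu', 'silu' and 'selu'):

from keras import layers

hidden_gelu = layers.Dense(64, activation='gelu')   # GELU, as used in transformers
hidden_silu = layers.Dense(64, activation='silu')   # SiLU / Swish
hidden_selu = layers.Dense(64, activation='selu')   # SELU (often paired with lecun_normal initialization)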

Important Note

While there are many activation functions, ReLU is the standard choice for most applications. Only switch if you have a specific reason or observe specific problems in your network.
