Residual Connections and Gating Mechanisms

Course · Advanced Concepts
Understanding fundamental building blocks that enable training of deep neural networks: residual connections and gating mechanisms.
Author: Remi Genet
Published: 2025-04-03

Section 4.5 - The Optimization Challenge in Deep Networks

Training deep neural networks faces a fundamental challenge: as networks become deeper, the gradient signal weakens exponentially as it propagates backward through the layers. Consider a network of \( L \) layers where each layer \( l \) performs a transformation \( h_l = f_l(h_{l-1}) \). The gradient of the loss \( L \) with respect to an early layer’s activations follows the chain rule:

\[ \frac{\partial L}{\partial h_k} = \frac{\partial L}{\partial h_L} \cdot \frac{\partial h_L}{\partial h_{L-1}} \cdot \ldots \cdot \frac{\partial h_{k+1}}{\partial h_k}. \]

Each Jacobian matrix \( \frac{\partial h_{l+1}}{\partial h_l} \) typically has eigenvalues of magnitude less than 1, because nonlinear activations such as tanh or ReLU attenuate or zero out part of the signal. Repeated multiplication of these Jacobians makes the gradient shrink exponentially with depth, so learning becomes ineffective in the early layers.
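
To get a feel for the scale of the effect, the following NumPy sketch (an illustration of the argument above, not code from the course) backpropagates a gradient through 50 random Jacobian-like matrices, each shrinking the gradient norm by roughly 10% on average:

    # Illustration only: simulate gradient shrinkage through a deep stack of layers
    # whose Jacobians contract the signal by about 10% per layer on average.
    import numpy as np

    rng = np.random.default_rng(0)
    d, depth = 64, 50                    # hidden width and number of layers (arbitrary)

    grad = np.ones(d)                    # gradient arriving at the last layer
    for _ in range(depth):
        J = 0.9 * rng.standard_normal((d, d)) / np.sqrt(d)  # Jacobian-like matrix
        grad = J @ grad                  # one backward step through a plain layer

    print(np.linalg.norm(grad))          # orders of magnitude below the initial norm of 8

Even this mild per-layer contraction leaves almost no gradient for the earliest layers after a few dozen steps, which is exactly the regime in which plain deep networks stop learning.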

Section 4.6 - Residual Connections: A Direct Path for Gradients

Mathematical Formulation

A residual connection modifies the standard layer transformation from:

\[ h_{l+1} = F_l(h_l) \]

to:

\[ h_{l+1} = F_l(h_l) + h_l, \]

where \( F_l \) represents the layer’s nonlinear transformation (typically a sequence of neural network layers). This creates an identity shortcut connection in parallel with the transformation.
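
As a concrete illustration, here is a minimal Keras 3 sketch of this pattern (my own example; the width of 32 units and the two-layer form chosen for \( F_l \) are arbitrary assumptions):

    # Minimal sketch of a residual block: h_{l+1} = F_l(h_l) + h_l.
    import keras
    from keras import layers

    def residual_block(x, units):
        """Apply a small transformation F_l and add the identity shortcut."""
        f = layers.Dense(units, activation="relu")(x)  # first layer of F_l
        f = layers.Dense(units)(f)                     # second (linear) layer of F_l
        return layers.Add()([x, f])                    # identity shortcut: x + F_l(x)

    inputs = keras.Input(shape=(32,))
    h = layers.Dense(32)(inputs)                       # project to the block width
    for _ in range(4):                                 # stack a few residual blocks
        h = residual_block(h, 32)
    outputs = layers.Dense(1)(h)
    model = keras.Model(inputs, outputs)

The only structural requirement is that \( F_l(h_l) \) and \( h_l \) have the same shape, so that the element-wise addition is well defined.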

Gradient Flow Analysis

The gradient now flows through two paths:

\[ \frac{\partial L}{\partial h_l} = \frac{\partial L}{\partial h_{l+1}} \cdot \left(\frac{\partial F_l}{\partial h_l} + I\right), \]

where \( I \) is the identity matrix. This additive identity term ensures that even if \( \frac{\partial F_l}{\partial h_l} \) becomes very small, the gradient can still flow effectively through the identity path.
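
Unrolling this relation over a stack of residual layers makes the benefit explicit (a standard expansion, added here for clarity):

\[ \frac{\partial L}{\partial h_k} = \frac{\partial L}{\partial h_L} \prod_{l=k}^{L-1} \left( I + \frac{\partial F_l}{\partial h_l} \right) = \frac{\partial L}{\partial h_L} \left( I + \sum_{l=k}^{L-1} \frac{\partial F_l}{\partial h_l} + \cdots \right). \]

The standalone identity term gives the loss gradient a direct, unattenuated path back to layer \( k \), no matter how small the individual Jacobians \( \frac{\partial F_l}{\partial h_l} \) become.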

Section 4.7 - Gated Linear Units (GLU)

While residual connections provide stable gradient paths, we often want more control over information flow. Gated Linear Units offer a learnable mechanism to modulate signals adaptively.

Mathematical Framework

For an input \( x \in \mathbb{R}^d \), a GLU computes:

\[ \operatorname{GLU}(x) = \sigma(W_g x + b_g) \odot (W_t x + b_t), \]

where:

  • \( \sigma \) is the sigmoid function,
  • \( W_g, W_t \in \mathbb{R}^{d \times d} \) are weight matrices,
  • \( b_g, b_t \in \mathbb{R}^d \) are bias vectors,
  • \( \odot \) denotes element-wise multiplication.

The sigmoid gate \( \sigma(W_g x + b_g) \) produces values in \( [0, 1] \), controlling how much of the transformed signal \( W_t x + b_t \) passes through. This mechanism allows the network to adaptively modulate information flow based on the input.
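
A minimal Keras 3 implementation sketch of this formula (illustrative; the layer and class names are my own):

    # Illustrative GLU layer: GLU(x) = sigmoid(W_g x + b_g) ⊙ (W_t x + b_t).
    import keras
    from keras import layers

    class GLU(layers.Layer):
        def __init__(self, units, **kwargs):
            super().__init__(**kwargs)
            self.gate = layers.Dense(units, activation="sigmoid")  # sigma(W_g x + b_g)
            self.transform = layers.Dense(units)                   # W_t x + b_t

        def call(self, x):
            # Element-wise product: the gate decides how much of each
            # transformed feature passes through.
            return self.gate(x) * self.transform(x)

A gate value near 0 suppresses the corresponding feature entirely, while a value near 1 lets the linear transformation pass through unchanged.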

Section 4.8 - Gated Residual Networks (GRN)

Figure: GRN Cell Structure

Gated Residual Networks combine residual connections with gating mechanisms to create a powerful building block for deep networks.

Architecture

A GRN processes its input through several stages:

  1. Input Transformation:

    \[ \eta_2 = \operatorname{ELU}(W_2 x + b_2) \]

  2. Linear Projection:

    \[ \eta_1 = W_1 \eta_2 + b_1 \]

  3. Gated Output:

    \[ \operatorname{GRN}(x) = \operatorname{LayerNorm}\Bigl(x + \operatorname{GLU}(\eta_1)\Bigr) \]

Here:

  • ELU is the Exponential Linear Unit activation,
  • LayerNorm provides normalization for stability,
  • the residual connection (the \( +x \) term) maintains gradient flow.
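
Putting the three stages together, a GRN could be sketched in Keras 3 as follows (my own assembly of the equations above; the GLU stage is written out with two Dense layers as in Section 4.7, and the layer width is assumed to match the input dimension so that the residual addition is valid):

    # Illustrative GRN layer: GRN(x) = LayerNorm(x + GLU(eta_1)),
    # with eta_2 = ELU(W_2 x + b_2) and eta_1 = W_1 eta_2 + b_1.
    import keras
    from keras import layers

    class GatedResidualNetwork(layers.Layer):
        def __init__(self, units, **kwargs):
            super().__init__(**kwargs)
            self.dense_elu = layers.Dense(units, activation="elu")  # eta_2 = ELU(W_2 x + b_2)
            self.dense_linear = layers.Dense(units)                 # eta_1 = W_1 eta_2 + b_1
            self.gate = layers.Dense(units, activation="sigmoid")   # GLU gate
            self.transform = layers.Dense(units)                    # GLU linear transform
            self.norm = layers.LayerNormalization()

        def call(self, x):
            # 'units' must equal the last dimension of x so that x + GLU(eta_1) is valid.
            eta2 = self.dense_elu(x)
            eta1 = self.dense_linear(eta2)
            gated = self.gate(eta1) * self.transform(eta1)          # GLU(eta_1)
            return self.norm(x + gated)                             # LayerNorm(x + GLU(eta_1))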

Functional Properties

This architecture offers several key advantages:

  1. The residual connection ensures stable gradient propagation through deep networks.
  2. The gating mechanism (GLU) allows the network to modulate the transformation’s contribution adaptively.
  3. Layer normalization stabilizes activations across the network’s depth.
  4. The ELU activation provides smooth gradients and helps mitigate the vanishing gradient problem further.

The combination of these elements creates a powerful building block that enables the training of very deep networks while maintaining effective learning dynamics. GRNs have proven particularly effective in complex tasks requiring deep architectures, as they provide both the stability of residual connections and the flexibility of adaptive gating.
