RNN Limitations: Computational Challenges

Course: Deep Learning
Understanding the computational limitations of RNNs and their impact on training efficiency.
Author: Remi Genet
Published: 2025-04-03

RNN Limitations and Performance Bottlenecks

Section 2.12 - Sequential Computation Issues

The Sequential Nature Problem

RNNs process sequences step by step, which creates an inherent limitation. Conceptually, an RNN processes a sequence as follows:

# Conceptual RNN processing
for t in range(sequence_length):
    h[t] = f(x[t], h[t - 1])  # must wait for the previous step

This contrasts with fully parallel architectures such as CNNs:

# Parallel processing (e.g., in CNNs)
output = W @ input  # Single matrix multiplication

GPU Underutilization

Modern GPUs excel at:

  • Large matrix multiplications
  • Parallel operations
  • Batch processing

But RNNs force sequential processing that can't fully utilize this hardware.

Impact on Training Speed

Consider processing a sequence of length \(T\):

  1. Sequential Steps:
    • Must compute \(T\) steps one after another.
    • Each step depends on previous results.
    • No parallelization is possible across time steps.
  2. Memory Dependencies:

Instead of being computed in parallel, each hidden state is computed as:

\[ h_t = \tanh\Bigl(W_h \cdot h_{t-1} + W_x \cdot x_t\Bigr) \]

Each computation must wait for \(h_{t-1}\).
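
As a purely illustrative sketch (arbitrary sizes and names), the recurrence above can be written directly in NumPy. Note that the input-to-hidden products can be batched into a single matrix multiplication, but the recurrence itself must be unrolled one step at a time:

# Minimal NumPy sketch of the recurrence above (illustrative sizes).
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 100, 8, 32                        # sequence length, input dim, hidden dim
x = rng.standard_normal((T, D))
W_h = 0.1 * rng.standard_normal((H, H))
W_x = 0.1 * rng.standard_normal((H, D))

# The input-to-hidden part is fully parallel: one matmul for all timesteps.
inputs_part = x @ W_x.T                     # shape (T, H)

# The recurrence is not: step t cannot start before step t-1 finishes.
h = np.zeros(H)
for t in range(T):
    h = np.tanh(W_h @ h + inputs_part[t])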

Section 2.13 - Gradient Flow Issues

The Vanishing Gradient Problem

When computing gradients through time, we have:

\[ \frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial h_T} \cdot \prod_{i=t}^{T-1} \frac{\partial h_{i+1}}{\partial h_i} \]

Issues arise because of:

  1. Multiple matrix multiplications.
  2. Activation function derivatives.
  3. Long dependency chains.

Example: Simple RNN Gradient

For a simple RNN with \(h_{t+1} = \tanh(W_h \cdot h_t + W_x \cdot x_{t+1})\), the Jacobian at each time step is:

\[ \frac{\partial h_{t+1}}{\partial h_t} = \operatorname{diag}\Bigl(1 - h_{t+1}^2\Bigr) \cdot W_h \]

This factor appears \(T - t\) times in the gradient at time \(t\), leading to either:

  • Vanishing gradients if \(|W_h| < 1\),
  • Exploding gradients if \(|W_h| > 1\).
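
A small numerical experiment makes this concrete. The sketch below is an illustration only: it backpropagates a gradient vector through many steps using a fixed random \(W_h\) rescaled to a chosen spectral radius, and it drops the \(\tanh\) derivative term (which, in practice, only makes gradients shrink faster):

# Illustrative only: repeated Jacobian products shrink or blow up the gradient.
import numpy as np

rng = np.random.default_rng(0)
H, steps = 16, 100
W = rng.standard_normal((H, H))

for radius in (0.9, 1.1):                              # largest |eigenvalue| of W_h
    W_h = W * radius / np.abs(np.linalg.eigvals(W)).max()
    g = np.ones(H)                                     # stand-in for dL/dh_T
    for _ in range(steps):
        g = W_h.T @ g                                  # one backprop step through time
    print(f"spectral radius {radius}: |grad| after {steps} steps = {np.linalg.norm(g):.2e}")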

Section 2.14 - Hardware Utilization

GPU Architecture Mismatch

Modern GPUs and RNN computations are mismatched in three ways:

  1. What GPUs Want (Massive Parallelism):

# What GPUs want (parallel operations)
Y = X @ W  # Single large matrix multiplication

  2. What RNNs Require (Sequential Operations):

for t in range(T):
    h[t] = f(W @ h[t-1])  # Sequential dependencies

  3. Memory Access Patterns:
    • GPUs optimize for coalesced memory access.
    • RNNs require scattered, sequential access.

Computational Complexity

For a sequence of length \(T\) with hidden size \(H\):

  1. Forward Pass:
    • \(O(T)\) sequential steps.
    • Each step: \(O(H^2)\) operations.
    • Total: \(O(T H^2)\) operations.
  2. Backward Pass:
    • Must store all intermediate states.
    • Memory requirement: \(O(T H)\).
    • Cannot parallelize across time steps.
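
To make these asymptotics tangible, here is a rough back-of-the-envelope count for one plausible setting. The numbers are illustrative assumptions, and only the \(H \times H\) recurrent multiply at each step is counted:

# Rough cost estimate for one forward pass (illustrative numbers).
T, H = 1_000, 256                     # sequence length, hidden size
ops_per_step = H * H                  # O(H^2) multiply-adds for W_h @ h
total_ops = T * ops_per_step          # O(T * H^2), executed strictly in order
stored_floats = T * H                 # O(T * H) hidden states kept for backprop
print(f"{total_ops:,} sequential multiply-adds, {stored_floats:,} stored activations")
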
Modern Solutions

  1. Truncated BPTT:
    • Limit gradient flow to \(k\) steps (see the windowing sketch after this list).
    • Reduces memory requirements.
    • May miss long-term dependencies.
  2. Batch Processing:
    • Process multiple sequences together.
    • Improves GPU utilization.
    • Still limited by the sequential nature.
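
One common way to get truncated BPTT in practice is simply to cut a long series into windows of length \(k\), so gradients never flow across window boundaries. The sketch below is a minimal, hypothetical Keras example of that idea (toy data, arbitrary sizes), not the course's reference implementation:

# Hypothetical sketch: truncated BPTT via windowing (toy data, arbitrary sizes).
import numpy as np
import keras

T, k, H = 10_000, 50, 32                                     # full length, window size, hidden units
series = np.random.randn(T, 1).astype("float32")

n = (T // k) * k
windows = series[:n].reshape(-1, k, 1)                       # (T // k, k, 1)
targets = np.roll(series, -1, axis=0)[:n].reshape(-1, k, 1)  # next-step targets

model = keras.Sequential([
    keras.Input(shape=(k, 1)),
    keras.layers.SimpleRNN(H, return_sequences=True),        # gradients flow at most k steps
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(windows, targets, batch_size=64, epochs=1, verbose=0)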

Section 2.15 - Alternative Approaches

Attention-Based Solutions

Modern architectures try to address these limitations. For example, Transformers process all timesteps in parallel:

# All timesteps processed in parallel
Q = X @ W_Q
K = X @ W_K
V = X @ W_V
attention = softmax(Q @ K.T) @ V
  • Full parallelization.
  • Direct access to any timestep.
  • Better GPU utilization.
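
For completeness, here is a runnable NumPy version of the sketch above, with the usual \(1/\sqrt{d}\) scaling added; the shapes and names are illustrative assumptions:

# Runnable NumPy version of the attention sketch (illustrative shapes).
import numpy as np

rng = np.random.default_rng(0)
T, D = 50, 16                                   # timesteps, model dimension
X = rng.standard_normal((T, D))
W_Q, W_K, W_V = (rng.standard_normal((D, D)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V             # all timesteps at once
scores = Q @ K.T / np.sqrt(D)                   # (T, T) pairwise interactions
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
attention = weights @ V                         # every output sees every timestep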

Hybrid Approaches

  • Combine RNN and attention (a minimal sketch follows below).
  • Balance sequential processing and parallelism.
  • Trade memory for speed.
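
As one possible hybrid, the sketch below feeds LSTM states into a dot-product attention layer in Keras. The architecture and sizes are assumptions chosen for demonstration, not a recommended design:

# Hypothetical hybrid: LSTM states re-weighted by attention (illustrative sizes).
import keras
from keras import layers

T, D, H = 60, 4, 32                                       # window length, features, hidden size

inputs = keras.Input(shape=(T, D))
states = layers.LSTM(H, return_sequences=True)(inputs)    # sequential encoding
context = layers.Attention()([states, states])            # parallel look-back over all states
pooled = layers.GlobalAveragePooling1D()(context)
outputs = layers.Dense(1)(pooled)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
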
Key Takeaways
  1. RNN limitations are fundamentally tied to their sequential nature.
  2. Modern hardware (GPUs) is optimized for parallel operations.
  3. Solutions involve either:
    • Accepting the sequential limitations,
    • Using alternative architectures,
    • Or employing hybrid approaches.