Understanding Optimizers

Course
Fundamentals
Deep dive into optimization algorithms in deep learning, their mathematical foundations and practical applications.
Author

Remi Genet

Published

2025-04-03

Optimization Algorithms in Deep Learning

Section 3.5 - From Gradient Descent to Modern Optimizers

Basic Gradient Descent

The foundation of all optimization in deep learning is the basic gradient descent update:

\[ \theta_{t+1} = \theta_t - \eta \nabla L(\theta_t) \]

where:

  • \( \theta_t \) denotes the parameters at time \( t \)
  • \( \eta \) is the learning rate
  • \( \nabla L(\theta_t) \) is the gradient of the loss function at \( \theta_t \)

Note

The choice of learning rate \( \eta \) is crucial:

  • Too large: overshooting, divergence
  • Too small: slow convergence
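
To make the update rule concrete, here is a minimal NumPy sketch (not part of the course code) that applies it to a toy quadratic loss \( L(\theta) = \lVert \theta \rVert^2 \):

import numpy as np

# Toy loss L(theta) = ||theta||^2, whose gradient is 2 * theta.
def grad_L(theta):
    return 2.0 * theta

theta = np.array([1.0, -2.0])  # initial parameters
eta = 0.1                      # learning rate

for t in range(100):
    # theta_{t+1} = theta_t - eta * grad L(theta_t)
    theta = theta - eta * grad_L(theta)

print(theta)  # converges towards the minimizer [0, 0]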

Section 3.6 - Modern Optimizers

1. Momentum

Momentum adds a “velocity” term to parameter updates:

\[ v_t = \gamma v_{t-1} + \eta \nabla L(\theta_t) \]

\[ \theta_{t+1} = \theta_t - v_t \]

where:

  • \( \gamma \) is the momentum coefficient (typically 0.9)
  • \( v_t \) is the velocity term

Benefits:

  • Helps escape local minima
  • Smooths oscillations
  • Faster convergence in ravines
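
As a sketch only, the two update equations translate directly into the toy NumPy setup used above for plain gradient descent:

import numpy as np

def grad_L(theta):
    return 2.0 * theta        # gradient of the toy loss ||theta||^2

theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)      # velocity term, v_0 = 0
eta, gamma = 0.1, 0.9         # learning rate and momentum coefficient

for t in range(100):
    v = gamma * v + eta * grad_L(theta)  # v_t = gamma * v_{t-1} + eta * grad L(theta_t)
    theta = theta - v                    # theta_{t+1} = theta_t - v_t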

2. RMSprop

RMSprop adapts learning rates on a per-parameter basis:

\[ r_t = \rho r_{t-1} + (1 - \rho)(\nabla L(\theta_t))^2 \]

\[ \theta_{t+1} = \theta_t - \frac{\eta \nabla L(\theta_t)}{\sqrt{r_t + \epsilon}} \]

where:

  • \( \rho \) is the decay rate (typically 0.9)
  • \( \epsilon \) is a small constant for numerical stability
  • \( r_t \) maintains a running average of squared gradients
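
In Keras, these hyperparameters map directly onto the RMSprop constructor; the tiny model below is only a placeholder to show where the optimizer plugs in:

import keras

# rho is the decay rate of the squared-gradient average, epsilon the stability constant.
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9, epsilon=1e-7)

model = keras.Sequential([keras.Input(shape=(8,)), keras.layers.Dense(1)])
model.compile(optimizer=optimizer, loss="mse")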

3. Adam (Adaptive Moment Estimation)

Adam combines momentum and adaptive learning rates. Its update equations are as follows:

\[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla L(\theta_t) \]

\[ v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla L(\theta_t))^2 \]

Bias correction:

\[ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} \]

\[ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \]

Parameter update:

\[ \theta_{t+1} = \theta_t - \frac{\eta \hat{m}_t}{\sqrt{\hat{v}_t + \epsilon}} \]

where:

  • \( \beta_1 \) is the first moment decay (typically 0.9)
  • \( \beta_2 \) is the second moment decay (typically 0.999)
  • \( \epsilon \) is a numerical stability term (typically \( 10^{-8} \))

Tip

Adam is currently the most widely used optimizer because:

  • It works well out of the box
  • It adapts the learning rate per parameter
  • It combines the benefits of momentum and RMSprop
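
For reference, the hyperparameters quoted above can be passed explicitly to the Keras Adam optimizer (they are close to, but not identical to, Keras' own defaults, which use a slightly larger epsilon):

import keras

optimizer = keras.optimizers.Adam(
    learning_rate=0.001,  # eta
    beta_1=0.9,           # first moment decay
    beta_2=0.999,         # second moment decay
    epsilon=1e-8,         # numerical stability term
)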

4. AdaGrad

AdaGrad adapts the learning rate based on the history of gradients:

\[ r_t = r_{t-1} + (\nabla L(\theta_t))^2 \]

\[ \theta_{t+1} = \theta_t - \frac{\eta \nabla L(\theta_t)}{\sqrt{r_t + \epsilon}} \]

Note:

  • The effective learning rate decreases over time, since the accumulated sum \( r_t \) only grows
  • As a result, it can stop learning too early on deep networks
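
A minimal NumPy sketch of these updates, again on the toy quadratic loss, shows how the accumulated sum \( r_t \) shrinks the effective step size over time:

import numpy as np

def grad_L(theta):
    return 2.0 * theta

theta = np.array([1.0, -2.0])
r = np.zeros_like(theta)      # accumulated squared gradients
eta, eps = 0.1, 1e-8

for t in range(100):
    g = grad_L(theta)
    r = r + g ** 2                              # r_t = r_{t-1} + (grad)^2
    theta = theta - eta * g / np.sqrt(r + eps)  # per-parameter scaled step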

5. Nesterov Accelerated Gradient (NAG)

Nesterov Accelerated Gradient (NAG) is a modified momentum method that looks ahead:

\[ v_t = \gamma v_{t-1} + \eta \nabla L\Bigl(\theta_t - \gamma v_{t-1}\Bigr) \]

\[ \theta_{t+1} = \theta_t - v_t \]

Benefits:

  • More responsive than standard momentum
  • Better theoretical convergence properties
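
In Keras, NAG is not a separate optimizer: it is the SGD optimizer with momentum and the nesterov flag enabled (the learning rate below is illustrative):

import keras

# Momentum SGD with the Nesterov look-ahead step enabled.
optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)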

Section 3.7 - Choosing Learning Rates

General Guidelines

  1. Starting Learning Rate:
# Common ranges per optimizer
learning_rates = {
    'sgd': 0.01,      # Higher because no adaptation
    'momentum': 0.01,  # Similar to SGD
    'adam': 0.001,    # Lower due to adaptation
    'rmsprop': 0.001  # Lower due to adaptation
}
  2. Learning Rate Schedules:

Step Decay:

\[ \eta_t = \eta_0 \cdot \gamma^{\lfloor t/s \rfloor} \]

where:

  • \( \eta_0 \) is the initial learning rate
  • \( \gamma \) is the decay factor
  • \( s \) is the step size

Exponential Decay:

\[ \eta_t = \eta_0 \cdot e^{-kt} \]

where \( k \) is the decay rate.
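
Both schedules can be wired into Keras: the step decay via a LearningRateScheduler callback, and the exponential decay via the built-in ExponentialDecay schedule, which is an equivalent parameterization of \( \eta_0 e^{-kt} \). All numeric values below are illustrative, not recommendations:

import keras

# Step decay: eta_t = eta_0 * gamma^floor(t / s), applied once per epoch.
def step_decay(epoch, lr, eta0=0.01, gamma=0.5, s=10):
    return eta0 * gamma ** (epoch // s)

step_callback = keras.callbacks.LearningRateScheduler(step_decay)

# Exponential decay: initial_lr * decay_rate^(step / decay_steps),
# i.e. eta_0 * exp(-k * t) with k = -ln(decay_rate) / decay_steps.
lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=1000,
    decay_rate=0.96,
)
optimizer = keras.optimizers.SGD(learning_rate=lr_schedule)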

Section 3.8 - Optimizer Selection Guide

When to Use Each Optimizer

  1. Adam:
    • Default choice for most problems.
    • Good for large datasets.
    • Works well with sparse gradients.
  2. SGD + Momentum:
    • Can give better final convergence.
    • Often used in computer vision.
    • May require careful learning rate tuning.
  3. RMSprop:
    • Good for recurrent networks.
    • Handles non-stationary objectives.
    • Serves as an alternative to Adam.

Key Considerations

  1. Computational Cost:
    • Adam requires more memory.
    • SGD is computationally simpler.
    • There is a trade-off between memory usage and convergence speed.
  2. Problem Type:
    • For CNNs: SGD with Momentum or Adam.
    • For RNNs: RMSprop or Adam.
    • For dense networks: Any optimizer works well, but Adam is often preferred.
  3. Dataset Size:
    • Large datasets: Adaptive methods (like Adam) are advantageous.
    • Small datasets: Simpler methods may generalize better.
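
As a closing illustration (the architecture and values here are placeholders, not the course's), swapping optimizers in Keras only changes the argument passed to model.compile:

import keras

model = keras.Sequential([keras.Input(shape=(16,)), keras.layers.Dense(1)])

# Pick the optimizer that matches the problem; nothing else needs to change.
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),  # default choice
    # optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),  # vision-style setup
    # optimizer=keras.optimizers.RMSprop(learning_rate=0.001),           # recurrent models
    loss="mse",
)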