Understanding Optimizers
Optimization Algorithms in Deep Learning
Section 3.5 - From Gradient Descent to Modern Optimizers
Basic Gradient Descent
The foundation of optimization in deep learning is the basic gradient descent update:
\[ \theta_{t+1} = \theta_t - \eta \nabla L(\theta_t) \]
where:
- \(\theta_t\) is the parameter at time \(t\)
- \(\eta\) is the learning rate
- \(\nabla L(\theta_t)\) is the gradient of the loss function at \(\theta_t\)
The choice of learning rate \(\eta\) is crucial:
- Too large: overshooting, divergence
- Too small: slow convergence
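As a concrete illustration, here is a minimal NumPy sketch of this update on an assumed toy quadratic loss (the loss and the function names are illustrative only, not a reference implementation); it also shows the effect of an overly large learning rate:

```python
import numpy as np

# Illustrative toy loss L(theta) = 0.5 * ||theta||^2, whose gradient is simply theta.
def grad_loss(theta):
    return theta

def gradient_descent(theta0, lr=0.1, steps=50):
    """Plain gradient descent: theta <- theta - lr * grad L(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad_loss(theta)
    return theta

print(gradient_descent([3.0, -2.0], lr=0.1))   # shrinks toward the minimum at 0
print(gradient_descent([3.0, -2.0], lr=2.5))   # too-large learning rate: iterates blow up
```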
Section 3.6 - Modern Optimizers
1. Momentum
Momentum adds a “velocity” term to parameter updates:
\[ v_t = \gamma v_{t-1} + \eta \nabla L(\theta_t) \]
\[ \theta_{t+1} = \theta_t - v_t \]
where:
- \(\gamma\) is the momentum coefficient (typically 0.9)
- \(v_t\) is the velocity term
Benefits:
- Helps escape local minima
- Smooths oscillations
- Faster convergence in ravines
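A minimal sketch of the momentum update, again assuming the same illustrative toy quadratic loss:

```python
import numpy as np

def grad_loss(theta):
    return theta  # gradient of the assumed toy loss 0.5 * ||theta||^2

def momentum_descent(theta0, lr=0.1, gamma=0.9, steps=100):
    """Momentum: v_t = gamma * v_{t-1} + lr * grad; theta_{t+1} = theta_t - v_t."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = gamma * v + lr * grad_loss(theta)
        theta = theta - v
    return theta

print(momentum_descent([3.0, -2.0]))  # oscillates toward the minimum at 0
```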
2. RMSprop
RMSprop adapts learning rates on a per-parameter basis:
\[ r_t = \rho r_{t-1} + (1 - \rho)(\nabla L(\theta_t))^2 \]
\[ \theta_{t+1} = \theta_t - \frac{\eta \nabla L(\theta_t)}{\sqrt{r_t + \epsilon}} \]
where:
- \(\rho\) is the decay rate (typically 0.9)
- \(\epsilon\) is a small constant for numerical stability
- \(r_t\) maintains a running average of squared gradients
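A sketch of the RMSprop update under the same toy-loss assumption; note how each parameter's step is divided by the root of its own running average of squared gradients:

```python
import numpy as np

def grad_loss(theta):
    return theta  # gradient of the assumed toy loss 0.5 * ||theta||^2

def rmsprop(theta0, lr=0.01, rho=0.9, eps=1e-8, steps=1000):
    """RMSprop: scale each parameter's step by a running average of its squared gradients."""
    theta = np.asarray(theta0, dtype=float)
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_loss(theta)
        r = rho * r + (1 - rho) * g**2
        theta = theta - lr * g / np.sqrt(r + eps)
    return theta

print(rmsprop([3.0, -2.0]))  # approaches 0; each step is roughly lr in magnitude per parameter
```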
3. Adam (Adaptive Moment Estimation)
Adam combines momentum and adaptive learning rates. Its update equations are as follows:
\[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla L(\theta_t) \]
\[ v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla L(\theta_t))^2 \]
Bias correction:
\[ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} \]
\[ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \]
Parameter update:
\[ \theta_{t+1} = \theta_t - \frac{\eta \hat{m}_t}{\sqrt{\hat{v}_t + \epsilon}} \]
where:
- \(\beta_1\) is the first moment decay (typically 0.9)
- \(\beta_2\) is the second moment decay (typically 0.999)
- \(\epsilon\) is a numerical stability term (typically \(10^{-8}\))
Adam is currently the most widely used optimizer because:
- It works well out of the box
- It adapts the learning rate per parameter
- It combines the benefits of momentum and RMSprop
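A sketch of the full Adam loop under the same toy-loss assumption, including the bias-correction step:

```python
import numpy as np

def grad_loss(theta):
    return theta  # gradient of the assumed toy loss 0.5 * ||theta||^2

def adam(theta0, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    """Adam: first/second moment estimates with bias correction, per the equations above."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_loss(theta)
        m = beta1 * m + (1 - beta1) * g        # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g**2     # second moment (mean of squared gradients)
        m_hat = m / (1 - beta1**t)             # bias correction for the zero initialization
        v_hat = v / (1 - beta2**t)
        theta = theta - lr * m_hat / np.sqrt(v_hat + eps)
    return theta

print(adam([3.0, -2.0]))  # approaches the minimum at 0
```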
4. AdaGrad
AdaGrad adapts the learning rate based on the history of gradients:
\[ r_t = r_{t-1} + (\nabla L(\theta_t))^2 \]
\[ \theta_{t+1} = \theta_t - \frac{\eta \nabla L(\theta_t)}{\sqrt{r_t + \epsilon}} \]
Note:
- Because \(r_t\) only grows, the effective learning rate decreases over time
- It can stop learning too early on deep networks
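A sketch of AdaGrad under the same toy-loss assumption; the only change from RMSprop is that squared gradients are summed rather than averaged with decay, which is what makes the effective step shrink over time:

```python
import numpy as np

def grad_loss(theta):
    return theta  # gradient of the assumed toy loss 0.5 * ||theta||^2

def adagrad(theta0, lr=0.1, eps=1e-8, steps=1000):
    """AdaGrad: accumulate all past squared gradients, so the effective step shrinks."""
    theta = np.asarray(theta0, dtype=float)
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_loss(theta)
        r = r + g**2                          # running sum, not a decaying average
        theta = theta - lr * g / np.sqrt(r + eps)
    return theta

print(adagrad([3.0, -2.0]))  # progress slows as r_t accumulates
```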
5. Nesterov Accelerated Gradient (NAG)
Nesterov Accelerated Gradient (NAG) is a modified momentum method that looks ahead:
\[ v_t = \gamma v_{t-1} + \eta \nabla L\Bigl(\theta_t - \gamma v_{t-1}\Bigr) \]
\[ \theta_{t+1} = \theta_t - v_t \]
Benefits:
- More responsive than standard momentum
- Better theoretical convergence properties
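A sketch of NAG under the same toy-loss assumption; the gradient is evaluated at the look-ahead point \(\theta_t - \gamma v_{t-1}\) rather than at \(\theta_t\):

```python
import numpy as np

def grad_loss(theta):
    return theta  # gradient of the assumed toy loss 0.5 * ||theta||^2

def nesterov(theta0, lr=0.1, gamma=0.9, steps=100):
    """NAG: evaluate the gradient at the look-ahead point theta - gamma * v."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        lookahead = theta - gamma * v
        v = gamma * v + lr * grad_loss(lookahead)
        theta = theta - v
    return theta

print(nesterov([3.0, -2.0]))  # converges with damped oscillations
```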
Section 3.7 - Choosing Learning Rates
General Guidelines
- Starting Learning Rate:
# Common ranges per optimizer
learning_rates = {
    'sgd': 0.01,       # Higher because no adaptation
    'momentum': 0.01,  # Similar to SGD
    'adam': 0.001,     # Lower due to adaptation
    'rmsprop': 0.001   # Lower due to adaptation
}
- Learning Rate Schedules:
Step Decay:
\[ \eta_t = \eta_0 \cdot \gamma^{\lfloor t/s \rfloor} \]
where:
- \(\eta_0\) is the initial learning rate
- \(\gamma\) is the decay factor
- \(s\) is the step size
Exponential Decay:
\[ \eta_t = \eta_0 \cdot e^{-kt} \]
where \(k\) is the decay rate.
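Both schedules can be written as small Python functions (the names are illustrative, not taken from any particular library):

```python
import math

def step_decay(lr0, gamma, step_size, t):
    """Step decay: eta_t = eta_0 * gamma ** floor(t / s)."""
    return lr0 * gamma ** (t // step_size)

def exponential_decay(lr0, k, t):
    """Exponential decay: eta_t = eta_0 * exp(-k * t)."""
    return lr0 * math.exp(-k * t)

# Example: start at 0.1 and halve every 10 epochs under step decay.
print([round(step_decay(0.1, 0.5, 10, t), 4) for t in (0, 9, 10, 25)])  # [0.1, 0.1, 0.05, 0.025]
```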
Section 3.8 - Optimizer Selection Guide
When to Use Each Optimizer
- Adam:
  - Default choice for most problems.
  - Good for large datasets.
  - Works well with sparse gradients.
- SGD + Momentum:
  - Can give better final convergence.
  - Often used in computer vision.
  - May require careful learning rate tuning.
- RMSprop:
  - Good for recurrent networks.
  - Handles non-stationary objectives.
  - Serves as an alternative to Adam.
- Computational Cost:
  - Adam requires more memory (it stores two moment estimates per parameter).
  - SGD is computationally simpler.
  - There is a trade-off between memory usage and convergence speed.
- Problem Type:
  - For CNNs: SGD with Momentum or Adam.
  - For RNNs: RMSprop or Adam.
  - For dense networks: Any optimizer works well, but Adam is often preferred.
- Dataset Size:
  - Large datasets: Adaptive methods (like Adam) are advantageous.
  - Small datasets: Simpler methods may generalize better.
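To make the guide concrete, here is a hedged PyTorch sketch of how the common choices might be instantiated and used for a single training step; the tiny linear model and random batch are placeholder assumptions, and the learning rates simply echo the rough ranges from Section 3.7:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model standing in for a real network

# Typical starting points, in line with the ranges given in Section 3.7.
optimizers = {
    'sgd_momentum': torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9),
    'adam': torch.optim.Adam(model.parameters(), lr=1e-3),
    'rmsprop': torch.optim.RMSprop(model.parameters(), lr=1e-3),
}

# One training step with the chosen optimizer (Adam here, the usual default).
opt = optimizers['adam']
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```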