Understanding Optimizers
Optimization Algorithms in Deep Learning
Section 3.5 - From Gradient Descent to Modern Optimizers
Basic Gradient Descent
The foundation of optimization in deep learning is the basic gradient descent update:
\[ \theta_{t+1} = \theta_t - \eta \nabla L(\theta_t) \]
where:
- \(\theta_t\) is the parameter at time \(t\)
- \(\eta\) is the learning rate
- \(\nabla L(\theta_t)\) is the gradient of the loss function at \(\theta_t\)
The choice of learning rate \(\eta\) is crucial:
- Too large: overshooting, divergence
- Too small: slow convergence
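As a concrete illustration, here is a minimal NumPy sketch of this update on an assumed toy quadratic loss (the loss and the function names are illustrative only, not a reference implementation); it also shows the effect of an overly large learning rate:

```python
import numpy as np

# Illustrative toy loss L(theta) = 0.5 * ||theta||^2, whose gradient is simply theta.
def grad_loss(theta):
    return theta

def gradient_descent(theta0, lr=0.1, steps=50):
    """Plain gradient descent: theta <- theta - lr * grad L(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad_loss(theta)
    return theta

print(gradient_descent([3.0, -2.0], lr=0.1))   # shrinks toward the minimum at 0
print(gradient_descent([3.0, -2.0], lr=2.5))   # too-large learning rate: iterates blow up
```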
Section 3.6 - Modern Optimizers
1. Momentum
Momentum adds a “velocity” term to parameter updates:
\[ v_t = \gamma v_{t-1} + \eta \nabla L(\theta_t) \]
\[ \theta_{t+1} = \theta_t - v_t \]
where:
- \(\gamma\) is the momentum coefficient (typically 0.9)
- \(v_t\) is the velocity term
Benefits:
- Helps escape local minima
- Smooths oscillations
- Faster convergence in ravines
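A minimal sketch of the momentum update, again assuming the same illustrative toy quadratic loss:

```python
import numpy as np

def grad_loss(theta):
    return theta  # gradient of the assumed toy loss 0.5 * ||theta||^2

def momentum_descent(theta0, lr=0.1, gamma=0.9, steps=100):
    """Momentum: v_t = gamma * v_{t-1} + lr * grad; theta_{t+1} = theta_t - v_t."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = gamma * v + lr * grad_loss(theta)
        theta = theta - v
    return theta

print(momentum_descent([3.0, -2.0]))  # oscillates toward the minimum at 0
```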
2. RMSprop
RMSprop adapts learning rates on a per-parameter basis:
\[ r_t = \rho r_{t-1} + (1 - \rho)(\nabla L(\theta_t))^2 \]
\[ \theta_{t+1} = \theta_t - \frac{\eta \nabla L(\theta_t)}{\sqrt{r_t + \epsilon}} \]
where:
- \(\rho\) is the decay rate (typically 0.9)
- \(\epsilon\) is a small constant for numerical stability
- \(r_t\) maintains a running average of squared gradients
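A sketch of the RMSprop update under the same toy-loss assumption; note how each parameter's step is divided by the root of its own running average of squared gradients:

```python
import numpy as np

def grad_loss(theta):
    return theta  # gradient of the assumed toy loss 0.5 * ||theta||^2

def rmsprop(theta0, lr=0.01, rho=0.9, eps=1e-8, steps=1000):
    """RMSprop: scale each parameter's step by a running average of its squared gradients."""
    theta = np.asarray(theta0, dtype=float)
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_loss(theta)
        r = rho * r + (1 - rho) * g**2
        theta = theta - lr * g / np.sqrt(r + eps)
    return theta

print(rmsprop([3.0, -2.0]))  # approaches 0; each step is roughly lr in magnitude per parameter
```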
3. Adam (Adaptive Moment Estimation)
Adam combines momentum and adaptive learning rates. Its update equations are as follows:
\[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla L(\theta_t) \]
\[ v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla L(\theta_t))^2 \]
Bias correction:
\[ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} \]
\[ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \]
Parameter update:
\[ \theta_{t+1} = \theta_t - \frac{\eta \hat{m}_t}{\sqrt{\hat{v}_t + \epsilon}} \]
where:
- \(\beta_1\) is the first moment decay (typically 0.9)
- \(\beta_2\) is the second moment decay (typically 0.999)
- \(\epsilon\) is a numerical stability term (typically \(10^{-8}\))
Adam is currently the most widely used optimizer because:
- It works well out of the box
- It adapts the learning rate per parameter
- It combines the benefits of momentum and RMSprop
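A sketch of the full Adam loop under the same toy-loss assumption, including the bias-correction step:

```python
import numpy as np

def grad_loss(theta):
    return theta  # gradient of the assumed toy loss 0.5 * ||theta||^2

def adam(theta0, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    """Adam: first/second moment estimates with bias correction, per the equations above."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_loss(theta)
        m = beta1 * m + (1 - beta1) * g        # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g**2     # second moment (mean of squared gradients)
        m_hat = m / (1 - beta1**t)             # bias correction for the zero initialization
        v_hat = v / (1 - beta2**t)
        theta = theta - lr * m_hat / np.sqrt(v_hat + eps)
    return theta

print(adam([3.0, -2.0]))  # approaches the minimum at 0
```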
4. AdaGrad
AdaGrad adapts the learning rate based on the history of gradients:
\[ r_t = r_{t-1} + (\nabla L(\theta_t))^2 \]
\[ \theta_{t+1} = \theta_t - \frac{\eta \nabla L(\theta_t)}{\sqrt{r_t + \epsilon}} \]
Note:
- Because \(r_t\) only grows, the effective learning rate decreases over time
- It can stop learning too early on deep networks
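A sketch of AdaGrad under the same toy-loss assumption; the only change from RMSprop is that squared gradients are summed rather than averaged with decay, which is what makes the effective step shrink over time:

```python
import numpy as np

def grad_loss(theta):
    return theta  # gradient of the assumed toy loss 0.5 * ||theta||^2

def adagrad(theta0, lr=0.1, eps=1e-8, steps=1000):
    """AdaGrad: accumulate all past squared gradients, so the effective step shrinks."""
    theta = np.asarray(theta0, dtype=float)
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_loss(theta)
        r = r + g**2                          # running sum, not a decaying average
        theta = theta - lr * g / np.sqrt(r + eps)
    return theta

print(adagrad([3.0, -2.0]))  # progress slows as r_t accumulates
```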
5. Nesterov Accelerated Gradient (NAG)
Nesterov Accelerated Gradient (NAG) is a modified momentum method that looks ahead:
\[ v_t = \gamma v_{t-1} + \eta \nabla L\Bigl(\theta_t - \gamma v_{t-1}\Bigr) \]
\[ \theta_{t+1} = \theta_t - v_t \]
Benefits:
- More responsive than standard momentum
- Better theoretical convergence properties
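A sketch of NAG under the same toy-loss assumption; the gradient is evaluated at the look-ahead point \(\theta_t - \gamma v_{t-1}\) rather than at \(\theta_t\):

```python
import numpy as np

def grad_loss(theta):
    return theta  # gradient of the assumed toy loss 0.5 * ||theta||^2

def nesterov(theta0, lr=0.1, gamma=0.9, steps=100):
    """NAG: evaluate the gradient at the look-ahead point theta - gamma * v."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        lookahead = theta - gamma * v
        v = gamma * v + lr * grad_loss(lookahead)
        theta = theta - v
    return theta

print(nesterov([3.0, -2.0]))  # converges with damped oscillations
```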
Section 3.7 - Choosing Learning Rates
General Guidelines
- Starting Learning Rate:
# Common ranges per optimizer
learning_rates = {
    'sgd': 0.01,       # Higher because no adaptation
    'momentum': 0.01,  # Similar to SGD
    'adam': 0.001,     # Lower due to adaptation
    'rmsprop': 0.001   # Lower due to adaptation
}
- Learning Rate Schedules:
Step Decay:
\[ \eta_t = \eta_0 \cdot \gamma^{\lfloor t/s \rfloor} \]
where:
- \(\eta_0\) is the initial learning rate
- \(\gamma\) is the decay factor
- \(s\) is the step size
Exponential Decay:
\[ \eta_t = \eta_0 \cdot e^{-kt} \]
where \(k\) is the decay rate.
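Both schedules can be written as small Python functions (the names are illustrative, not taken from any particular library):

```python
import math

def step_decay(lr0, gamma, step_size, t):
    """Step decay: eta_t = eta_0 * gamma ** floor(t / s)."""
    return lr0 * gamma ** (t // step_size)

def exponential_decay(lr0, k, t):
    """Exponential decay: eta_t = eta_0 * exp(-k * t)."""
    return lr0 * math.exp(-k * t)

# Example: start at 0.1 and halve every 10 epochs under step decay.
print([round(step_decay(0.1, 0.5, 10, t), 4) for t in (0, 9, 10, 25)])  # [0.1, 0.1, 0.05, 0.025]
```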
Section 3.8 - Optimizer Selection Guide
When to Use Each Optimizer
- Adam:
  - Default choice for most problems.
  - Good for large datasets.
  - Works well with sparse gradients.
- SGD + Momentum:
  - Can give better final convergence.
  - Often used in computer vision.
  - May require careful learning rate tuning.
- RMSprop:
  - Good for recurrent networks.
  - Handles non-stationary objectives.
  - Serves as an alternative to Adam.
- Computational Cost:
  - Adam requires more memory (it stores two moment estimates per parameter).
  - SGD is computationally simpler.
  - There is a trade-off between memory usage and convergence speed.
- Problem Type:
  - For CNNs: SGD with Momentum or Adam.
  - For RNNs: RMSprop or Adam.
  - For dense networks: Any optimizer works well, but Adam is often preferred.
- Dataset Size:
  - Large datasets: Adaptive methods (like Adam) are advantageous.
  - Small datasets: Simpler methods may generalize better.
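To make the guide concrete, here is a hedged PyTorch sketch of how the common choices might be instantiated and used for a single training step; the tiny linear model and random batch are placeholder assumptions, and the learning rates simply echo the rough ranges from Section 3.7:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model standing in for a real network

# Typical starting points, in line with the ranges given in Section 3.7.
optimizers = {
    'sgd_momentum': torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9),
    'adam': torch.optim.Adam(model.parameters(), lr=1e-3),
    'rmsprop': torch.optim.RMSprop(model.parameters(), lr=1e-3),
}

# One training step with the chosen optimizer (Adam here, the usual default).
opt = optimizers['adam']
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```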