Automatic Differentiation: The Engine of Deep Learning
Gradient Computation: Three Approaches
Section 1.12 - Finite Difference Approximation
Core Concept
Estimate derivatives using function evaluations at perturbed inputs:
First Derivative:
\[
\frac{\partial f}{\partial x} \approx \frac{f(x + \varepsilon) - f(x)}{\varepsilon}
\]
(Forward difference formula)
Example:
For \[
f(x) = x^2
\]
at ( x=2 ) with ( \varepsilon = 0.01 ): \[
f'(2) \approx \frac{(2.01)^2 - 2^2}{0.01} = \frac{4.0401 - 4}{0.01} = 4.01,
\]
while the true derivative is ( 4 ).
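As a concrete illustration, here is a minimal Python sketch of this estimate (the function name is ours for illustration, not from any library):

```python
def forward_difference(f, x, eps=0.01):
    """Estimate f'(x) with the forward-difference formula (f(x + eps) - f(x)) / eps."""
    return (f(x + eps) - f(x)) / eps

# f(x) = x^2 at x = 2 with eps = 0.01 reproduces the estimate above (true derivative: 4).
print(forward_difference(lambda x: x**2, 2.0))   # ≈ 4.01
```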
Limitations
- Precision Trade-off (see the sketch after this list):
  - Too large ( \varepsilon ) → Approximation error
  - Too small ( \varepsilon ) → Numerical instability
- Computational Cost:
  ( O(n) ) evaluations for ( n ) parameters → Intractable for neural networks (with ( n \approx 10^6 )–( 10^9 )).
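A small sketch of the precision trade-off under double-precision floating point: sweeping ( \varepsilon ) for ( f(x) = x^2 ) at ( x = 2 ), the error first shrinks with ( \varepsilon ) and then grows again once round-off dominates.

```python
def forward_difference(f, x, eps):
    return (f(x + eps) - f(x)) / eps

f = lambda x: x**2
for eps in (1e-1, 1e-4, 1e-8, 1e-12, 1e-15):
    est = forward_difference(f, 2.0, eps)
    print(f"eps={eps:.0e}  estimate={est:.10f}  error={abs(est - 4.0):.2e}")
```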
Section 1.13 - Symbolic Differentiation
What It Does
Manipulates mathematical expressions to derive exact derivative formulas.
Input:
\[
f(x) = \sin(x^2) + e^x
\]
Symbolic Derivative:
\[
f'(x) = 2x \cdot \cos(x^2) + e^x.
\]
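For reference, a computer-algebra system reproduces this exact derivative; a minimal sketch using SymPy:

```python
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x**2) + sp.exp(x)
print(sp.diff(f, x))   # 2*x*cos(x**2) + exp(x)
```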
The Expression Swell Problem
Consider nested operations:
\[ f(x) = (x+1)^{10}, \] so the compact derivative is \[ f'(x) = 10(x+1)^9. \]
If not simplified, the derivative may appear as an expanded product:
\[
f'(x) = 10(x+1)(x+1)(x+1)(x+1)(x+1)(x+1)(x+1)(x+1)(x+1),
\]
which leads to:
- Memory Impact: Storage grows exponentially with operation depth.
- Real-World Impact: A 100-layer network would produce unusable expressions.
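A rough illustration of the growth, assuming SymPy (which itself keeps the compact form, so expansion is forced here to stand in for an unsimplified result):

```python
import sympy as sp

x = sp.symbols('x')
f = (x + 1)**10

fp = sp.diff(f, x)
print(fp)                      # 10*(x + 1)**9 -- compact, simplified form
print(sp.count_ops(fp))        # only a few operations

expanded = sp.expand(fp)       # mimic an unsimplified, fully expanded result
print(expanded)                # 10*x**9 + 90*x**8 + ... + 10
print(sp.count_ops(expanded))  # many more operations to store and evaluate
```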
Section 1.14 - Automatic Differentiation (AD)
Foundational Idea
Decompose functions into elementary operations, then apply the chain rule numerically.
Example:
For \[
f(x,y) = x \cdot y + \sin(x),
\]
the computation proceeds as follows (a code sketch of the backward pass follows the list):
- Forward Pass:
  - ( a = x \cdot y )
  - ( b = \sin(x) )
  - ( f = a + b )
- Backward Pass (Reverse-Mode AD):
  - ( \partial f/\partial a = 1,\ \partial f/\partial b = 1 )
  - ( \partial a/\partial x = y,\ \partial a/\partial y = x )
  - ( \partial b/\partial x = \cos(x) )
  - Thus, \[ \frac{\partial f}{\partial x} = \frac{\partial f}{\partial a}\cdot\frac{\partial a}{\partial x} + \frac{\partial f}{\partial b}\cdot\frac{\partial b}{\partial x} = y + \cos(x). \]
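A minimal sketch of this reverse-mode computation in plain Python, hand-written for this single function rather than a general AD library (the function and variable names are illustrative):

```python
import math

def f_and_grads(x, y):
    # Forward pass: compute and store intermediate values.
    a = x * y
    b = math.sin(x)
    f = a + b

    # Backward pass: apply the chain rule from the output back to the inputs.
    df_da = 1.0                                # f = a + b
    df_db = 1.0
    df_dx = df_da * y + df_db * math.cos(x)    # = y + cos(x)
    df_dy = df_da * x                          # = x
    return f, df_dx, df_dy

print(f_and_grads(2.0, 3.0))   # f = 6 + sin(2), df/dx = 3 + cos(2), df/dy = 2
```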
Why It Scales
- Complexity: A single reverse pass costs a small constant multiple of the forward pass and yields gradients for all parameters at once, i.e. amortized ( O(1) ) work per parameter regardless of network depth.
- Memory: Stores only intermediate values, not full symbolic expressions.
Section 1.15 - Method Comparison
| Method            | Precision   | Compute Cost | Memory      | Scalability  |
|-------------------|-------------|--------------|-------------|--------------|
| Finite Difference | Approximate | ( O(n) )     | ( O(1) )    | Small ( n )  |
| Symbolic          | Exact       | ( O(1) )     | Exponential | Shallow nets |
| Automatic Diff    | Exact       | ( O(1) )     | Linear      | Any model    |
Section 1.16 - Composable Differentiability
Mathematical Guarantee
If ( f ) and ( g ) are differentiable, then ( h(x) = f(g(x)) ) is differentiable with: \[ h'(x) = f'(g(x)) \cdot g'(x). \]
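A quick numeric check of this guarantee, as a sketch with hand-picked components ( f = \tanh ) and ( g(x) = x^2 ) (chosen here only for illustration):

```python
import math

# h(x) = f(g(x)) with f = tanh and g(x) = x^2
g  = lambda x: x**2
f  = math.tanh
gp = lambda x: 2 * x                       # g'(x)
fp = lambda u: 1 - math.tanh(u)**2         # f'(u)

def h_prime(x):
    """Chain rule: h'(x) = f'(g(x)) * g'(x)."""
    return fp(g(x)) * gp(x)

# Compare against a forward-difference estimate of h'(x).
x, eps = 0.7, 1e-6
numeric = (f(g(x + eps)) - f(g(x))) / eps
print(h_prime(x), numeric)   # the two values agree to several decimal places
```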
Implication for Deep Learning:
Neural networks can be built from arbitrary differentiable components:
1. Basic Operations: Matrix multiply, convolution
2. Advanced Layers: Self-attention, spectral norms
3. Custom Functions: User-defined gradients
Section 1.17 - AD by Example
Computational Graph
Consider the function: \[ f(x,y) = \bigl(x + \sigma(y)\bigr) \cdot \tanh(x), \] where ( \sigma ) is the sigmoid function.
A simplified computational graph is:
    x            y
    │            │
    ├────────┐   σ
    │        │   │
  tanh      (+)──┘
    │        │
    └───mul──┘
         │
         f
Gradient Calculation
- Forward Pass: Compute all node values at ( x = 2,\ y = 1 )
  - ( \sigma(1) \approx 0.731 )
  - ( x + \sigma(y) = 2 + 0.731 = 2.731 )
  - ( \tanh(2) \approx 0.964 )
  - ( f \approx 2.731 \cdot 0.964 \approx 2.633 )
- Reverse Pass: Propagate derivatives
  - ( \partial f/\partial f = 1 )
  - ( \partial f/\partial \tanh(x) = x + \sigma(y) \approx 2.731 )
  - ( \partial f/\partial\bigl(x + \sigma(y)\bigr) = \tanh(2) \approx 0.964 )
  - ( \partial \tanh(x)/\partial x = 1 - \tanh^2(2) \approx 0.0707 )
  - ( \partial\bigl(x + \sigma(y)\bigr)/\partial x = 1 )
  - Total: \[ \frac{\partial f}{\partial x} \approx 2.731 \cdot 0.0707 + 0.964 \cdot 1 \approx 0.193 + 0.964 = 1.157. \]
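To close the loop, a small Python check of this hand computation (the gradient formula below is derived by hand from ( f(x,y) = (x + \sigma(y)) \tanh(x) ); it is a sketch, not library output):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f(x, y):
    return (x + sigmoid(y)) * math.tanh(x)

def df_dx(x, y):
    # Product rule + chain rule, matching the reverse pass above:
    # d/dx [(x + sigmoid(y)) * tanh(x)] = tanh(x) + (x + sigmoid(y)) * (1 - tanh(x)**2)
    return math.tanh(x) + (x + sigmoid(y)) * (1 - math.tanh(x)**2)

x, y = 2.0, 1.0
print(f(x, y))      # ≈ 2.633
print(df_dx(x, y))  # ≈ 1.157
```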