Automatic Differentiation: The Engine of Deep Learning
Gradient Computation: Three Approaches
Section 1.12 - Finite Difference Approximation
Core Concept
Estimate derivatives using function evaluations at perturbed inputs:
First Derivative:
\[
\frac{\partial f}{\partial x} \approx \frac{f(x + \varepsilon) - f(x)}{\varepsilon}
\]
(Forward difference formula)
Example:
For \[
f(x) = x^2
\]
at ( x=2 ) with ( \varepsilon = 0.01 ): \[
f'(2) \approx \frac{(2.01)^2 - 2^2}{0.01} = \frac{4.0401 - 4}{0.01} = 4.01,
\]
while the true derivative is ( 4 ).
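As a concrete illustration, here is a minimal Python sketch of this estimate (the function name is ours for illustration, not from any library):

```python
def forward_difference(f, x, eps=0.01):
    """Estimate f'(x) with the forward-difference formula (f(x + eps) - f(x)) / eps."""
    return (f(x + eps) - f(x)) / eps

# f(x) = x^2 at x = 2 with eps = 0.01 reproduces the estimate above (true derivative: 4).
print(forward_difference(lambda x: x**2, 2.0))   # ≈ 4.01
```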
Limitations
- Precision Trade-off (see the sketch after this list):
  - Too large ( \varepsilon ) → Approximation error
  - Too small ( \varepsilon ) → Numerical instability
- Computational Cost:
  ( O(n) ) evaluations for ( n ) parameters → Intractable for neural networks (with ( n \approx 10^6 )–( 10^9 )).
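A small sketch of the precision trade-off under double-precision floating point: sweeping ( \varepsilon ) for ( f(x) = x^2 ) at ( x = 2 ), the error first shrinks with ( \varepsilon ) and then grows again once round-off dominates.

```python
def forward_difference(f, x, eps):
    return (f(x + eps) - f(x)) / eps

f = lambda x: x**2
for eps in (1e-1, 1e-4, 1e-8, 1e-12, 1e-15):
    est = forward_difference(f, 2.0, eps)
    print(f"eps={eps:.0e}  estimate={est:.10f}  error={abs(est - 4.0):.2e}")
```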
Section 1.13 - Symbolic Differentiation
What It Does
Manipulates mathematical expressions to derive exact derivative formulas.
Input:
\[
f(x) = \sin(x^2) + e^x
\]
Symbolic Derivative:
\[
f'(x) = 2x \cdot \cos(x^2) + e^x.
\]
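For reference, a computer-algebra system reproduces this exact derivative; a minimal sketch using SymPy:

```python
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x**2) + sp.exp(x)
print(sp.diff(f, x))   # 2*x*cos(x**2) + exp(x)
```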
The Expression Swell Problem
Consider nested operations:
\[ f(x) = (x+1)^{10}, \] so the compact derivative is \[ f'(x) = 10(x+1)^9. \]
If not simplified, the derivative may appear as an expanded product:
\[
f'(x) = 10(x+1)(x+1)(x+1)(x+1)(x+1)(x+1)(x+1)(x+1)(x+1),
\]
which leads to:
- Memory Impact: Storage grows exponentially with operation depth.
- Real-World Impact: A 100-layer network would produce unusable expressions.
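A rough illustration of the growth, assuming SymPy (which itself keeps the compact form, so expansion is forced here to stand in for an unsimplified result):

```python
import sympy as sp

x = sp.symbols('x')
f = (x + 1)**10

fp = sp.diff(f, x)
print(fp)                      # 10*(x + 1)**9 -- compact, simplified form
print(sp.count_ops(fp))        # only a few operations

expanded = sp.expand(fp)       # mimic an unsimplified, fully expanded result
print(expanded)                # 10*x**9 + 90*x**8 + ... + 10
print(sp.count_ops(expanded))  # many more operations to store and evaluate
```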
Section 1.14 - Automatic Differentiation (AD)
Foundational Idea
Decompose functions into elementary operations, then apply the chain rule numerically.
Example:
For \[
f(x,y) = x \cdot y + \sin(x),
\]
the computation proceeds as follows (a code sketch of the backward pass follows the list):
- Forward Pass:
  - ( a = x \cdot y )
  - ( b = \sin(x) )
  - ( f = a + b )
- Backward Pass (Reverse-Mode AD):
  - ( \partial f/\partial a = 1,\ \partial f/\partial b = 1 )
  - ( \partial a/\partial x = y,\ \partial a/\partial y = x )
  - ( \partial b/\partial x = \cos(x) )
  - Thus, \[ \frac{\partial f}{\partial x} = \frac{\partial f}{\partial a}\cdot\frac{\partial a}{\partial x} + \frac{\partial f}{\partial b}\cdot\frac{\partial b}{\partial x} = y + \cos(x). \]
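A minimal sketch of this reverse-mode computation in plain Python, hand-written for this single function rather than a general AD library (the function and variable names are illustrative):

```python
import math

def f_and_grads(x, y):
    # Forward pass: compute and store intermediate values.
    a = x * y
    b = math.sin(x)
    f = a + b

    # Backward pass: apply the chain rule from the output back to the inputs.
    df_da = 1.0                                # f = a + b
    df_db = 1.0
    df_dx = df_da * y + df_db * math.cos(x)    # = y + cos(x)
    df_dy = df_da * x                          # = x
    return f, df_dx, df_dy

print(f_and_grads(2.0, 3.0))   # f = 6 + sin(2), df/dx = 3 + cos(2), df/dy = 2
```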
Why It Scales
- Complexity: A single reverse pass costs a small constant multiple of the forward pass and yields gradients for all parameters at once, i.e. amortized ( O(1) ) work per parameter regardless of network depth.
- Memory: Stores only intermediate values, not full symbolic expressions.
Section 1.15 - Method Comparison
| Method            | Precision   | Compute Cost | Memory      | Scalability  |
|-------------------|-------------|--------------|-------------|--------------|
| Finite Difference | Approximate | ( O(n) )     | ( O(1) )    | Small ( n )  |
| Symbolic          | Exact       | ( O(1) )     | Exponential | Shallow nets |
| Automatic Diff    | Exact       | ( O(1) )     | Linear      | Any model    |
Section 1.16 - Composable Differentiability
Mathematical Guarantee
If ( f ) and ( g ) are differentiable, then ( h(x) = f(g(x)) ) is differentiable with: \[ h'(x) = f'(g(x)) \cdot g'(x). \]
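A quick numeric check of this guarantee, as a sketch with hand-picked components ( f = \tanh ) and ( g(x) = x^2 ) (chosen here only for illustration):

```python
import math

# h(x) = f(g(x)) with f = tanh and g(x) = x^2
g  = lambda x: x**2
f  = math.tanh
gp = lambda x: 2 * x                       # g'(x)
fp = lambda u: 1 - math.tanh(u)**2         # f'(u)

def h_prime(x):
    """Chain rule: h'(x) = f'(g(x)) * g'(x)."""
    return fp(g(x)) * gp(x)

# Compare against a forward-difference estimate of h'(x).
x, eps = 0.7, 1e-6
numeric = (f(g(x + eps)) - f(g(x))) / eps
print(h_prime(x), numeric)   # the two values agree to several decimal places
```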
Implication for Deep Learning:
Neural networks can be built from arbitrary differentiable components:
1. Basic Operations: Matrix multiply, convolution
2. Advanced Layers: Self-attention, spectral norms
3. Custom Functions: User-defined gradients
Section 1.17 - AD by Example
Computational Graph
Consider the function: \[ f(x,y) = \bigl(x + \sigma(y)\bigr) \cdot \tanh(x), \] where ( \sigma ) is the sigmoid function.
A simplified computational graph is:
    x            y
    │            │
    ├────────┐   σ
    │        │   │
  tanh      (+)──┘
    │        │
    └───mul──┘
         │
         f
Gradient Calculation
- Forward Pass: Compute all node values at ( x = 2,\ y = 1 )
  - ( \sigma(1) \approx 0.731 )
  - ( x + \sigma(y) = 2 + 0.731 = 2.731 )
  - ( \tanh(2) \approx 0.964 )
  - ( f \approx 2.731 \cdot 0.964 \approx 2.633 )
- Reverse Pass: Propagate derivatives
  - ( \partial f/\partial f = 1 )
  - ( \partial f/\partial \tanh(x) = x + \sigma(y) \approx 2.731 )
  - ( \partial f/\partial\bigl(x + \sigma(y)\bigr) = \tanh(2) \approx 0.964 )
  - ( \partial \tanh(x)/\partial x = 1 - \tanh^2(2) \approx 0.0707 )
  - ( \partial\bigl(x + \sigma(y)\bigr)/\partial x = 1 )
  - Total: \[ \frac{\partial f}{\partial x} \approx 2.731 \cdot 0.0707 + 0.964 \cdot 1 \approx 0.193 + 0.964 = 1.157. \]
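To close the loop, a small Python check of this hand computation (the gradient formula below is derived by hand from ( f(x,y) = (x + \sigma(y)) \tanh(x) ); it is a sketch, not library output):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f(x, y):
    return (x + sigmoid(y)) * math.tanh(x)

def df_dx(x, y):
    # Product rule + chain rule, matching the reverse pass above:
    # d/dx [(x + sigmoid(y)) * tanh(x)] = tanh(x) + (x + sigmoid(y)) * (1 - tanh(x)**2)
    return math.tanh(x) + (x + sigmoid(y)) * (1 - math.tanh(x)**2)

x, y = 2.0, 1.0
print(f(x, y))      # ≈ 2.633
print(df_dx(x, y))  # ≈ 1.157
```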