GPUs and Deep Learning: When Hardware Matters
CPU vs GPU: The Deep Learning Divide
Section 1.22 - Architectural Differences
CPU (Central Processing Unit)
- Design: Few complex cores (4–64) optimized for sequential tasks
- Strengths:
- Fast single-thread performance
- Handles diverse workloads (file I/O, system tasks)
- Analogy: A master chef preparing dishes one at a time
GPU (Graphics Processing Unit)
- Design: 1000s of simple cores optimized for parallel tasks
- Strengths:
- Massively parallel floating-point operations
- Efficient matrix/tensor computations
- Analogy: A kitchen army chopping 1000 vegetables simultaneously
Section 1.23 - Matrix Multiplication: GPU’s Sweet Spot
Why GPUs Dominate Deep Learning
Neural network forward pass for layer \(l\): \[ \mathbf{h}^{(l)} = \varphi\Bigl(\mathbf{W}^{(l)}\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\Bigr) \]
GPU Advantages:
1. Parallelize matrix multiplications across 1000s of cores
2. Batch operations: Process multiple samples simultaneously (see the sketch after this list)
3. Specialized cores: Tensor Cores (NVIDIA) accelerate mixed-precision math
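To make this concrete, here is a minimal sketch of the batched layer computation in PyTorch (assumed available; the dimensions and variable names are illustrative, not from the course material):

```python
import torch

# Illustrative sizes: a batch of 128 samples through one 256 -> 128 layer
batch_size, d_in, d_out = 128, 256, 128
W = torch.randn(d_out, d_in)            # weight matrix W^(l)
b = torch.randn(d_out)                  # bias vector b^(l)
h_prev = torch.randn(batch_size, d_in)  # previous activations h^(l-1)

# Identical code runs on CPU or GPU; only the device changes
device = "cuda" if torch.cuda.is_available() else "cpu"
W, b, h_prev = W.to(device), b.to(device), h_prev.to(device)

# One matrix multiplication covers the whole batch; on a GPU each
# output element can be computed by a separate thread in parallel
h = torch.relu(h_prev @ W.T + b)        # phi = ReLU here
print(h.shape)                          # torch.Size([128, 128])
```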
Performance Gain:
- CPU: ~100 GFLOPS (e.g., Intel i9)
- GPU: ~50 TFLOPS (e.g., NVIDIA A100) → ~500× higher peak throughput (see the micro-benchmark below)
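Those are peak figures; to see what your own CPU actually sustains, a rough NumPy micro-benchmark works (a sketch only; results depend heavily on your BLAS build and core count):

```python
import time
import numpy as np

n = 2048
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)

_ = A @ B  # warm-up so BLAS thread pools are initialized

start = time.perf_counter()
_ = A @ B
elapsed = time.perf_counter() - start

# A dense n x n matmul performs roughly 2 * n^3 floating-point operations
print(f"~{2 * n**3 / elapsed / 1e9:.0f} GFLOPS sustained on this CPU")
```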
Section 1.24 - When GPUs Aren’t Worth It
Case 1: Small Models
A 2-layer MLP (input=64, hidden=32, output=1):
- Parameters:
\[
\underbrace{(64 \times 32) + (32 \times 1)}_{\text{weights}\,=\,2080} + \underbrace{32 + 1}_{\text{biases}\,=\,33} = 2113
\]
- CPU Time: 0.5 ms/batch (direct cache access)
- GPU Time: 2 ms/batch (data transfer overhead dominates; see the timing sketch below)
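The crossover is easy to measure yourself. Below is a hedged timing sketch (PyTorch assumed; absolute numbers vary by machine, and the warm-up and synchronize calls are needed for fair GPU timing):

```python
import time
import torch

# The 2-layer MLP from the example above (64 -> 32 -> 1)
model = torch.nn.Sequential(
    torch.nn.Linear(64, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
)
x = torch.randn(256, 64)  # one small batch

@torch.no_grad()
def ms_per_batch(device, reps=100):
    m, xb = model.to(device), x.to(device)
    m(xb)  # warm-up (first CUDA call pays context/kernel setup costs)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(reps):
        m(xb)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    return (time.perf_counter() - start) / reps * 1e3

print(f"CPU: {ms_per_batch('cpu'):.3f} ms/batch")
if torch.cuda.is_available():
    print(f"GPU: {ms_per_batch('cuda'):.3f} ms/batch")
```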
Case 2: Non-Matrix Work
- Data preprocessing (Pandas operations)
- Decision tree training (sequential splits)
- HTTP server handling
Section 1.25 - VRAM: The Memory Bottleneck
Why It Matters
- Stores model weights and activations during training
- Example requirements:
- LSTM with 1000 units: ~16 MB (estimated in the sketch below)
- GPT-4: ~1 TB (requires multi-GPU)
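The ~16 MB figure can be reproduced with a back-of-the-envelope count (a sketch assuming a univariate input and fp32 weights at 4 bytes each; training needs extra memory on top for gradients, optimizer state, and activations):

```python
def lstm_param_count(input_size, hidden_size):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (input_size * hidden_size + hidden_size**2 + hidden_size)

# 1000-unit LSTM on a univariate time series, stored in fp32
params = lstm_param_count(input_size=1, hidden_size=1000)
print(f"{params:,} parameters ≈ {params * 4 / 1e6:.0f} MB of weights")
# -> 4,008,000 parameters ≈ 16 MB
```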
Course Context:
Time series models rarely exceed 100 MB → they fit comfortably in CPU RAM
Section 1.26 - Practical Considerations
For This Course
- No GPU Needed:
- All practicals are designed for CPU execution
- Typical training times <10 minutes per exercise
- Why?
- Small datasets (synthetic or historical market data)
- Compact architectures (≤5 layers, ≤256 units)
Experimenting Beyond
Cloud GPU Options:
| Platform | Cost | Setup Complexity |
|---|---|---|
| Google Colab | Free (T4 GPU) | Low (browser) |
| Vast.ai | ~$0.15/hr (RTX 3090) | Medium (Docker) |
| AWS EC2 | ~$0.50/hr (T4) | High (IAM/VPC) |
First-Time Setup Guide:
1. Create an account on the chosen platform
2. Upload your Jupyter notebook
3. Select a GPU instance type
4. Run `!nvidia-smi` to verify GPU access (or use the Python check below)
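If the kernel is Python, the same check works without shell magic (a minimal sketch assuming PyTorch is installed on the instance):

```python
import torch

# Confirms that the framework itself, not just the driver, can see the GPU
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```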
Historical Note
The 2012 AlexNet breakthrough (ImageNet classification) was enabled by NVIDIA GTX 580 GPUs, training in 5 days versus months on CPUs. Modern LLMs like GPT-4 would be infeasible without GPU clusters.