GPUs and Deep Learning: When Hardware Matters
CPU vs GPU: The Deep Learning Divide
Section 1.22 - Architectural Differences
CPU (Central Processing Unit)
- Design: Few complex cores (4–64) optimized for sequential tasks
- Strengths:
- Fast single-thread performance
- Handles diverse workloads (file I/O, system tasks)
- Analogy: A master chef preparing dishes one at a time
GPU (Graphics Processing Unit)
- Design: 1000s of simple cores optimized for parallel tasks
- Strengths:
- Massively parallel floating-point operations
- Efficient matrix/tensor computations
- Analogy: A kitchen army chopping 1000 vegetables simultaneously
Section 1.23 - Matrix Multiplication: GPU’s Sweet Spot
Why GPUs Dominate Deep Learning
Neural network forward pass for layer \(l\): \[ \mathbf{h}^{(l)} = \varphi\Bigl(\mathbf{W}^{(l)}\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\Bigr) \]
GPU Advantages:
1. Parallelize matrix multiplications across 1000s of cores
2. Batch operations: Process multiple samples simultaneously (see the sketch after this list)
3. Specialized cores: Tensor Cores (NVIDIA) accelerate mixed-precision math
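To make this concrete, here is a minimal sketch of the batched layer computation in PyTorch (assumed available; the dimensions and variable names are illustrative, not from the course material):

```python
import torch

# Illustrative sizes: a batch of 128 samples through one 256 -> 128 layer
batch_size, d_in, d_out = 128, 256, 128
W = torch.randn(d_out, d_in)            # weight matrix W^(l)
b = torch.randn(d_out)                  # bias vector b^(l)
h_prev = torch.randn(batch_size, d_in)  # previous activations h^(l-1)

# Identical code runs on CPU or GPU; only the device changes
device = "cuda" if torch.cuda.is_available() else "cpu"
W, b, h_prev = W.to(device), b.to(device), h_prev.to(device)

# One matrix multiplication covers the whole batch; on a GPU each
# output element can be computed by a separate thread in parallel
h = torch.relu(h_prev @ W.T + b)        # phi = ReLU here
print(h.shape)                          # torch.Size([128, 128])
```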
Performance Gain:
- CPU: ~100 GFLOPS (e.g., Intel i9)
- GPU: ~50 TFLOPS (e.g., NVIDIA A100) → ~500× higher peak throughput (see the micro-benchmark below)
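Those are peak figures; to see what your own CPU actually sustains, a rough NumPy micro-benchmark works (a sketch only; results depend heavily on your BLAS build and core count):

```python
import time
import numpy as np

n = 2048
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)

_ = A @ B  # warm-up so BLAS thread pools are initialized

start = time.perf_counter()
_ = A @ B
elapsed = time.perf_counter() - start

# A dense n x n matmul performs roughly 2 * n^3 floating-point operations
print(f"~{2 * n**3 / elapsed / 1e9:.0f} GFLOPS sustained on this CPU")
```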
Section 1.24 - When GPUs Aren’t Worth It
Case 1: Small Models
A 2-layer MLP (input=64, hidden=32, output=1):
- Parameters:
\[
\underbrace{(64 \times 32) + (32 \times 1)}_{\text{weights}\,=\,2080} + \underbrace{32 + 1}_{\text{biases}\,=\,33} = 2113
\]
- CPU Time: 0.5 ms/batch (direct cache access)
- GPU Time: 2 ms/batch (data transfer overhead dominates; see the timing sketch below)
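The crossover is easy to measure yourself. Below is a hedged timing sketch (PyTorch assumed; absolute numbers vary by machine, and the warm-up and synchronize calls are needed for fair GPU timing):

```python
import time
import torch

# The 2-layer MLP from the example above (64 -> 32 -> 1)
model = torch.nn.Sequential(
    torch.nn.Linear(64, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
)
x = torch.randn(256, 64)  # one small batch

@torch.no_grad()
def ms_per_batch(device, reps=100):
    m, xb = model.to(device), x.to(device)
    m(xb)  # warm-up (first CUDA call pays context/kernel setup costs)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(reps):
        m(xb)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    return (time.perf_counter() - start) / reps * 1e3

print(f"CPU: {ms_per_batch('cpu'):.3f} ms/batch")
if torch.cuda.is_available():
    print(f"GPU: {ms_per_batch('cuda'):.3f} ms/batch")
```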
Case 2: Non-Matrix Work
- Data preprocessing (Pandas operations)
- Decision tree training (sequential splits)
- HTTP server handling
Section 1.25 - VRAM: The Memory Bottleneck
Why It Matters
- Stores model weights and activations during training
- Example requirements:
- LSTM with 1000 units: ~16 MB (estimated in the sketch below)
- GPT-4: ~1 TB (requires multi-GPU)
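The ~16 MB figure can be reproduced with a back-of-the-envelope count (a sketch assuming a univariate input and fp32 weights at 4 bytes each; training needs extra memory on top for gradients, optimizer state, and activations):

```python
def lstm_param_count(input_size, hidden_size):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (input_size * hidden_size + hidden_size**2 + hidden_size)

# 1000-unit LSTM on a univariate time series, stored in fp32
params = lstm_param_count(input_size=1, hidden_size=1000)
print(f"{params:,} parameters ≈ {params * 4 / 1e6:.0f} MB of weights")
# -> 4,008,000 parameters ≈ 16 MB
```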
Course Context:
Time series models rarely exceed 100 MB → they fit comfortably in CPU RAM
Section 1.26 - Practical Considerations
For This Course
- No GPU Needed:
- All practicals are designed for CPU execution
- Typical training times <10 minutes per exercise
- Why?
- Small datasets (synthetic or historical market data)
- Compact architectures (≤5 layers, ≤256 units)
Experimenting Beyond
Cloud GPU Options:
| Platform | Cost | Setup Complexity |
|---|---|---|
| Google Colab | Free (T4 GPU) | Low (browser) |
| Vast.ai | ~$0.15/hr (RTX 3090) | Medium (Docker) |
| AWS EC2 | ~$0.50/hr (T4) | High (IAM/VPC) |
First-Time Setup Guide:
1. Create an account on the chosen platform
2. Upload your Jupyter notebook
3. Select a GPU instance type
4. Run `!nvidia-smi` to verify GPU access (or use the Python check below)
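If the kernel is Python, the same check works without shell magic (a minimal sketch assuming PyTorch is installed on the instance):

```python
import torch

# Confirms that the framework itself, not just the driver, can see the GPU
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```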
Historical Note
The 2012 AlexNet breakthrough (ImageNet classification) was enabled by NVIDIA GTX 580 GPUs, training in 5 days versus months on CPUs. Modern LLMs like GPT-4 would be infeasible without GPU clusters.