Activation Functions: Adding Non-linearity
Section 1.36 - Why We Need Activations
Role in Neural Networks
- Add Non-linearity: Without activations, a stack of layers is just a composition of linear maps, which collapses to a single linear transformation.
- Control Output Range: Normalize or bound outputs to specific ranges.
- Enable Gradient Flow: Different activations affect how gradients propagate.
Let’s understand why non-linear activations are crucial:
- Linear Case (No activation):
# Layer 1
h1 = w11*x1 + w12*x2 + b1

# Layer 2
output = w21*h1 + b2

# Substituting h1 into output:
output = w21*(w11*x1 + w12*x2 + b1) + b2
       # = (w21*w11)*x1 + (w21*w12)*x2 + (w21*b1 + b2)
       # = W1*x1 + W2*x2 + B   -> still just a linear function!
No matter how many layers we stack, without activation functions, we just get another linear function. This means we can’t learn complex patterns!
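A quick numerical check makes the collapse concrete. The sketch below is a minimal NumPy illustration (the shapes and random weights are arbitrary assumptions): two stacked linear layers produce exactly the same outputs as a single merged linear layer.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                             # batch of 4 inputs, 3 features
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)    # layer 1 weights
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)    # layer 2 weights

two_layers = (x @ W1 + b1) @ W2 + b2                    # stacked linear layers
merged = x @ (W1 @ W2) + (b1 @ W2 + b2)                 # one equivalent linear layer
print(np.allclose(two_layers, merged))                  # True: same function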
- Non-linear Case (With ReLU):
# Layer 1
h1 = relu(w11*x1 + w12*x2 + b1)   # i.e. max(0, w11*x1 + w12*x2 + b1)

# Layer 2
output = w21*h1 + b2
Now the network can learn:
- Different behaviors for positive/negative inputs
- AND/OR-like logic gates
- Curved decision boundaries
- Complex feature interactions
Consider the XOR function:
# Truth table
# x1 x2 | output
# 0 0 | 0
# 0 1 | 1
# 1 0 | 1
# 1 1 | 0
# XOR implementation with ReLU
# Layer 1: create two decision boundaries
h1 = relu(x1 + x2 - 1.5)   # activates only when both inputs are 1
h2 = relu(x1 + x2 - 0.5)   # activates when at least one input is 1

# Layer 2: subtract the "both-1" case from the "at-least-one-1" case
output = 2*h2 - 6*h1       # weights chosen so the output is exactly 0 or 1
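The sketch below (plain NumPy, with an illustrative helper function) evaluates this two-layer network on all four rows of the truth table:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def xor_net(x1, x2):
    h1 = relu(x1 + x2 - 1.5)
    h2 = relu(x1 + x2 - 0.5)
    return 2*h2 - 6*h1

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))   # 0.0, 1.0, 1.0, 0.0 -- matches XOR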
This demonstrates how non-linear activations allow the network to:
- Create complex decision boundaries
- Learn hierarchical features
- Approximate any continuous function (Universal Approximation Theorem)
Section 1.37 - Common Activation Functions
1. ReLU (Rectified Linear Unit)
\[ f(x) = \max(0, x) \]
Properties:
- Easy gradient computation
- Sparse activation (outputs exact zeros)
- No upper bound
- Default choice for hidden layers
Use Cases:
- Hidden layers in most networks
- Deep networks (helps with vanishing gradient)
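For reference, a minimal NumPy sketch of ReLU and its gradient (the function names are illustrative, not from any particular library):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)       # 1 where the unit is active, 0 otherwise

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))                         # [0.  0.  0.  0.5 2. ]  (sparse: exact zeros)
print(relu_grad(x))                    # [0. 0. 0. 1. 1.]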
2. Sigmoid
\[ f(x) = \frac{1}{1 + e^{-x}} \]
Properties:
- Output range: (0, 1)
- Smooth gradient
- Can saturate (gradient ≈ 0 for large \(\lvert x \rvert\))
Use Cases:
- Binary classification output layer
- Gates in LSTM/GRU cells
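A small NumPy sketch makes the saturation visible; the gradient sigmoid(x) * (1 - sigmoid(x)) is near zero for large |x|:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
s = sigmoid(x)
print(s)             # ~[0.00005 0.269 0.5 0.731 0.99995]
print(s * (1 - s))   # gradient: ~0 at the extremes, 0.25 at x = 0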
3. Tanh (Hyperbolic Tangent)
\[ f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \]
Properties:
- Output range: (-1, 1)
- Zero-centered
- Stronger gradients than sigmoid
Use Cases:
- Hidden layers in shallow networks
- RNN/LSTM states
- Normalized outputs
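NumPy provides tanh directly; a minimal sketch showing the zero-centered outputs and the gradient 1 - tanh(x)^2:

import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(np.tanh(x))          # zero-centered outputs in (-1, 1)
print(1 - np.tanh(x)**2)   # gradient reaches 1.0 at x = 0 (vs 0.25 for sigmoid)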
4. Softmax
\[ f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} \]
Properties:
- Outputs sum to 1
- Emphasizes largest values
- Converts logits to probabilities
Use Cases:
- Multiclass classification output layer
- Attention mechanisms
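A minimal NumPy sketch; subtracting the maximum logit before exponentiating is a standard trick (assumed here, not mandated by the formula) to avoid overflow without changing the result:

import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # shift for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())                 # ~[0.659 0.242 0.099], sums to 1.0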
5. LeakyReLU
\[ f(x) = \max(\alpha x, x) \quad \text{where } \alpha \approx 0.01 \]
Properties:
- Like ReLU but allows small negative gradients
- Helps prevent the “dying ReLU” problem
Use Cases:
- Alternative to ReLU in deep networks
- When dead neurons are an issue
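A minimal NumPy sketch using alpha = 0.01 as in the formula above:

import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small negative slope keeps gradients alive

x = np.array([-5.0, -1.0, 0.0, 2.0])
print(leaky_relu(x))                       # [-0.05 -0.01  0.    2.  ]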
Section 1.38 - Choosing Activation Functions
Guidelines by Layer Type
Hidden Layers:
- Start with ReLU (default choice)
- Try LeakyReLU if you see dead neurons
- Use Tanh for shallow networks or RNNs
Output Layer:
| Task Type              | Activation    | Output Range           |
|------------------------|---------------|------------------------|
| Binary Classification  | Sigmoid       | (0, 1)                 |
| Multiclass             | Softmax       | [0, 1] (sums to 1)     |
| Regression             | Linear (none) | \((-\infty, \infty)\)  |
| Bounded Regression     | Tanh          | (-1, 1)                |
Implementation in Keras
from keras import layers

# As a string
layer = layers.Dense(64, activation='relu')

# As a function
from keras.activations import relu
layer = layers.Dense(64, activation=relu)

# As a separate layer
layer = layers.Dense(64)
activation = layers.Activation('relu')
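Putting the guidelines together, here is a minimal sketch of a multiclass classifier (the 20 input features and 10 output classes are illustrative assumptions, not from the text above):

import keras
from keras import layers

model = keras.Sequential([
    keras.Input(shape=(20,)),                 # 20 input features
    layers.Dense(64, activation='relu'),      # hidden layers: ReLU by default
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax'),   # output layer: softmax over 10 classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])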
Section 1.39 - Universal Approximation
Why Neural Networks Can Approximate Any Function
The Universal Approximation Theorem states that a neural network with:
- One hidden layer (wide enough)
- A suitable non-linear activation function (any continuous, non-polynomial activation works)
- A linear output layer
can approximate any continuous function on a compact domain to arbitrary precision.
Consider \[ f(x) = x^2 \]:
# Can be approximated by a ReLU network:
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Each ReLU "ramp" adds extra slope once x passes its kink point.
def approx_square(x):
    h1 = 0.0 * relu(x - 0.0)   # slope 0 near x = 0 (x**2 is flat there)
    h2 = 1.0 * relu(x - 0.5)   # add slope 1 past x = 0.5
    h3 = 2.0 * relu(x - 1.0)   # add slope 2 past x = 1.0
    return h1 + h2 + h3        # a coarse piecewise-linear approximation
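Evaluating the sketch on a few points shows it only roughly tracks \(x^2\); adding more ReLU pieces (more hidden units) makes the piecewise-linear fit finer:

xs = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
print(approx_square(xs))   # [0.   0.   0.5  2.   3.5 ]
print(xs**2)               # [0.   0.25 1.   2.25 4.  ]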
Key Points:
- More neurons = finer approximation
- Deeper networks = more efficient approximation
- Choice of activation affects:
- How many neurons are needed
- How easy it is to train
- The quality of approximation
Practical Implications:
- The theoretical capability does not necessarily mean it’s easy to train.
- ReLU often works best in practice because:
- Its linear pieces are easy to optimize
- It promotes sparse activation (many exact zeros)
- It provides strong gradient flow when active
Modern Variants
- GELU (Gaussian Error Linear Unit)
\[ f(x) = x \cdot \Phi(x) \]
where \(\Phi(x)\) is the cumulative distribution function of the standard normal distribution.
- Smooth transitions like sigmoid
- Used in transformers (BERT, GPT)
- SiLU/Swish (Sigmoid Linear Unit)
\[ f(x) = x \cdot \operatorname{sigmoid}(x) = \frac{x}{1 + e^{-x}} \]
- Similar to GELU but computationally simpler
- Strong performance in deep networks
- SELU (Scaled Exponential Linear Unit)
\[ f(x) = \lambda \begin{cases} x, & \text{if } x > 0, \\ \alpha \left(e^x - 1\right), & \text{if } x \leq 0, \end{cases} \]
where \(\lambda \approx 1.0507\) and \(\alpha \approx 1.6733\).
- Self-normalizing properties
- Good for deep fully-connected networks (a NumPy sketch of all three variants follows below)
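The following is a minimal NumPy sketch of GELU, SiLU, and SELU (it assumes SciPy is available for the exact normal CDF via erf; the SELU constants are the values quoted above):

import numpy as np
from scipy.special import erf

def gelu(x):
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))              # x * Phi(x)

def silu(x):
    return x / (1.0 + np.exp(-x))                               # x * sigmoid(x)

def selu(x, lam=1.0507, alpha=1.6733):
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-3.0, 3.0, 7)
print(gelu(x))
print(silu(x))
print(selu(x))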
While there are many activation functions, ReLU is the standard choice for most applications. Only switch if you have a specific reason or observe specific problems in your network.