Training Parameters and Practical Considerations
Section 3.14 - Feature Scaling: A Critical Step
Why Scaling is Crucial
Feature scaling is arguably the most important preprocessing step in deep learning. Consider a network with inputs \( x_1 \) and \( x_2 \):
\[ h = \tanh(w_1 x_1 + w_2 x_2 + b) \]
Problems without scaling:
- Gradient Issues:
- If \( x_1 \) and \( x_2 \) have very different scales (say, one in the units and one in the thousands), the weights \( w_1 \) and \( w_2 \) see very different input ranges.
- The gradients with respect to \( w_1 \) and \( w_2 \) are proportional to \( x_1 \) and \( x_2 \), so they will have vastly different magnitudes.
- This imbalance makes optimization nearly impossible
- Activation Function Saturation:
- Large input values can push activations into their saturation regions.
- For example, \[ \tanh(1000) \approx \tanh(10000) \approx 1 \]
- This leads to vanishing gradients.
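The saturation effect is easy to verify numerically. The following sketch (plain NumPy, added here for illustration) evaluates \( \tanh(x) \) and its derivative \( 1 - \tanh^2(x) \) at a few input magnitudes:

import numpy as np

# For large inputs tanh is flat, so its gradient is effectively zero.
for x in (0.5, 5.0, 1000.0):
    y = np.tanh(x)
    grad = 1.0 - y**2  # d/dx tanh(x) = 1 - tanh(x)^2
    print(f"x = {x:>7}: tanh(x) = {y:.6f}, gradient = {grad:.2e}")

At \( x = 1000 \) the gradient underflows to zero, which is exactly the vanishing-gradient scenario described above.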
Without proper scaling:
- The optimizer struggles to find good solutions.
- Training becomes unstable.
- The model might not converge at all.
Scaling Methods
Standardization (Preferred for deep learning):
X_scaled = (X - mean) / std
- Centers data around 0.
- Scales data to have unit variance.
- Works well with common activation functions.
Min-Max Scaling:
X_scaled = (X - X.min()) / (X.max() - X.min())
- Maps features to the \([0, 1]\) interval.
- Useful when bounded outputs are required.
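If scikit-learn is available (an assumption, not something this text requires), both methods have ready-made equivalents; the sketch below applies them to a toy two-column matrix:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Two features on very different scales.
X = np.array([[0.2, 1000.0],
              [0.5, 5000.0],
              [0.9, 9000.0]])

X_std = StandardScaler().fit_transform(X)     # zero mean, unit variance per column
X_minmax = MinMaxScaler().fit_transform(X)    # each column mapped to [0, 1]

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))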
Section 3.15 - Batch Size: Finding the Sweet Spot
Batch Size Selection Principles
A good rule of thumb is to choose a batch size that yields between 100 and 500 batches per epoch; the helper below targets roughly 300:
import numpy as np

def optimal_batch_size(n_samples, target_batches=300):
    batch_size = n_samples // target_batches
    # Round down to a power of 2 for GPU efficiency
    return 2**int(np.log2(batch_size))
Example calculations:
- 100,000 samples → batch_size ≈ 256 (391 batches)
- 10,000 samples → batch_size ≈ 32 (313 batches)
- 1,000,000 samples → batch_size ≈ 2048 (488 batches)
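A quick check of these numbers, reusing optimal_batch_size (and its numpy import) from the snippet above; batches per epoch is reported here as n_samples / batch_size before rounding:

for n in (100_000, 10_000, 1_000_000):
    bs = optimal_batch_size(n)
    print(f"{n:>9} samples -> batch_size = {bs:>4}, ~{n / bs:.1f} batches per epoch")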
- Too few batches (<100): Not enough updates per epoch.
- Too many batches (>500): Training becomes unnecessarily slow.
- Sweet spot: Provides a good balance between speed and stability.
Impact on Training
- Statistical Effects:
- Larger batches yield more precise gradient estimates.
- Smaller batches introduce noise, which can help in exploring the loss landscape.
- Optimization Effects:
- Larger batches might require higher learning rates (see the learning-rate scaling sketch after this list).
- Smaller batches have an inherent regularization effect.
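One common heuristic that ties these two effects together is the linear scaling rule: when the batch size is multiplied by some factor, the learning rate is scaled by the same factor (usually combined with a short warmup). The helper below is an illustrative sketch, not part of the original text, and the base values are placeholders:

def scaled_learning_rate(base_lr, base_batch_size, batch_size):
    # Linear scaling rule: learning rate grows in proportion to batch size.
    return base_lr * batch_size / base_batch_size

# Hypothetical example: a learning rate tuned at batch size 32, reused at 256.
print(scaled_learning_rate(1e-3, 32, 256))  # 0.008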
Section 3.16 - Weight Initialization
The Importance of Proper Initialization
Poor weight initialization can lead to:
1. Vanishing gradients
2. Exploding gradients
3. Dead neurons (especially with ReLU activations)
Key Initialization Methods
Xavier/Glorot Initialization (for tanh or sigmoid activations):
std = np.sqrt(2.0 / (fan_in + fan_out))
W = np.random.normal(0, std, size=(fan_in, fan_out))
He Initialization (for ReLU activations):
std = np.sqrt(2.0 / fan_in)
W = np.random.normal(0, std, size=(fan_in, fan_out))
- For ReLU, use He initialization.
- For tanh or sigmoid, use Xavier (Glorot) initialization.
- For linear activations, Xavier initialization is also recommended.
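A quick way to see why these formulas matter is to push a batch of unit-variance inputs through a few He-initialized ReLU layers and watch the activation scale. The NumPy sketch below (an illustration added here, not from the original text) shows that the root-mean-square activation stays roughly constant instead of shrinking or exploding:

import numpy as np

rng = np.random.default_rng(0)
fan_in = fan_out = 512
x = rng.standard_normal((1000, fan_in))  # batch of unit-variance inputs

for layer in range(5):
    # He initialization: std = sqrt(2 / fan_in)
    W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
    x = np.maximum(0.0, x @ W)  # ReLU activation
    print(f"layer {layer}: activation RMS = {np.sqrt((x ** 2).mean()):.3f}")

With a much smaller or larger weight standard deviation, the same loop would show the activations collapsing toward zero or blowing up.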
Layer-Specific Considerations
from tensorflow.keras.layers import Dense

# Dense layer with ReLU activation
layer = Dense(
    units=64,
    activation='relu',
    kernel_initializer='he_normal',
    bias_initializer='zeros'
)
# Dense layer with tanh activation
layer = Dense(
    units=64,
    activation='tanh',
    kernel_initializer='glorot_normal',
    bias_initializer='zeros'
)
Section 3.17 - Validation Strategy
Train-Validation Split
For many applications, a simple holdout validation set is sufficient. For example, in time series data you might use:
# Using the last 20% of data for validation
val_size = int(len(X) * 0.2)
X_train, y_train = X[:-val_size], y[:-val_size]
X_val, y_val = X[-val_size:], y[-val_size:]
Validation Best Practices
- Time Series Data:
- Always split the data chronologically.
- Do not shuffle the time series.
- Choose the validation size based on the forecast horizon.
- Financial Data:
- Account for different market regimes.
- Consider using multiple validation periods (see the walk-forward sketch at the end of this section).
- Test on data representing various market conditions.
- Data Preparation:
- Fit scaling parameters on the training data only (see the scaler sketch above).
- Apply the same scaling parameters to the validation set.
- Ensure that the validation set is representative of future data.
- Monitoring:
- Watch the gap between training and validation loss.
- Monitor multiple performance metrics.
- Be cautious of sudden changes in validation performance.
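For the multiple-validation-periods point above, one option (assuming scikit-learn; the arrays here are placeholders) is TimeSeriesSplit, which produces expanding-window, chronologically ordered folds:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(1000).reshape(-1, 1)   # placeholder chronological features
y = np.arange(1000)                  # placeholder targets

tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    print(f"fold {fold}: train [0, {train_idx[-1]}], "
          f"validate [{val_idx[0]}, {val_idx[-1]}]")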
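For monitoring, a common Keras pattern is to track val_loss with an EarlyStopping callback. The sketch below assumes a compiled Keras model named model and the scaled arrays from the snippets above; the epoch and batch-size values are placeholders:

from tensorflow.keras.callbacks import EarlyStopping

# Stop when validation loss stops improving and restore the best weights.
early_stop = EarlyStopping(monitor='val_loss', patience=10,
                           restore_best_weights=True)

history = model.fit(X_train_scaled, y_train,
                    validation_data=(X_val_scaled, y_val),
                    epochs=200, batch_size=256,
                    callbacks=[early_stop])

# history.history['loss'] vs history.history['val_loss'] gives the
# train/validation gap to watch.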