Training Parameters and Practical Considerations
Section 3.14 - Feature Scaling: A Critical Step
Why Scaling is Crucial
Feature scaling is arguably the most important preprocessing step in deep learning. Consider a network with inputs \( x_1 \) and \( x_2 \):
\[ h = \tanh(w_1 x_1 + w_2 x_2 + b) \]
Problems without scaling:
- Gradient Issues:
- If \( x_1 \) and \( x_2 \) have very different scales (say, one in the units and one in the thousands), the weights \( w_1 \) and \( w_2 \) see very different input ranges.
- The gradients with respect to \( w_1 \) and \( w_2 \) are proportional to \( x_1 \) and \( x_2 \), so they will have vastly different magnitudes.
- This imbalance makes optimization nearly impossible
- Activation Function Saturation:
- Large input values can push activations into their saturation regions.
- For example, \[ \tanh(1000) \approx \tanh(10000) \approx 1 \]
- This leads to vanishing gradients.
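The saturation effect is easy to verify numerically. The following sketch (plain NumPy, added here for illustration) evaluates \( \tanh(x) \) and its derivative \( 1 - \tanh^2(x) \) at a few input magnitudes:

import numpy as np

# For large inputs tanh is flat, so its gradient is effectively zero.
for x in (0.5, 5.0, 1000.0):
    y = np.tanh(x)
    grad = 1.0 - y**2  # d/dx tanh(x) = 1 - tanh(x)^2
    print(f"x = {x:>7}: tanh(x) = {y:.6f}, gradient = {grad:.2e}")

At \( x = 1000 \) the gradient underflows to zero, which is exactly the vanishing-gradient scenario described above.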
Without proper scaling:
- The optimizer struggles to find good solutions.
- Training becomes unstable.
- The model might not converge at all.
Scaling Methods
Standardization (Preferred for deep learning):
X_scaled = (X - mean) / std
- Centers data around 0.
- Scales data to have unit variance.
- Works well with common activation functions.
Min-Max Scaling:
X_scaled = (X - X.min()) / (X.max() - X.min())
- Maps features to the \([0, 1]\) interval.
- Useful when bounded outputs are required.
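If scikit-learn is available (an assumption, not something this text requires), both methods have ready-made equivalents; the sketch below applies them to a toy two-column matrix:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Two features on very different scales.
X = np.array([[0.2, 1000.0],
              [0.5, 5000.0],
              [0.9, 9000.0]])

X_std = StandardScaler().fit_transform(X)     # zero mean, unit variance per column
X_minmax = MinMaxScaler().fit_transform(X)    # each column mapped to [0, 1]

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))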
Section 3.15 - Batch Size: Finding the Sweet Spot
Batch Size Selection Principles
A good rule of thumb is to choose a batch size that yields between 100 and 500 batches per epoch; the helper below targets roughly 300:
import numpy as np

def optimal_batch_size(n_samples, target_batches=300):
    batch_size = n_samples // target_batches
    # Round down to a power of 2 for GPU efficiency
    return 2**int(np.log2(batch_size))
Example calculations:
- 100,000 samples → batch_size ≈ 256 (391 batches)
- 10,000 samples → batch_size ≈ 32 (313 batches)
- 1,000,000 samples → batch_size ≈ 2048 (488 batches)
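A quick check of these numbers, reusing optimal_batch_size (and its numpy import) from the snippet above; batches per epoch is reported here as n_samples / batch_size before rounding:

for n in (100_000, 10_000, 1_000_000):
    bs = optimal_batch_size(n)
    print(f"{n:>9} samples -> batch_size = {bs:>4}, ~{n / bs:.1f} batches per epoch")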
- Too few batches (<100): Not enough updates per epoch.
- Too many batches (>500): Training becomes unnecessarily slow.
- Sweet spot: Provides a good balance between speed and stability.
Impact on Training
- Statistical Effects:
- Larger batches yield more precise gradient estimates.
- Smaller batches introduce noise, which can help in exploring the loss landscape.
- Optimization Effects:
- Larger batches might require higher learning rates (see the learning-rate scaling sketch after this list).
- Smaller batches have an inherent regularization effect.
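One common heuristic that ties these two effects together is the linear scaling rule: when the batch size is multiplied by some factor, the learning rate is scaled by the same factor (usually combined with a short warmup). The helper below is an illustrative sketch, not part of the original text, and the base values are placeholders:

def scaled_learning_rate(base_lr, base_batch_size, batch_size):
    # Linear scaling rule: learning rate grows in proportion to batch size.
    return base_lr * batch_size / base_batch_size

# Hypothetical example: a learning rate tuned at batch size 32, reused at 256.
print(scaled_learning_rate(1e-3, 32, 256))  # 0.008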
Section 3.16 - Weight Initialization
The Importance of Proper Initialization
Poor weight initialization can lead to:
1. Vanishing gradients
2. Exploding gradients
3. Dead neurons (especially with ReLU activations)
Key Initialization Methods
Xavier/Glorot Initialization (for tanh or sigmoid activations):
std = np.sqrt(2.0 / (fan_in + fan_out))
W = np.random.normal(0, std, size=(fan_in, fan_out))
He Initialization (for ReLU activations):
std = np.sqrt(2.0 / fan_in)
W = np.random.normal(0, std, size=(fan_in, fan_out))
- For ReLU, use He initialization.
- For tanh or sigmoid, use Xavier (Glorot) initialization.
- For linear activations, Xavier initialization is also recommended.
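A quick way to see why these formulas matter is to push a batch of unit-variance inputs through a few He-initialized ReLU layers and watch the activation scale. The NumPy sketch below (an illustration added here, not from the original text) shows that the root-mean-square activation stays roughly constant instead of shrinking or exploding:

import numpy as np

rng = np.random.default_rng(0)
fan_in = fan_out = 512
x = rng.standard_normal((1000, fan_in))  # batch of unit-variance inputs

for layer in range(5):
    # He initialization: std = sqrt(2 / fan_in)
    W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
    x = np.maximum(0.0, x @ W)  # ReLU activation
    print(f"layer {layer}: activation RMS = {np.sqrt((x ** 2).mean()):.3f}")

With a much smaller or larger weight standard deviation, the same loop would show the activations collapsing toward zero or blowing up.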
Layer-Specific Considerations
from tensorflow.keras.layers import Dense

# Dense layer with ReLU activation
layer = Dense(
    units=64,
    activation='relu',
    kernel_initializer='he_normal',
    bias_initializer='zeros'
)
# Dense layer with tanh activation
layer = Dense(
    units=64,
    activation='tanh',
    kernel_initializer='glorot_normal',
    bias_initializer='zeros'
)
Section 3.17 - Validation Strategy
Train-Validation Split
For many applications, a simple holdout validation set is sufficient. For example, in time series data you might use:
# Using the last 20% of data for validation
val_size = int(len(X) * 0.2)
X_train, y_train = X[:-val_size], y[:-val_size]
X_val, y_val = X[-val_size:], y[-val_size:]
Validation Best Practices
- Time Series Data:
- Always split the data chronologically.
- Do not shuffle the time series.
- Choose the validation size based on the forecast horizon.
- Financial Data:
- Account for different market regimes.
- Consider using multiple validation periods (see the walk-forward sketch at the end of this section).
- Test on data representing various market conditions.
- Data Preparation:
- Fit scaling parameters on the training data only (see the scaler sketch above).
- Apply the same scaling parameters to the validation set.
- Ensure that the validation set is representative of future data.
- Monitoring:
- Watch the gap between training and validation loss.
- Monitor multiple performance metrics.
- Be cautious of sudden changes in validation performance.
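For the multiple-validation-periods point above, one option (assuming scikit-learn; the arrays here are placeholders) is TimeSeriesSplit, which produces expanding-window, chronologically ordered folds:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(1000).reshape(-1, 1)   # placeholder chronological features
y = np.arange(1000)                  # placeholder targets

tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    print(f"fold {fold}: train [0, {train_idx[-1]}], "
          f"validate [{val_idx[0]}, {val_idx[-1]}]")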
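For monitoring, a common Keras pattern is to track val_loss with an EarlyStopping callback. The sketch below assumes a compiled Keras model named model and the scaled arrays from the snippets above; the epoch and batch-size values are placeholders:

from tensorflow.keras.callbacks import EarlyStopping

# Stop when validation loss stops improving and restore the best weights.
early_stop = EarlyStopping(monitor='val_loss', patience=10,
                           restore_best_weights=True)

history = model.fit(X_train_scaled, y_train,
                    validation_data=(X_val_scaled, y_val),
                    epochs=200, batch_size=256,
                    callbacks=[early_stop])

# history.history['loss'] vs history.history['val_loss'] gives the
# train/validation gap to watch.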