# Long Short-Term Memory Networks (LSTM): The Backbone of Sequence Processing

## Section 2.4 - Historical Context and Significance
### Origins and Evolution
LSTMs were introduced by Hochreiter & Schmidhuber in 1997, yet they remain one of the most effective sequence processing architectures. Their endurance in the field stems from:
- Robust Architecture:
  - Carefully designed gating mechanisms
  - Stable gradient flow
  - Explicit memory management
- Practical Success:
  - Proven effectiveness in time series
  - Strong performance in financial forecasting
  - Reliable training behavior
Despite being over 25 years old, LSTMs often outperform newer architectures in financial applications, particularly volatility forecasting and trend prediction.
## Section 2.5 - LSTM Architecture Deep Dive
### Core Components
An LSTM cell maintains two states:

1. Cell State (\(c_t\)): long-term memory
2. Hidden State (\(h_t\)): current output / short-term state
### Mathematical Formulation
The LSTM updates these states through three gates (forget, input, and output) plus a candidate state:
Forget Gate (determines what to remove from cell state): \[ f_t = \sigma\Bigl(W_f \cdot [h_{t-1}, x_t] + b_f\Bigr) \]
Input Gate (determines what new information to store): \[ i_t = \sigma\Bigl(W_i \cdot [h_{t-1}, x_t] + b_i\Bigr) \] \[ \tilde{c}_t = \tanh\Bigl(W_c \cdot [h_{t-1}, x_t] + b_c\Bigr) \]
Cell State Update: \[ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \]
Output Gate (determines what parts of cell state to output): \[ o_t = \sigma\Bigl(W_o \cdot [h_{t-1}, x_t] + b_o\Bigr) \] \[ h_t = o_t \odot \tanh(c_t) \]
Where:

- \(\sigma\) is the sigmoid function: \[ \sigma(x) = \frac{1}{1+e^{-x}} \]
- \(\tanh\) is the hyperbolic tangent: \[ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \]
- \(\odot\) represents element-wise multiplication.
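To connect the formulas to code, here is a minimal NumPy sketch of a single LSTM step. It transcribes the equations above directly; the function name, argument layout, and concatenated-input convention are illustrative choices, not Keras internals.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step following the equations above.

    Each W_* has shape (units, units + n_features) and acts on the
    concatenation [h_{t-1}, x_t]; each b_* has shape (units,).
    """
    z = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)        # forget gate
    i_t = sigmoid(W_i @ z + b_i)        # input gate
    c_tilde = np.tanh(W_c @ z + b_c)    # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde  # additive cell-state update
    o_t = sigmoid(W_o @ z + b_o)        # output gate
    h_t = o_t * np.tanh(c_t)            # hidden state
    return h_t, c_t

# Tiny smoke test with random parameters (units=4, n_features=3)
rng = np.random.default_rng(0)
units, n_feat = 4, 3
Ws = [rng.normal(size=(units, units + n_feat)) * 0.1 for _ in range(4)]
bs = [np.zeros(units) for _ in range(4)]
h, c = np.zeros(units), np.zeros(units)
h, c = lstm_step(rng.normal(size=n_feat), h, c, *Ws, *bs)
```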
The choice of activation functions is crucial for LSTM stability:
- Sigmoid (\(\sigma\)) for Gates:
  - Output range \([0,1]\)
  - Acts as soft gates
  - Smooth gradients
- tanh for State Transforms:
  - Output range \([-1,1]\)
  - Zero-centered
  - Helps with gradient flow
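A quick numerical check of these ranges (the sample points are arbitrary):

```python
import numpy as np

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
sigmoid = 1.0 / (1.0 + np.exp(-x))

print(sigmoid)      # ~[0.00, 0.27, 0.50, 0.73, 1.00] -> bounded in (0, 1)
print(np.tanh(x))   # ~[-1.00, -0.76, 0.00, 0.76, 1.00] -> zero-centered, in (-1, 1)
```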
## Section 2.6 - Implementation in Keras
```python
from keras import Sequential, layers

# Simple LSTM for financial time series
model = Sequential([
    layers.LSTM(64,
                input_shape=(sequence_length, n_features),
                activation='tanh',               # State activation
                recurrent_activation='sigmoid',  # Gate activation
                return_sequences=True),          # Return full sequence
    layers.LSTM(32),                             # Return only final output
    layers.Dense(1)                              # Prediction
])
```
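A usage sketch for this model, assuming `sequence_length` and `n_features` are set before the model above is built; the random arrays are synthetic stand-ins for real windowed features and targets.

```python
import numpy as np

sequence_length, n_features = 30, 5   # placeholder dimensions (set before building the model)

# Synthetic stand-ins for, e.g., rolling windows of returns and next-step targets
X = np.random.randn(256, sequence_length, n_features).astype('float32')
y = np.random.randn(256, 1).astype('float32')

model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)
```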
Understanding Each Component:
```python
# Detailed LSTM configuration
lstm_layer = layers.LSTM(
    units=64,                             # Size of output
    activation='tanh',                    # State transform
    recurrent_activation='sigmoid',       # Gates
    use_bias=True,                        # Include bias terms
    kernel_initializer='glorot_uniform',  # Weight initialization
    recurrent_initializer='orthogonal',   # Important for stability
    bias_initializer='zeros',
    unit_forget_bias=True                 # Initialize forget gate bias to 1
)
```
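One design note worth verifying: with `unit_forget_bias=True`, Keras initializes the forget-gate slice of the bias to ones, so \(f_t = \sigma(1) \approx 0.73\) at the start of training and the cell initially retains most of its state. The gate ordering (input, forget, candidate, output) used for the slice below is a Keras implementation detail; treat it as an assumption.

```python
import numpy as np

# Build the layer so its weights exist (feature dimension of 5 is arbitrary)
lstm_layer.build((None, 30, 5))

kernel, recurrent_kernel, bias = lstm_layer.get_weights()
units = 64

# Assuming Keras's (i, f, c, o) packing, the forget-gate bias slice is all ones
print(bias[units:2 * units])    # expected: array of 1.0s
print(1 / (1 + np.exp(-1.0)))   # sigmoid(1) ~ 0.731
```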
## Section 2.7 - Memory Management
### How LSTM Manages Information
- Short-term Memory (Hidden State):
  - Updated at every time step
  - Directly used for outputs
  - Influenced by current input and cell state
- Long-term Memory (Cell State):
  - Protected by gates
  - Can maintain information over long sequences
  - Selective updates through forget and input gates
In financial time series:

- The cell state can track the market regime
- The forget gate can adapt to regime changes
- The input gate can identify significant events
- The output gate can focus on relevant features for prediction
### Gradient Flow
The LSTM’s architecture helps with gradient flow through:
- Additive Updates:
  \[ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \]
  This additive structure creates a direct path for gradients (see the sketch after this list).
- Gating Mechanisms:
  - Gates are differentiable.
  - They help control gradient magnitude.
  - Sigmoid bounds help prevent explosion/vanishing.
- Cell State Highway:
  - Provides a direct path through time steps.
  - Protected by the forget gate.
  - Involves minimal transformations.
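To make the cell-state highway concrete, a back-of-envelope sketch: along the direct path, \(\partial c_T / \partial c_0\) is (element-wise) the product of the forget-gate activations, if we ignore the gates' indirect dependence on earlier states. Persistence then hinges on keeping \(f_t\) near 1:

```python
import numpy as np

T = 100
rng = np.random.default_rng(0)

# Forget-gate activations near 1 keep the direct path open;
# values nearer 0 close it.
f_open  = np.full(T, 0.99)            # gate mostly open at every step
f_mixed = rng.uniform(0.3, 0.7, T)    # gate partially closed

# d c_T / d c_0 along the cell-state path is the product of forget gates
print(np.prod(f_open))    # ~0.37: the signal survives 100 steps
print(np.prod(f_mixed))   # ~1e-31: effectively vanished
```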
The careful balance of activation functions (sigmoid and tanh) combined with the gating mechanism helps maintain stable gradient flow, which is crucial for training on long sequences of financial data.