# Long Short-Term Memory Networks (LSTM): The Backbone of Sequence Processing

## Section 2.4 - Historical Context and Significance
### Origins and Evolution
LSTMs were introduced by Hochreiter & Schmidhuber in 1997, yet they remain one of the most effective sequence processing architectures. Their endurance in the field stems from:
- Robust Architecture:
  - Carefully designed gating mechanisms
  - Stable gradient flow
  - Explicit memory management
- Practical Success:
  - Proven effectiveness in time series
  - Strong performance in financial forecasting
  - Reliable training behavior
Despite being over 25 years old, LSTMs often outperform newer architectures in financial applications, particularly volatility forecasting and trend prediction.
## Section 2.5 - LSTM Architecture Deep Dive
### Core Components
An LSTM cell maintains two states:

1. Cell State (\(c_t\)): long-term memory
2. Hidden State (\(h_t\)): current output / short-term state
### Mathematical Formulation
The LSTM updates these states through three gates (forget, input, and output) plus a candidate state:
Forget Gate (determines what to remove from cell state): \[ f_t = \sigma\Bigl(W_f \cdot [h_{t-1}, x_t] + b_f\Bigr) \]
Input Gate (determines what new information to store): \[ i_t = \sigma\Bigl(W_i \cdot [h_{t-1}, x_t] + b_i\Bigr) \] \[ \tilde{c}_t = \tanh\Bigl(W_c \cdot [h_{t-1}, x_t] + b_c\Bigr) \]
Cell State Update: \[ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \]
Output Gate (determines what parts of cell state to output): \[ o_t = \sigma\Bigl(W_o \cdot [h_{t-1}, x_t] + b_o\Bigr) \] \[ h_t = o_t \odot \tanh(c_t) \]
Where:

- \(\sigma\) is the sigmoid function: \[ \sigma(x) = \frac{1}{1+e^{-x}} \]
- \(\tanh\) is the hyperbolic tangent: \[ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \]
- \(\odot\) represents element-wise multiplication.
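To connect the formulas to code, here is a minimal NumPy sketch of a single LSTM step. It transcribes the equations above directly; the function name, argument layout, and concatenated-input convention are illustrative choices, not Keras internals.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step following the equations above.

    Each W_* has shape (units, units + n_features) and acts on the
    concatenation [h_{t-1}, x_t]; each b_* has shape (units,).
    """
    z = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)        # forget gate
    i_t = sigmoid(W_i @ z + b_i)        # input gate
    c_tilde = np.tanh(W_c @ z + b_c)    # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde  # additive cell-state update
    o_t = sigmoid(W_o @ z + b_o)        # output gate
    h_t = o_t * np.tanh(c_t)            # hidden state
    return h_t, c_t

# Tiny smoke test with random parameters (units=4, n_features=3)
rng = np.random.default_rng(0)
units, n_feat = 4, 3
Ws = [rng.normal(size=(units, units + n_feat)) * 0.1 for _ in range(4)]
bs = [np.zeros(units) for _ in range(4)]
h, c = np.zeros(units), np.zeros(units)
h, c = lstm_step(rng.normal(size=n_feat), h, c, *Ws, *bs)
```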
The choice of activation functions is crucial for LSTM stability:
- Sigmoid (\(\sigma\)) for Gates:
  - Output range \([0,1]\)
  - Acts as soft gates
  - Smooth gradients
- tanh for State Transforms:
  - Output range \([-1,1]\)
  - Zero-centered
  - Helps with gradient flow
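A quick numerical check of these ranges (the sample points are arbitrary):

```python
import numpy as np

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
sigmoid = 1.0 / (1.0 + np.exp(-x))

print(sigmoid)      # ~[0.00, 0.27, 0.50, 0.73, 1.00] -> bounded in (0, 1)
print(np.tanh(x))   # ~[-1.00, -0.76, 0.00, 0.76, 1.00] -> zero-centered, in (-1, 1)
```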
## Section 2.6 - Implementation in Keras
```python
from keras import Sequential, layers

# Simple LSTM for financial time series
model = Sequential([
    layers.LSTM(64,
                input_shape=(sequence_length, n_features),
                activation='tanh',               # State activation
                recurrent_activation='sigmoid',  # Gate activation
                return_sequences=True),          # Return full sequence
    layers.LSTM(32),                             # Return only final output
    layers.Dense(1)                              # Prediction
])
```
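A usage sketch for this model, assuming `sequence_length` and `n_features` are set before the model above is built; the random arrays are synthetic stand-ins for real windowed features and targets.

```python
import numpy as np

sequence_length, n_features = 30, 5   # placeholder dimensions (set before building the model)

# Synthetic stand-ins for, e.g., rolling windows of returns and next-step targets
X = np.random.randn(256, sequence_length, n_features).astype('float32')
y = np.random.randn(256, 1).astype('float32')

model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)
```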
Understanding Each Component:
```python
# Detailed LSTM configuration
lstm_layer = layers.LSTM(
    units=64,                             # Size of output
    activation='tanh',                    # State transform
    recurrent_activation='sigmoid',       # Gates
    use_bias=True,                        # Include bias terms
    kernel_initializer='glorot_uniform',  # Weight initialization
    recurrent_initializer='orthogonal',   # Important for stability
    bias_initializer='zeros',
    unit_forget_bias=True                 # Initialize forget gate bias to 1
)
```
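One design note worth verifying: with `unit_forget_bias=True`, Keras initializes the forget-gate slice of the bias to ones, so \(f_t = \sigma(1) \approx 0.73\) at the start of training and the cell initially retains most of its state. The gate ordering (input, forget, candidate, output) used for the slice below is a Keras implementation detail; treat it as an assumption.

```python
import numpy as np

# Build the layer so its weights exist (feature dimension of 5 is arbitrary)
lstm_layer.build((None, 30, 5))

kernel, recurrent_kernel, bias = lstm_layer.get_weights()
units = 64

# Assuming Keras's (i, f, c, o) packing, the forget-gate bias slice is all ones
print(bias[units:2 * units])    # expected: array of 1.0s
print(1 / (1 + np.exp(-1.0)))   # sigmoid(1) ~ 0.731
```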
## Section 2.7 - Memory Management
### How LSTM Manages Information
- Short-term Memory (Hidden State):
  - Updated at every time step
  - Directly used for outputs
  - Influenced by current input and cell state
- Long-term Memory (Cell State):
  - Protected by gates
  - Can maintain information over long sequences
  - Selective updates through forget and input gates
In financial time series:

- The cell state can track the market regime
- The forget gate can adapt to regime changes
- The input gate can identify significant events
- The output gate can focus on relevant features for prediction
### Gradient Flow
The LSTM’s architecture helps with gradient flow through:
- Additive Updates:
  \[ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \]
  This additive structure creates a direct path for gradients (see the sketch after this list).
- Gating Mechanisms:
  - Gates are differentiable.
  - They help control gradient magnitude.
  - Sigmoid bounds help prevent explosion/vanishing.
- Cell State Highway:
  - Provides a direct path through time steps.
  - Protected by the forget gate.
  - Involves minimal transformations.
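To make the cell-state highway concrete, a back-of-envelope sketch: along the direct path, \(\partial c_T / \partial c_0\) is (element-wise) the product of the forget-gate activations, if we ignore the gates' indirect dependence on earlier states. Persistence then hinges on keeping \(f_t\) near 1:

```python
import numpy as np

T = 100
rng = np.random.default_rng(0)

# Forget-gate activations near 1 keep the direct path open;
# values nearer 0 close it.
f_open  = np.full(T, 0.99)            # gate mostly open at every step
f_mixed = rng.uniform(0.3, 0.7, T)    # gate partially closed

# d c_T / d c_0 along the cell-state path is the product of forget gates
print(np.prod(f_open))    # ~0.37: the signal survives 100 steps
print(np.prod(f_mixed))   # ~1e-31: effectively vanished
```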
The careful balance of activation functions (sigmoid and tanh) combined with the gating mechanism helps maintain stable gradient flow, which is crucial for training on long sequences of financial data.