Modern RNN Architectures
Section 2.8 - Gated Recurrent Unit (GRU)
The GRU was introduced as a simpler alternative to LSTM, offering similar capabilities with fewer parameters. The key idea is to merge the cell state and hidden state while maintaining effective control over information flow.
Core Mathematical Components
Update Gate (\(z_t\)):
\[ z_t = \sigma\Bigl(W_z \cdot [h_{t-1}, x_t] + b_z\Bigr) \]
Controls how much of the previous state is carried over versus replaced by new content (via the factor \(1 - z_t\) in the final update below).
Reset Gate (\(r_t\)):
\[ r_t = \sigma\Bigl(W_r \cdot [h_{t-1}, x_t] + b_r\Bigr) \]
Controls how much of the previous state is exposed when computing the new memory content, i.e. how much past information to reset.
New Memory Content (\(\tilde{h}_t\)):
\[ \tilde{h}_t = \tanh\Bigl(W \cdot [r_t \odot h_{t-1}, x_t] + b\Bigr) \]
Final Update:
\[ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \]
Compared with the LSTM, the GRU makes three structural simplifications (a minimal code sketch of one step follows this list):
- Gate Fusion: Merges the LSTM's forget and input gates into a single update gate and drops the output gate.
- State Fusion: Uses a single state vector instead of separate cell and hidden states.
- Direct Skip Connection: The \((1 - z_t) \odot h_{t-1}\) term lets information flow through time largely unimpeded.
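To make the equations above concrete, here is a minimal NumPy sketch of a single GRU step. The function name `gru_step`, the parameter dictionary, and the weight shapes are illustrative choices, not any particular library's API.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step following the equations above.

    x_t    : (input_dim,)  current input
    h_prev : (hidden_dim,) previous hidden state
    params : dict of weights; each W_* has shape (hidden_dim, hidden_dim + input_dim).
    """
    concat = np.concatenate([h_prev, x_t])                 # [h_{t-1}, x_t]

    z = sigmoid(params["W_z"] @ concat + params["b_z"])    # update gate
    r = sigmoid(params["W_r"] @ concat + params["b_r"])    # reset gate

    # Candidate state uses the reset-gated previous state.
    concat_reset = np.concatenate([r * h_prev, x_t])
    h_tilde = np.tanh(params["W_h"] @ concat_reset + params["b_h"])

    # Interpolate between the old state and the candidate.
    return (1.0 - z) * h_prev + z * h_tilde

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
params = {
    "W_z": rng.normal(size=(hidden_dim, hidden_dim + input_dim)) * 0.1,
    "b_z": np.zeros(hidden_dim),
    "W_r": rng.normal(size=(hidden_dim, hidden_dim + input_dim)) * 0.1,
    "b_r": np.zeros(hidden_dim),
    "W_h": rng.normal(size=(hidden_dim, hidden_dim + input_dim)) * 0.1,
    "b_h": np.zeros(hidden_dim),
}
h = np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):   # unroll over 5 time steps
    h = gru_step(x, h, params)
```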
Section 2.9 - Recent Innovations: TKAN
TKAN (Temporal Kolmogorov-Arnold Network) represents a novel approach combining classical RNN concepts with KAN principles.
Mathematical Foundation
KAN Base Layer:
\[ f(x) = \sum_{q=1}^{2n+1} \Phi_q\Bigl(\sum_{p=1}^{n} \phi_{q,p}(x_p)\Bigr) \]
This is the Kolmogorov-Arnold representation of an \(n\)-variable function, with learnable univariate functions \(\phi_{q,p}\) and \(\Phi_q\).
Temporal Extension:
\[ s_t = W_x \cdot x_t + W_h \cdot h_{t-1} \]
\[ h_t = \operatorname{KAN}(s_t) \]
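The sketch below illustrates these two equations with a toy "KAN-like" layer in which each univariate function is a learnable mixture of fixed Gaussian basis functions, a simplification of the spline parameterization used in real KAN/TKAN implementations. All class and function names and all shapes are illustrative assumptions.

```python
import numpy as np

class ToyKANLayer:
    """Simplified KAN-style layer: out[q] = sum_p phi_{q,p}(x[p]), where each
    phi_{q,p} is a learnable mixture of fixed Gaussian bumps. Real KAN layers
    use B-spline bases; stacking two such layers recovers the nested form
    f(x) = sum_q Phi_q(sum_p phi_{q,p}(x_p)). This is only a sketch."""

    def __init__(self, in_dim, out_dim, n_basis=8, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        self.centers = np.linspace(-2.0, 2.0, n_basis)                 # fixed basis centers
        self.coef = rng.normal(size=(out_dim, in_dim, n_basis)) * 0.1  # learnable coefficients

    def __call__(self, x):
        # Evaluate each Gaussian basis function on each coordinate: (in_dim, n_basis)
        basis = np.exp(-((x[:, None] - self.centers[None, :]) ** 2))
        # phi_{q,p}(x_p) = sum_k coef[q, p, k] * basis[p, k], then sum over p
        return np.einsum("qpk,pk->q", self.coef, basis)

def tkan_step(x_t, h_prev, W_x, W_h, kan):
    """Temporal extension from the text: s_t = W_x x_t + W_h h_{t-1}; h_t = KAN(s_t)."""
    s_t = W_x @ x_t + W_h @ h_prev
    return kan(s_t)

# Usage with random parameters (shapes are illustrative choices).
rng = np.random.default_rng(1)
input_dim, hidden_dim = 3, 6
kan = ToyKANLayer(hidden_dim, hidden_dim, rng=rng)
W_x = rng.normal(size=(hidden_dim, input_dim)) * 0.1
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
h = np.zeros(hidden_dim)
for x in rng.normal(size=(4, input_dim)):
    h = tkan_step(x, h, W_x, W_h, kan)
```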
Memory Management
RKAN Component:
\[ \tilde{h}_t = W_{hh} \cdot h_{t-1} + W_{hz} \cdot \tilde{o}_t \]
where \(\tilde{o}_t\) denotes the output of the KAN sublayer at step \(t\).
Gating Mechanism:
\[ f_t = \sigma\Bigl(W_f \cdot x_t + U_f \cdot h_{t-1}\Bigr) \quad \text{(Forget gate)} \]
\[ i_t = \sigma\Bigl(W_i \cdot x_t + U_i \cdot h_{t-1}\Bigr) \quad \text{(Input gate)} \]
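The snippet below simply evaluates the three displayed equations; how the gates, the RKAN memory candidate, and the sublayer outputs are combined into the next state follows the TKAN paper and is not reproduced here. All names and shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rkan_memory_candidate(h_prev, o_tilde, W_hh, W_hz):
    """RKAN component: h~_t = W_hh h_{t-1} + W_hz o~_t, where o~_t is taken to be
    the output of a KAN sublayer (assumption based on the surrounding text)."""
    return W_hh @ h_prev + W_hz @ o_tilde

def tkan_gates(x_t, h_prev, W_f, U_f, W_i, U_i):
    """Forget and input gates exactly as displayed above."""
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev)   # forget gate
    i_t = sigmoid(W_i @ x_t + U_i @ h_prev)   # input gate
    return f_t, i_t

# Illustrative shapes only: input_dim = 4, hidden_dim = 6.
rng = np.random.default_rng(2)
x_t, h_prev = rng.normal(size=4), rng.normal(size=6)
f_t, i_t = tkan_gates(x_t, h_prev,
                      rng.normal(size=(6, 4)), rng.normal(size=(6, 6)),
                      rng.normal(size=(6, 4)), rng.normal(size=(6, 6)))
h_cand = rkan_memory_candidate(h_prev, rng.normal(size=6),
                               rng.normal(size=(6, 6)), rng.normal(size=(6, 6)))
```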
The main benefits of this design are:
- Learnable Activation Functions:
- KAN layers learn optimal transformations.
- Better adaptation to data patterns.
- Enhanced Memory:
- Multiple memory paths through KAN sublayers.
- More stable gradient flow.
Section 2.10 - Comparative Analysis
Memory Management Approaches
- GRU:
- Uses a single state vector.
- Two gates (update and reset) with direct state updates.
- TKAN:
- Employs multiple KAN sublayers with learnable transformations.
- Provides complex memory paths.
Mathematical Characteristics
GRU Gradient Path:
\[ \frac{\partial L}{\partial h_t} = \frac{\partial L_t}{\partial h_t} + (1 - z_{t+1}) \odot \frac{\partial L}{\partial h_{t+1}} + \cdots \]
The first term is the direct loss contribution at step \(t\); the second flows back through the additive \((1 - z_{t+1}) \odot h_t\) term of the state update, so when the update gate stays small the gradient passes through time with little attenuation.
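As a quick numeric illustration of this path, the sketch below multiplies the \((1 - z)\) factors accumulated along the skip connection over 50 steps for two gate regimes; the specific gate values are made up for illustration.

```python
import numpy as np

def skip_path_factor(z_values):
    """Product of (1 - z_t) factors along the GRU skip connection, i.e. the scale
    applied to a gradient flowing back over len(z_values) steps through the
    additive (1 - z) path alone."""
    return np.prod(1.0 - np.asarray(z_values))

steps = 50
print(skip_path_factor(np.full(steps, 0.05)))  # gate mostly "keep":   ~7.7e-2
print(skip_path_factor(np.full(steps, 0.50)))  # gate mostly "update": ~8.9e-16
```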
TKAN Gradient Path:
\[ \frac{\partial L}{\partial h_t} = \frac{\partial L_t}{\partial h_t} + \sum_i \Bigl(W_i^{\top} \cdot \frac{\partial L}{\partial h_{t+1}}\Bigr) + \cdots \]
Multiple weighted pathways (indexed by \(i\)) contribute to the gradient flow, rather than a single gated skip connection.
Practical Trade-offs
- GRU:
- Simpler implementation.
- Well-suited for medium-length sequences.
- Efficient training.
- TKAN:
- Better for modeling complex patterns.
- Involves more parameters to tune.
- Potentially offers better generalization.
Section 2.11 - Evolution of RNN Architectures
The progression from simple RNNs to modern architectures like TKAN shows a trend toward:
- Better Memory Management:
- Evolving from simple state updates to sophisticated gating.
- Incorporating multiple pathways for information flow.
- Improved Gradient Flow:
- Utilizing skip connections and multiple timescale processing.
- Adaptive Processing:
- Leveraging learnable transformations.
- Enabling context-dependent behavior.
The key insight is that all these architectures are fundamentally different mathematical approaches to solving the same core problems:
1. Managing information flow through time.
2. Balancing short- and long-term dependencies.
3. Maintaining stable gradient flow.