Attention Mechanisms: Learning What to Focus On

Course: Advanced Concepts

Understanding attention mechanisms in neural networks: from basic self-attention to multi-head architectures.

Author: Remi Genet

Published: 2025-04-03

Attention Mechanisms: Adaptive Information Processing

Section 4.17 - The Attention Principle

Traditional neural networks process all input elements equally, which can be inefficient when not all information is equally relevant. Attention mechanisms solve this by learning to focus on the pertinent parts of the input. They compute dynamic weights that determine how much each input element should influence the output.

Section 4.18 - Mathematical Framework of Self-Attention

The core attention mechanism transforms a set of key–value pairs and a query into an output. Each query \(q\) interacts with all keys \(k\) to determine attention weights for the corresponding values \(v\).

The attention function can be expressed mathematically as:

\[ A(Q, K, V) = \text{softmax}\Biggl(\frac{QK^T}{\sqrt{d}}\Biggr)V, \]

where:

- \(Q \in \mathbb{R}^{n \times d}\) represents queries,
- \(K \in \mathbb{R}^{m \times d}\) represents keys,
- \(V \in \mathbb{R}^{m \times d}\) represents values,
- \(d\) is the dimension of the key vectors,
- \(\sqrt{d}\) is a scaling factor that stabilizes gradients.

The scaling factor \(\sqrt{d}\) prevents the dot products from growing too large in magnitude, which would lead to extremely small gradients through the softmax function.
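To make the formula concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention (the function name and the toy shapes are illustrative, not part of the course code):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d)) V for a single head."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, m) compatibility scores
    scores -= scores.max(axis=-1, keepdims=True)     # max-subtraction for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax: each query's weights sum to 1
    return weights @ V                               # (n, d_v) weighted sum of the values

# Toy example: n = 2 queries attending over m = 3 key/value pairs of dimension d = 4.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)   # (2, 4)
```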

Section 4.19 - Multi-Head Attention

Multi-head attention extends the basic attention mechanism by allowing the model to focus on different aspects of the input simultaneously. For \(h\) heads, we compute:

\[ \text{MultiHead}(Q, K, V) = \bigl[H_1, \ldots, H_h\bigr]W_H, \]

where each head is computed as:

\[ H_i = \text{Attention}\Bigl(QW_Q^{(i)},\, KW_K^{(i)},\, VW_V^{(i)}\Bigr). \]

The learnable parameter matrices have the following dimensions:

\[ W_Q^{(i)} \in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad W_K^{(i)} \in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad W_V^{(i)} \in \mathbb{R}^{d_{\text{model}} \times d_v}, \quad W_H \in \mathbb{R}^{(h \cdot d_v) \times d_{\text{model}}}. \]

Here, \(d_{\text{model}}\) is the model's total dimension, \(d_k\) is the dimension of the key (and query) vectors per head, and \(d_v\) is the dimension of the value vectors per head.
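Keras 3, used throughout this course, exposes multi-head attention as a built-in layer that handles the per-head projections \(W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}\) and the output projection \(W_H\) internally. A minimal self-attention sketch, with the model dimension and number of heads chosen purely for illustration:

```python
import numpy as np
import keras

d_model, num_heads = 64, 8
mha = keras.layers.MultiHeadAttention(num_heads=num_heads,
                                      key_dim=d_model // num_heads)  # d_k = d_v = 8 per head

# Self-attention over a batch of 2 sequences of length 10 with d_model features each.
x = np.random.normal(size=(2, 10, d_model)).astype("float32")
out = mha(query=x, value=x, key=x)   # each head applies its own learned projections
print(out.shape)                     # (2, 10, 64): heads are concatenated and projected back
```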

Section 4.20 - Attention Masks

Attention masks allow us to control which inputs can attend to which other inputs. A mask \(M\) is a matrix added to the attention scores before the softmax:

\[ \text{Masked\_Attention}(Q, K, V) = \text{softmax}\Biggl(\frac{QK^T}{\sqrt{d}} + M\Biggr)V, \]

where \(M\) contains \(-\infty\) for positions that should be masked out and \(0\) elsewhere. After the softmax, the masked positions have an attention weight of zero.

Common masking patterns include:

  1. Causal masking for autoregressive models, preventing attention to future positions (see the sketch below).
  2. Padding masks to ignore placeholder tokens.
  3. Task-specific masks to enforce desired attention patterns.
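As an illustration of the first pattern, here is a minimal NumPy sketch of an additive causal mask applied to toy attention scores (shapes and variable names are illustrative):

```python
import numpy as np

def causal_mask(n):
    """Additive mask: 0 where attention is allowed, -inf strictly above the diagonal (future positions)."""
    upper = np.triu(np.ones((n, n)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

scores = np.random.normal(size=(4, 4))            # toy Q K^T / sqrt(d) scores
masked = scores + causal_mask(4)                  # future positions become -inf
masked -= masked.max(axis=-1, keepdims=True)
weights = np.exp(masked)
weights /= weights.sum(axis=-1, keepdims=True)    # masked positions get exactly zero weight
print(np.round(weights, 2))                       # lower-triangular attention pattern
```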

Section 4.21 - Properties and Advantages

Attention mechanisms offer several important properties:

  1. Dynamic Computation: Unlike fixed-weight operations, attention weights adapt to each input.
  2. Global Dependencies: Each output can depend directly on any input, regardless of distance.
  3. Parallel Computation: The attention operation can be computed for all queries simultaneously.
  4. Interpretability: Attention weights provide insight into which inputs the model considers important.

The flexibility and power of attention mechanisms have made them fundamental building blocks in modern neural architectures, particularly in sequence processing tasks where capturing long-range dependencies is crucial.
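As a concrete illustration of the interpretability property, the Keras MultiHeadAttention layer can return the attention weights alongside its output, which makes it possible to inspect what each query position attends to (a short sketch with illustrative shapes):

```python
import numpy as np
import keras

mha = keras.layers.MultiHeadAttention(num_heads=4, key_dim=16)
x = np.random.normal(size=(1, 10, 64)).astype("float32")

# return_attention_scores=True also returns the per-head softmax weights,
# with shape (batch, heads, query_positions, key_positions).
out, scores = mha(query=x, value=x, return_attention_scores=True)
print(scores.shape)  # (1, 4, 10, 10)
```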
