Attention Mechanisms: Learning What to Focus On
Section 4.17 - The Attention Principle
Traditional neural network layers process every input with the same fixed weights, which can be inefficient when not all information is equally relevant. Attention mechanisms address this by learning to focus on the pertinent parts of the input: they compute dynamic, input-dependent weights that determine how much each input element should influence the output.
Section 4.18 - Mathematical Framework of Self-Attention
The core attention mechanism maps a query and a set of key–value pairs to an output. Each query \(q\) interacts with all keys \(k\) to determine attention weights for the corresponding values \(v\).
The attention function can be expressed mathematically as:
\[ \text{Attention}(Q, K, V) = \text{softmax}\Biggl(\frac{QK^T}{\sqrt{d}}\Biggr)V, \]
where:
- \(Q \in \mathbb{R}^{n \times d}\) represents queries,
- \(K \in \mathbb{R}^{m \times d}\) represents keys,
- \(V \in \mathbb{R}^{m \times d}\) represents values,
- \(d\) is the dimension of the key vectors,
- \(\sqrt{d}\) is a scaling factor that stabilizes gradients.
The scaling factor \(\sqrt{d}\) prevents the dot products from growing too large in magnitude, which would lead to extremely small gradients through the softmax function.
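The following is a minimal NumPy sketch of the scaled dot-product attention defined above. The function names and the toy shapes in the usage example are illustrative assumptions, not part of the text.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q: (n, d) queries, K: (m, d) keys, V: (m, d_v) values.
    Returns the (n, d_v) output and the (n, m) attention weights.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, m) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Usage with random inputs (shapes only, no trained parameters).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # n = 4 queries of dimension d = 8
K = rng.normal(size=(6, 8))   # m = 6 keys
V = rng.normal(size=(6, 8))   # m = 6 values
out, w = attention(Q, K, V)
print(out.shape, w.shape)     # (4, 8) (4, 6)
```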
Section 4.19 - Multi-Head Attention
Multi-head attention extends the basic attention mechanism by allowing the model to focus on different aspects of the input simultaneously. For \(h\) heads, we compute:
\[ \text{MultiHead}(Q, K, V) = \bigl[H_1, \ldots, H_h\bigr]W_H, \]
where each head is computed as:
\[ H_i = \text{Attention}\Bigl(QW_Q^{(i)},\, KW_K^{(i)},\, VW_V^{(i)}\Bigr). \]
The learnable parameter matrices have the following dimensions: \[ W_Q^{(i)} \in \mathbb{R}^{d_{\text{model}} \times d_k}, \] \[ W_K^{(i)} \in \mathbb{R}^{d_{\text{model}} \times d_k}, \] \[ W_V^{(i)} \in \mathbb{R}^{d_{\text{model}} \times d_v}, \] \[ W_H \in \mathbb{R}^{(h \cdot d_v) \times d_{\text{model}}}. \]
Here, \(d_{\text{model}}\) is the model's total dimension, \(d_k\) is the dimension of the key (and query) vectors per head, and \(d_v\) is the dimension of the value vectors per head.
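A sketch of multi-head attention under the same assumptions, reusing the `attention` function from the previous sketch. Passing the per-head projections as Python lists is a simplification for readability; practical implementations usually stack them into single tensors and use one batched matrix multiply.

```python
def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_H):
    """Multi-head attention following the equations above.

    W_Q, W_K: length-h lists of (d_model, d_k) matrices,
    W_V:      length-h list of (d_model, d_v) matrices,
    W_H:      (h * d_v, d_model) output projection.
    """
    heads = []
    for wq, wk, wv in zip(W_Q, W_K, W_V):
        h_i, _ = attention(Q @ wq, K @ wk, V @ wv)   # H_i has shape (n, d_v)
        heads.append(h_i)
    # Concatenate all heads along the feature axis, then project back to d_model.
    return np.concatenate(heads, axis=-1) @ W_H       # (n, d_model)

# Usage: self-attention with d_model = 16, h = 2 heads, d_k = d_v = 8.
rng = np.random.default_rng(1)
d_model, h, d_k, d_v, n = 16, 2, 8, 8, 5
X = rng.normal(size=(n, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
W_H = rng.normal(size=(h * d_v, d_model))
print(multi_head_attention(X, X, X, W_Q, W_K, W_V, W_H).shape)   # (5, 16)
```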
Section 4.20 - Attention Masks
Attention masks allow us to control which inputs can attend to which other inputs. A mask \(M\) is a matrix added to the attention scores before the softmax:
\[ \text{Masked\_Attention}(Q, K, V) = \text{softmax}\Biggl(\frac{QK^T}{\sqrt{d}} + M\Biggr)V, \]
where \(M\) contains \(-\infty\) for positions that should be masked out and \(0\) elsewhere. After the softmax, the masked positions receive an attention weight of zero.
Common masking patterns include:
1. Causal masking for autoregressive models, preventing attention to future positions (see the sketch after this list).
2. Padding masks to ignore placeholder tokens.
3. Task-specific masks to enforce desired attention patterns.
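Below is a sketch of causal masking under the same assumptions, reusing `softmax` from the first sketch; the helper names `causal_mask` and `masked_attention` are illustrative, not from the text.

```python
def causal_mask(n):
    """Additive mask: 0 on and below the diagonal, -inf strictly above it,
    so position i cannot attend to positions j > i."""
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_attention(Q, K, V, M):
    """Attention with an additive mask M applied to the scores before the softmax."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + M
    weights = softmax(scores, axis=-1)   # masked positions receive weight 0
    return weights @ V, weights

# Usage: causal self-attention over a sequence of length 4.
rng = np.random.default_rng(2)
X = rng.normal(size=(4, 8))
out, w = masked_attention(X, X, X, causal_mask(4))
print(np.round(w, 2))   # the strict upper triangle of the weights is exactly 0
```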
Section 4.21 - Properties and Advantages
Attention mechanisms offer several important properties:
- Dynamic Computation: Unlike fixed-weight operations, attention weights adapt to each input.
- Global Dependencies: Each output can depend directly on any input, regardless of distance.
- Parallel Computation: The attention operation can be computed for all queries simultaneously.
- Interpretability: Attention weights provide insight into which inputs the model considers important.
The flexibility and power of attention mechanisms have made them fundamental building blocks in modern neural architectures, particularly in sequence processing tasks where capturing long-range dependencies is crucial.