Neural Network Embeddings: Learning Meaningful Representations
Embeddings: Mapping Entities to Vector Spaces
Section 4.13 - The Embedding Principle
Neural networks operate on continuous numerical values, yet many real-world inputs are discrete entities: words, categories, user IDs, or market symbols. Embeddings solve this fundamental mismatch by learning continuous vector representations of discrete entities.
An embedding is formally a mapping function \[ E: X \to \mathbb{R}^d \] that transforms elements from a discrete set \( X \) into \( d \)-dimensional real vectors. The key insight is that these vectors are learned during training to capture meaningful relationships between entities.
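As a concrete sketch (the framework choice and sizes are illustrative assumptions, not part of the text), PyTorch's `torch.nn.Embedding` implements exactly this mapping as a learnable lookup table:

```python
import torch
import torch.nn as nn

# A hypothetical vocabulary of 10,000 discrete entities, each mapped into R^32.
n_entities, d = 10_000, 32
embedding = nn.Embedding(num_embeddings=n_entities, embedding_dim=d)

# Integer IDs index into the learned table: E(i) returns the i-th vector.
ids = torch.tensor([3, 17, 4999])
vectors = embedding(ids)
print(vectors.shape)  # torch.Size([3, 32])
```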
Section 4.14 - Mathematical Framework
One-Hot Encoding Limitation
Traditional one-hot encoding represents a categorical variable with \( n \) possible values as:
\[ e_i = [0, \ldots, 0, 1, 0, \ldots, 0] \in \mathbb{R}^n \]
where the 1 appears in the \( i \)-th position. This representation has several limitations:
- Dimensionality grows linearly with vocabulary size.
- It provides no notion of similarity between entities.
- The sparse representation wastes computational resources.
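To make the sparsity concrete, here is a minimal sketch (the vocabulary size is an arbitrary assumption) that builds a one-hot vector and shows how little of it carries information:

```python
import torch
import torch.nn.functional as F

n = 10_000                                  # illustrative vocabulary size
i = torch.tensor([42])
e_i = F.one_hot(i, num_classes=n).float()   # shape: (1, 10000)

# A single 1 among 10,000 entries; distinct one-hot vectors are always
# orthogonal, so the encoding carries no similarity information.
print(e_i.sum().item())                     # 1.0
print((e_i > 0).float().mean().item())      # 0.0001
```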
Learned Embeddings
Instead, we learn a dense embedding matrix \[ W \in \mathbb{R}^{d \times n}, \] where \( d \ll n \) and the \( i \)-th column of \( W \) stores the embedding of entity \( i \). For an input \( i \), its embedding becomes:
\[ x_i = W e_i \in \mathbb{R}^d. \]
This transformation offers several advantages:
1. Reduced Dimensionality: \( d \ll n \).
2. Dense Representation: Enables efficient computation.
3. Learned Similarities: Captures relationships between entities.
4. Continuous Space: Supports gradient-based optimization.
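The lookup interpretation is worth seeing directly. In the sketch below (sizes are illustrative), multiplying \( W \) by a one-hot vector simply selects the corresponding column, which is why frameworks implement embeddings as an indexed lookup rather than a matrix product:

```python
import torch

n, d = 10_000, 32
W = torch.randn(d, n)        # dense embedding matrix; column i embeds entity i

i = 42
e_i = torch.zeros(n)
e_i[i] = 1.0

x_dense = W @ e_i            # x_i = W e_i: a full matrix-vector product, O(nd)
x_lookup = W[:, i]           # the equivalent column lookup, O(d)

print(torch.allclose(x_dense, x_lookup))  # True
```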
Section 4.15 - Training Embeddings
Embeddings are learned end-to-end with the rest of the neural network through gradient descent. For a loss function \( L \), the gradient with respect to the embedding matrix follows from the chain rule:
\[ \frac{\partial L}{\partial W} = \frac{\partial L}{\partial x_i} \cdot \frac{\partial x_i}{\partial W} = \frac{\partial L}{\partial x_i} \cdot e_i^\top. \]
This learning process adjusts the embeddings to minimize the task loss while capturing useful relationships between entities.
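A tiny experiment makes the sparsity of this gradient visible (the table size and toy loss are assumptions for illustration; note that PyTorch stores the table row-wise, i.e. \( W^\top \) in this section's notation):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(5, 3)       # 5 entities, d = 3
ids = torch.tensor([2])        # look up entity 2 only

loss = emb(ids).pow(2).sum()   # toy loss L = ||x_2||^2
loss.backward()

# The outer product (dL/dx_i) e_i^T is zero except at the looked-up
# entry, so only row 2 of the table receives any gradient.
print(emb.weight.grad)
```

Because only looked-up entries receive gradient, each training step updates just the embeddings of the entities present in the batch, which keeps training tractable even for very large vocabularies.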
Geometric Interpretation
The geometry of the embedding space reflects semantic relationships. For instance, if entities \( a \) and \( b \) are represented by vectors \( x_a \) and \( x_b \), their relationship can be measured by:
- Euclidean Distance: \[ \| x_a - x_b \|_2. \]
- Cosine Similarity: \[ \frac{x_a \cdot x_b}{\|x_a\|_2 \|x_b\|_2}. \]
- Dot Product: \[ x_a \cdot x_b. \]
These metrics capture different aspects of similarity in the embedded space.
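The sketch below computes all three measures for a pair of illustrative vectors; the numbers themselves are arbitrary:

```python
import torch
import torch.nn.functional as F

x_a = torch.tensor([1.0, 2.0, 0.5])
x_b = torch.tensor([0.8, 1.9, 0.7])

euclidean = torch.dist(x_a, x_b, p=2)          # ||x_a - x_b||_2
cosine = F.cosine_similarity(x_a, x_b, dim=0)  # direction only, ignores norms
dot = torch.dot(x_a, x_b)                      # mixes direction and magnitude

print(euclidean.item(), cosine.item(), dot.item())
```

Cosine similarity discards vector magnitude, while the dot product conflates direction with norm, so the right choice depends on whether magnitude carries meaning in the learned space.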
Section 4.16 - Applications Beyond Categorical Variables
While embeddings were initially developed for categorical variables, their applications extend much further.
Numerical Feature Embedding
Even continuous features can benefit from embeddings. For a numerical value \( v \), we can learn a nonlinear embedding:
\[ E(v) = W_2\, \sigma(W_1 v + b_1) + b_2, \]
where \( \sigma \) is a nonlinear activation function. This allows the network to learn a more expressive representation of the feature.
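A minimal module implementing this two-layer embedding might look as follows (the hidden width, output dimension, and the choice of ReLU for \( \sigma \) are all assumptions):

```python
import torch
import torch.nn as nn

class NumericEmbedding(nn.Module):
    """E(v) = W2 sigma(W1 v + b1) + b2 for a scalar feature v (a sketch)."""
    def __init__(self, hidden: int = 16, d: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden),   # W1 v + b1
            nn.ReLU(),              # sigma (any nonlinearity works)
            nn.Linear(hidden, d),   # W2 (.) + b2
        )

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        return self.net(v.unsqueeze(-1))  # (batch,) -> (batch, d)

emb = NumericEmbedding()
print(emb(torch.tensor([0.5, -1.2])).shape)  # torch.Size([2, 8])
```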
Time Embeddings
In sequential models, time itself can be embedded. Given a timestamp \( t \), we can create positional embeddings:
\[ E(t) = \bigl[\sin(\omega_k t),\, \cos(\omega_k t)\bigr]_{k=1}^{d/2}, \]
where the \( \omega_k \) are distinct frequencies. Because \( \sin(\omega_k(t + \Delta)) \) and \( \cos(\omega_k(t + \Delta)) \) are linear combinations of \( \sin(\omega_k t) \) and \( \cos(\omega_k t) \), this formulation captures both absolute and relative temporal positions.
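A sketch of this construction, using the geometric frequency schedule popularized by the Transformer's positional encoding (the schedule and dimensions are assumptions; the formula above does not fix a particular \( \omega_k \)):

```python
import torch

def time_embedding(t: torch.Tensor, d: int = 8) -> torch.Tensor:
    """Sinusoidal embedding [sin(w_k t), cos(w_k t)] for k = 1..d/2."""
    k = torch.arange(d // 2, dtype=torch.float32)
    omega = 1.0 / (10_000.0 ** (2 * k / d))   # w_k, one frequency per pair
    angles = t.unsqueeze(-1) * omega          # (batch, d/2)
    # All sines then all cosines; interleaving them instead is an
    # equivalent ordering convention.
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

print(time_embedding(torch.tensor([0.0, 1.0, 2.0])).shape)  # torch.Size([3, 8])
```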
Entity Embeddings
In financial applications, market symbols, sectors, or other discrete entities can be embedded to capture inherent relationships:
- Assets within the same sector should have similar embeddings.
- Companies with similar market behavior should be close in embedding space.
- The embedding can capture complex relationships not explicitly encoded in the data.
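As a hypothetical illustration (the tickers, dimensions, and any apparent structure are invented for the example; meaningful similarity emerges only after training on a downstream task):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical universe of tickers mapped to integer IDs.
tickers = {"AAPL": 0, "MSFT": 1, "XOM": 2, "CVX": 3}
asset_emb = nn.Embedding(len(tickers), 16)

def similarity(a: str, b: str) -> float:
    """Cosine similarity between two asset embeddings."""
    xa = asset_emb(torch.tensor(tickers[a]))
    xb = asset_emb(torch.tensor(tickers[b]))
    return F.cosine_similarity(xa, xb, dim=0).item()

# After training, same-sector pairs such as (XOM, CVX) would be
# expected to score higher than cross-sector pairs like (AAPL, XOM).
print(similarity("XOM", "CVX"), similarity("AAPL", "XOM"))
```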
The dimensionality \( d \) of these embeddings is a hyperparameter that balances:
- Representational Capacity: larger \( d \) can encode richer relationships,
- Computational Efficiency: smaller \( d \) reduces memory and compute, and
- Generalization: an oversized \( d \) invites overfitting on rare entities.
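There is no closed-form answer for choosing \( d \), but practitioners often start from a rule of thumb. The sketch below quotes the heuristic used in the fastai library (cited here as a sensible default to adapt, not a prescription):

```python
def emb_dim_rule(n_categories: int) -> int:
    """fastai-style heuristic: grow d sublinearly with n, capped at 600."""
    return min(600, round(1.6 * n_categories ** 0.56))

for n in (10, 1_000, 100_000):
    print(n, emb_dim_rule(n))  # 10 -> 6, 1000 -> 77, 100000 -> 600
```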
The power of embeddings lies in their ability to learn meaningful representations automatically from data, capturing complex relationships in a form that neural networks can effectively process.