Convolutional Layers: From Images to Time Series
Convolutional Operations: A Unified Mathematical Framework
Section 4.9 - The Mathematical Foundation of Convolutions
At its core, a convolution is an operation between two functions that produces a third function expressing how the shape of one is modified by the other. In deep learning, we use discrete convolutions where one function is our input data and the other is our learnable kernel.
The Basic Convolution Operation
For a one-dimensional input signal \(x\) and a kernel \(w\), the convolution operation is defined as:
\[ (x * w)(t) = \sum_{k} x(t - k) \, w(k) \]
In practice, we work with finite, discrete signals. For an input vector \(x \in \mathbb{R}^n\) and a kernel \(w \in \mathbb{R}^k\), the discrete convolution becomes:
\[ y[t] = \sum_{i=0}^{k-1} x[t - i] \, w[i], \]
where \(k\) is the kernel size.
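To make the indexing concrete, here is a minimal NumPy sketch (the signal and kernel values are arbitrary, chosen only for illustration) that evaluates the sum directly and checks it against np.convolve in "valid" mode:

    import numpy as np

    # Arbitrary example signal and kernel, chosen only for illustration.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # input, n = 5
    w = np.array([0.5, 0.25, 0.25])           # kernel, k = 3

    # Direct evaluation of y[t] = sum_i x[t - i] * w[i] at every t where all
    # indices t - i fall inside the signal (the "valid" positions).
    k = len(w)
    y = np.array([sum(x[t - i] * w[i] for i in range(k))
                  for t in range(k - 1, len(x))])

    print(y)                                # manual evaluation
    print(np.convolve(x, w, mode="valid"))  # same result from NumPy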
Section 4.10 - From 1D to Multi-Dimensional Convolutions
While convolutions are often associated with image processing (2D convolutions), the operation generalizes naturally across dimensions:
1D Convolution (Time Series)
Used for temporal data, where the convolution slides over time:
\[ y[t] = \sum_{k} x[t - k] \, w[k] \]
2D Convolution (Images)
For spatial data with input \(X\) and kernel \(W\):
\[ Y[i,j] = \sum_{m} \sum_{n} X[i - m,\, j - n] \, W[m,n] \]
3D Convolution (Videos/Volumes)
Extends to three dimensions for spatio-temporal or volumetric data:
\[ Y[i,j,k] = \sum_{l} \sum_{m} \sum_{n} X[i - l,\, j - m,\, k - n] \, W[l,m,n] \]
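The same sliding-window idea carries over directly to deep learning libraries. The sketch below (assuming PyTorch; the batch size, channel counts, and spatial sizes are arbitrary) shows the corresponding 1D, 2D, and 3D layers and their output shapes for kernel size 5 with no padding:

    import torch
    import torch.nn as nn

    # Illustrative inputs shaped (batch, channels, ...); all sizes are arbitrary.
    x1 = torch.randn(8, 3, 100)          # 1D: 100 time steps
    x2 = torch.randn(8, 3, 64, 64)       # 2D: 64x64 images
    x3 = torch.randn(8, 3, 16, 64, 64)   # 3D: 16-frame clips/volumes

    conv1d = nn.Conv1d(in_channels=3, out_channels=16, kernel_size=5)
    conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)
    conv3d = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=5)

    print(conv1d(x1).shape)  # torch.Size([8, 16, 96])
    print(conv2d(x2).shape)  # torch.Size([8, 16, 60, 60])
    print(conv3d(x3).shape)  # torch.Size([8, 16, 12, 60, 60])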
Section 4.11 - Convolutions in Time Series Analysis
In time series analysis, 1D convolutions serve several crucial purposes:
Moving Average as Convolution
A simple moving average can be expressed as a convolution with a uniform kernel:
\[ w = \left[\frac{1}{k},\, \frac{1}{k},\, \dots,\, \frac{1}{k}\right] \]
The output at each point is the average of the current value and the \(k-1\) preceding values:
\[ y[t] = \frac{1}{k} \sum_{i=0}^{k-1} x[t - i] \]
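As a quick check, here is a minimal NumPy sketch (the series values are arbitrary) showing that convolving with the uniform kernel reproduces the \(k\)-point moving average:

    import numpy as np

    # Arbitrary series, chosen only for illustration.
    x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
    k = 3
    w = np.ones(k) / k                       # uniform kernel [1/k, ..., 1/k]

    # Convolution with the uniform kernel yields the k-point moving average.
    print(np.convolve(x, w, mode="valid"))   # [ 4.  6.  8. 10.]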
Learnable Temporal Patterns
In neural networks, the kernel weights are learned from data. A 1D convolutional layer with input \(x \in \mathbb{R}^n\) and \(c\) kernels \(w_{(i)} \in \mathbb{R}^k\) produces, for each output channel \(i\), the output:
\[ y_{(i)}[t] = \sigma\Bigl(\sum_{k} x[t - k] \, w_{(i)}[k] + b_{(i)}\Bigr) \]
where:
- \(\sigma\) is a nonlinear activation function,
- \(b_{(i)}\) is a learnable bias term,
- \(i\) ranges from 1 to \(c\) (the number of output channels).
This operation can learn to detect various temporal patterns (a minimal code sketch follows this list):
- Short-term dependencies: captured with small kernel sizes.
- Long-term patterns: captured using dilated convolutions.
- Multi-scale features: achieved using parallel convolutions with different kernel sizes.
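As a concrete illustration, the following is a minimal sketch of such a layer in PyTorch (the channel counts, kernel size, and ReLU activation are arbitrary choices standing in for \(c\), \(k\), and \(\sigma\)):

    import torch
    import torch.nn as nn

    # A 1D convolutional block mirroring the equation above: c learnable kernels,
    # one bias per output channel, and a pointwise nonlinearity.
    class TemporalConvBlock(nn.Module):
        def __init__(self, in_channels=1, out_channels=8, kernel_size=3):
            super().__init__()
            self.conv = nn.Conv1d(in_channels, out_channels, kernel_size)
            self.activation = nn.ReLU()   # plays the role of sigma

        def forward(self, x):
            return self.activation(self.conv(x))

    # Arbitrary input: a batch of 4 univariate series with 50 time steps each.
    x = torch.randn(4, 1, 50)
    y = TemporalConvBlock()(x)
    print(y.shape)   # torch.Size([4, 8, 48]): c = 8 channels, 48 valid positions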
Dilated Convolutions
To capture long-range dependencies without increasing the parameter count, dilated convolutions introduce gaps in the kernel:
\[ y[t] = \sum_{k} x[t - d\, k] \, w[k], \]
where \(d\) is the dilation rate. When the dilation rate is increased exponentially across successive layers, the receptive field grows exponentially with depth while the number of kernel weights per layer stays fixed.
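The sketch below (assuming PyTorch; the kernel size of 3 and the dilation schedule 1, 2, 4, 8 are illustrative choices) stacks dilated layers whose dilation rate doubles at each step, reaching a receptive field of 31 time steps with only three kernel weights per layer:

    import torch
    import torch.nn as nn

    # Arbitrary input: one channel, 100 time steps.
    x = torch.randn(1, 1, 100)

    # Doubling the dilation rate at each layer (1, 2, 4, 8) grows the receptive
    # field to 1 + (3 - 1) * (1 + 2 + 4 + 8) = 31 steps, with only 3 kernel
    # weights per layer; padding = dilation keeps the sequence length unchanged.
    stack = nn.Sequential(*[
        nn.Conv1d(1, 1, kernel_size=3, dilation=d, padding=d)
        for d in (1, 2, 4, 8)
    ])
    print(stack(x).shape)   # torch.Size([1, 1, 100])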
Section 4.12 - Theoretical Properties
Convolutions possess several important properties that make them particularly effective for pattern recognition:
Translation Equivariance: If the input is shifted by \(\delta\), the output shifts by \(\delta\):
\[ \operatorname{Conv}(T_\delta x) = T_\delta \operatorname{Conv}(x), \]
where \(T_\delta\) denotes translation by \(\delta\).
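This property is easy to verify numerically. The sketch below (NumPy; the signal values and shift are arbitrary, and the trailing zeros make np.roll act as a plain translation) shows that shifting then convolving matches convolving then shifting:

    import numpy as np

    # Arbitrary signal padded with trailing zeros so np.roll acts as a plain
    # (non-circular) shift, plus an arbitrary kernel.
    x = np.array([0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 0.0, 0.0])
    w = np.array([0.5, 0.25, 0.25])
    shift = 2

    conv_then_shift = np.roll(np.convolve(x, w), shift)   # T_delta(Conv(x))
    shift_then_conv = np.convolve(np.roll(x, shift), w)   # Conv(T_delta(x))

    print(np.allclose(conv_then_shift, shift_then_conv))  # True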
Local Connectivity: Each output point depends only on a local region of the input, reducing computational complexity.
Parameter Sharing: The same kernel is applied across all positions, dramatically reducing the number of parameters compared to fully connected layers.
These properties make convolutional layers particularly effective for tasks where patterns may appear at different positions in the input sequence, while maintaining both computational and statistical efficiency in learning.