Practical Assignment: Building a Transformer-Based Architecture for Time Series Forecasting
Author
Remi Genet
Published
2025-02-18
!pip install pandas numpy matplotlib jax keras scikit-learn pyarrow
Introduction
In this practical assignment, you will implement a transformer-based architecture for multivariate time series forecasting. We will use cryptocurrency trading volume data to predict future values. The architecture is inspired by the Temporal Fusion Transformer but adapted to the task.
We’ll break down the implementation into manageable components, focusing on the core concepts we’ve covered: residual connections, gating mechanisms, and attention.
Dataset: The data is available at https://github.com/remigenet/TKAN/blob/main/examples/data.parquet and contains hourly trading volumes for various cryptocurrencies.
Setup
First, let’s import our required libraries and load the data:
import os
BACKEND = 'jax'
os.environ['KERAS_BACKEND'] = BACKEND

import time
import numpy as np
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt

import keras
from keras import ops, backend, random
from keras import Model, Input
from keras.models import Sequential
from keras.layers import (
    Layer, GRU, LSTM, Dense, Add, LayerNormalization, Multiply, Reshape,
    Activation, TimeDistributed, Flatten, Lambda, MultiHeadAttention, Concatenate,
)
from sklearn.metrics import r2_score
from tkan import TKAN

N_MAX_EPOCHS = 100
BATCH_SIZE = 128

early_stopping_callback = lambda: keras.callbacks.EarlyStopping(
    monitor="val_loss",
    min_delta=0.00001,
    patience=6,
    mode="min",
    restore_best_weights=True,
    start_from_epoch=6,
)
lr_callback = lambda: keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",
    factor=0.25,
    patience=3,
    mode="min",
    min_delta=0.00001,
    min_lr=0.000025,
    verbose=0,
)
callbacks = lambda: [early_stopping_callback(), lr_callback(), keras.callbacks.TerminateOnNaN()]
df = pd.read_parquet('/home/remi/Downloads/data.parquet')
df = df[(df.index >= pd.Timestamp('2020-01-01')) & (df.index < pd.Timestamp('2023-01-01'))]
assets = ['BTC', 'ETH', 'ADA', 'XMR', 'EOS', 'MATIC', 'TRX', 'FTM', 'BNB', 'XLM', 'ENJ', 'CHZ',
          'BUSD', 'ATOM', 'LINK', 'ETC', 'XRP', 'BCH', 'LTC']
df = df[[c for c in df.columns if 'quote asset volume' in c and any(asset in c for asset in assets)]]
df.columns = [c.replace(' quote asset volume', '') for c in df.columns]
known_input_df = pd.DataFrame(
    index=df.index,
    data=np.array([
        df.reset_index()['group'].apply(lambda x: x.hour).values,
        df.reset_index()['group'].apply(lambda x: x.dayofweek).values,
    ]).T,
    columns=['hour', 'dayofweek'],
)
display(df)
display(known_input_df)
[Output: df — 26304 rows × 19 columns of hourly quote asset volumes (BTC, ADA, XMR, EOS, CHZ, MATIC, TRX, ENJ, FTM, BNB, XLM, BUSD, ATOM, LTC, LINK, ETC, ETH, XRP, BCH), indexed hourly ("group") from 2020-01-01 00:00:00 to 2022-12-31 23:00:00.]
[Output: known_input_df — 26304 rows × 2 columns (hour, dayofweek) on the same hourly index.]
class MinMaxScaler:
    def __init__(self, feature_axis=None, minmax_range=(0, 1)):
        """Initialize the MinMaxScaler.

        Args:
            feature_axis (int, optional): The axis that represents the feature dimension
                if applicable. Use only for 3D data to specify which axis is the feature axis.
                Default is None, automatically managed based on data dimensions.
        """
        self.feature_axis = feature_axis
        self.min_ = None
        self.max_ = None
        self.scale_ = None
        self.minmax_range = minmax_range  # Default range for scaling (min, max)

    def fit(self, X):
        """Fit the scaler to the data based on its dimensionality.

        Args:
            X (np.array): The data to fit the scaler on.
        """
        if X.ndim == 3 and self.feature_axis is not None:  # 3D data
            axis = tuple(i for i in range(X.ndim) if i != self.feature_axis)
            self.min_ = np.min(X, axis=axis)
            self.max_ = np.max(X, axis=axis)
        elif X.ndim == 2:  # 2D data
            self.min_ = np.min(X, axis=0)
            self.max_ = np.max(X, axis=0)
        elif X.ndim == 1:  # 1D data
            self.min_ = np.min(X)
            self.max_ = np.max(X)
        else:
            raise ValueError("Data must be 1D, 2D, or 3D.")
        self.scale_ = self.max_ - self.min_
        return self

    def transform(self, X):
        """Transform the data using the fitted scaler.

        Args:
            X (np.array): The data to transform.

        Returns:
            np.array: The scaled data.
        """
        X_scaled = (X - self.min_) / self.scale_
        X_scaled = X_scaled * (self.minmax_range[1] - self.minmax_range[0]) + self.minmax_range[0]
        return X_scaled

    def fit_transform(self, X):
        """Fit to data, then transform it.

        Args:
            X (np.array): The data to fit and transform.

        Returns:
            np.array: The scaled data.
        """
        return self.fit(X).transform(X)

    def inverse_transform(self, X_scaled):
        """Inverse transform the scaled data to original data.

        Args:
            X_scaled (np.array): The scaled data to inverse transform.

        Returns:
            np.array: The original data scale.
        """
        X = (X_scaled - self.minmax_range[0]) / (self.minmax_range[1] - self.minmax_range[0])
        X = X * self.scale_ + self.min_
        return X


def generate_data(df, sequence_length, n_ahead):
    # Case without known inputs
    scaler_df = df.copy().shift(n_ahead).rolling(24 * 14).median()
    tmp_df = df.copy() / scaler_df
    tmp_df = tmp_df.iloc[24 * 14 + n_ahead:].fillna(0.)
    scaler_df = scaler_df.iloc[24 * 14 + n_ahead:].fillna(0.)

    def prepare_sequences(df, scaler_df, n_history, n_future):
        X, y, y_scaler = [], [], []
        num_features = df.shape[1]
        # Iterate through the DataFrame to create sequences
        for i in range(n_history, len(df) - n_future + 1):
            # Extract the sequence of past observations
            X.append(df.iloc[i - n_history:i].values)
            # Extract the future values of the first column
            y.append(df.iloc[i:i + n_future, 0:1].values)
            y_scaler.append(scaler_df.iloc[i:i + n_future, 0:1].values)
        X, y, y_scaler = np.array(X), np.array(y), np.array(y_scaler)
        return X, y, y_scaler

    # Prepare sequences
    X, y, y_scaler = prepare_sequences(tmp_df, scaler_df, sequence_length, n_ahead)

    # Split the dataset into training and testing sets
    train_test_separation = int(len(X) * 0.8)
    X_train_unscaled, X_test_unscaled = X[:train_test_separation], X[train_test_separation:]
    y_train_unscaled, y_test_unscaled = y[:train_test_separation], y[train_test_separation:]
    y_scaler_train, y_scaler_test = y_scaler[:train_test_separation], y_scaler[train_test_separation:]

    # Generate the data
    X_scaler = MinMaxScaler(feature_axis=2)
    X_train = X_scaler.fit_transform(X_train_unscaled)
    X_test = X_scaler.transform(X_test_unscaled)

    y_scaler = MinMaxScaler(feature_axis=2)
    y_train = y_scaler.fit_transform(y_train_unscaled)
    y_test = y_scaler.transform(y_test_unscaled)

    y_train = y_train.reshape(y_train.shape[0], -1)
    y_test = y_test.reshape(y_test.shape[0], -1)

    return (X_scaler, X_train, X_test, X_train_unscaled, X_test_unscaled,
            y_scaler, y_train, y_test, y_train_unscaled, y_test_unscaled,
            y_scaler_train, y_scaler_test)


def generate_data_w_known_inputs(df, known_input_df, sequence_length, n_ahead):
    # Case with known inputs - fill the unknown features' future values in X with 0
    scaler_df = df.copy().shift(n_ahead).rolling(24 * 14).median()
    tmp_df = df.copy() / scaler_df
    tmp_df = tmp_df.iloc[24 * 14 + n_ahead:].fillna(0.)
    scaler_df = scaler_df.iloc[24 * 14 + n_ahead:].fillna(0.)
    tmp_known_input_df = known_input_df.iloc[24 * 14 + n_ahead:].copy()

    def prepare_sequences(df, known_input_df, scaler_df, n_history, n_future):
        Xu, Xk, y, y_scaler = [], [], [], []
        num_features = df.shape[1]
        # Iterate through the DataFrame to create sequences
        for i in range(n_history, len(df) - n_future + 1):
            # Extract the sequence of past observations
            Xu.append(np.concatenate((df.iloc[i - n_history:i].values,
                                      np.zeros((n_future, df.shape[1]))), axis=0))
            Xk.append(known_input_df.iloc[i - n_history:i + n_future].values)
            # Extract the future values of the first column
            y.append(df.iloc[i:i + n_future, 0:1].values)
            y_scaler.append(scaler_df.iloc[i:i + n_future, 0:1].values)
        Xu, Xk, y, y_scaler = np.array(Xu), np.array(Xk), np.array(y), np.array(y_scaler)
        return Xu, Xk, y, y_scaler

    # Prepare sequences
    Xu, Xk, y, y_scaler = prepare_sequences(tmp_df, tmp_known_input_df, scaler_df, sequence_length, n_ahead)
    X = np.concatenate((Xu, Xk), axis=-1)

    # Split the dataset into training and testing sets
    train_test_separation = int(len(X) * 0.8)
    X_train_unscaled, X_test_unscaled = X[:train_test_separation], X[train_test_separation:]
    y_train_unscaled, y_test_unscaled = y[:train_test_separation], y[train_test_separation:]
    y_scaler_train, y_scaler_test = y_scaler[:train_test_separation], y_scaler[train_test_separation:]

    # Generate the data
    X_scaler = MinMaxScaler(feature_axis=2)
    X_train = X_scaler.fit_transform(X_train_unscaled)
    X_test = X_scaler.transform(X_test_unscaled)

    y_scaler = MinMaxScaler(feature_axis=2)
    y_train = y_scaler.fit_transform(y_train_unscaled)
    y_test = y_scaler.transform(y_test_unscaled)

    y_train = y_train.reshape(y_train.shape[0], -1)
    y_test = y_test.reshape(y_test.shape[0], -1)

    return (X_scaler, X_train, X_test, X_train_unscaled, X_test_unscaled,
            y_scaler, y_train, y_test, y_train_unscaled, y_test_unscaled,
            y_scaler_train, y_scaler_test)


num_hidden = 100
num_heads = 4
num_embedding = 1
n_ahead = 15
sequence_length = 5 * n_ahead

(X_scaler, X_train, X_test, X_train_unscaled, X_test_unscaled,
 y_scaler, y_train, y_test, y_train_unscaled, y_test_unscaled,
 y_scaler_train, y_scaler_test) = generate_data_w_known_inputs(df, known_input_df, sequence_length, n_ahead)

X_train.shape, X_test.shape, y_train.shape, y_test.shape
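As a quick sanity check (assuming the pipeline above runs unchanged), each input sample should cover the past window plus the forecast horizon and stack the 19 observed series with the 2 known calendar features, while each target holds one value per future step:

# Expected: X is (n_samples, sequence_length + n_ahead, 19 + 2) and y is (n_samples, n_ahead)
assert X_train.shape[1:] == (sequence_length + n_ahead, df.shape[1] + known_input_df.shape[1])
assert y_train.shape[1] == n_ahead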
Exercise 1: Implementing a Normalization Layer with Addition
In this exercise, you will implement an AddAndNorm layer that combines addition and normalization operations. This layer represents a fundamental building block used in many neural network architectures to combine and normalize signals.
The layer should perform two operations in sequence:
1. Sum its inputs using an addition operation
2. Normalize the result using layer normalization
You need to implement a layer with three key methods:
- __init__: Initialize the layer and its parent class
- build: Set up the Add and LayerNormalization sublayers
- call: Define the forward pass computation
Here is the class structure to complete:
@keras.utils.register_keras_serializable(name="AddAndNorm")class AddAndNorm(Layer):def__init__(self, **kwargs):"""Initialize the AddAndNorm layer. Args: **kwargs: Additional keyword arguments for the Layer base class """# TODO: Initialize the parent class using super()passdef build(self, input_shape):"""Create and configure the Add and LayerNormalization sublayers. Args: input_shape: Shape of the input tensors """# TODO: Initialize an Add layer# TODO: Initialize a LayerNormalization layer# Note: The LayerNormalization layer should be built using the output shape from Addpassdef call(self, inputs):"""Perform the forward computation: addition followed by normalization. Args: inputs: Input tensors to be combined Returns: Normalized sum of the inputs """# TODO: Sum the inputs using the Add layer# TODO: Normalize the sum using LayerNormalizationpass
This layer serves as a basic building block for combining and normalizing signals in neural networks. The normalization helps stabilize the network’s behavior during training.
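For reference, one possible completed version is sketched below. It is named AddAndNormExample to keep it distinct from your own answer, and it is only one way to satisfy the exercise:

class AddAndNormExample(Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def build(self, input_shape):
        # Add expects a list of tensors with identical shapes
        self.add_layer = Add()
        self.add_layer.build(input_shape)
        output_shape = self.add_layer.compute_output_shape(input_shape)
        self.norm_layer = LayerNormalization()
        self.norm_layer.build(output_shape)

    def call(self, inputs):
        # Sum the inputs, then normalize the result
        return self.norm_layer(self.add_layer(inputs))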
Exercise 2: Implementing a Gated Linear Unit (GLU)
In this exercise, you will implement a Gated Linear Unit (GLU), which is a neural network component that controls information flow through learnable gates. The GLU performs element-wise multiplication between a linear transformation of the input and a sigmoid-gated version of the same input.
The core concept is to let the network learn which information should pass through by creating two parallel paths:
1. A linear transformation that produces candidate values
2. A sigmoid-activated transformation that produces gates between 0 and 1
The final output is the element-wise product of these two paths, allowing the network to selectively pass or block information.
Complete the implementation of the Gate layer:
@keras.utils.register_keras_serializable(name="Gate")class Gate(Layer):def__init__(self, hidden_layer_size=None, **kwargs):"""Initialize the Gate layer. Args: hidden_layer_size: Dimension of the output. If None, uses input dimension **kwargs: Additional keyword arguments for the Layer base class """# TODO: Call parent class initialization# TODO: Store hidden_layer_size parameterpassdef build(self, input_shape):"""Create the layer's weights. Args: input_shape: Shape of the input tensor This method should create: 1. A Dense layer for the linear transformation (no activation) 2. A Dense layer with sigmoid activation for the gates Both layers should output tensors with hidden_layer_size units """# TODO: Set hidden_layer_size to input dimension if not specified# TODO: Create dense layer for linear transformation# TODO: Create dense layer for gates with sigmoid activation# TODO: Build both layers with proper input shapepassdef call(self, inputs):"""Forward pass computation. Args: inputs: Input tensor Returns: Element-wise product of the linear transformation and sigmoid gates """# TODO: Apply linear transformation to create candidate values# TODO: Apply sigmoid-activated transformation to create gates# TODO: Return element-wise multiplication of candidates and gatespass
When implemented correctly, this layer will:
1. Transform the input through two parallel dense layers
2. Create candidate values through a linear transformation on one path
3. Create gates through a sigmoid transformation on the other path
4. Modulate the candidate values by the computed gates to produce the final output
The gating mechanism allows the network to learn which features are important for the task at hand, effectively acting as a learned feature selector.
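One possible completed implementation, shown here as an illustrative GateExample class (a sketch, not the only valid solution):

class GateExample(Layer):
    def __init__(self, hidden_layer_size=None, **kwargs):
        super().__init__(**kwargs)
        self.hidden_layer_size = hidden_layer_size

    def build(self, input_shape):
        # Default to the input feature dimension when no size is given
        if self.hidden_layer_size is None:
            self.hidden_layer_size = input_shape[-1]
        self.dense_layer = Dense(self.hidden_layer_size)                       # candidate values
        self.gated_layer = Dense(self.hidden_layer_size, activation='sigmoid')  # gates in [0, 1]
        self.dense_layer.build(input_shape)
        self.gated_layer.build(input_shape)

    def call(self, inputs):
        # Element-wise product of candidates and gates
        return self.dense_layer(inputs) * self.gated_layer(inputs)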
Exercise 3: Implementing a Gated Residual Network (GRN)
In this exercise, you will implement a Gated Residual Network layer. The GRN combines several key concepts we’ve covered to create a powerful neural network building block. This architecture processes input through two paths:
A direct path that applies a linear transformation (skip connection)
A main path that applies multiple transformations and gating
The GRN processes its input through these paths and combines them using addition and normalization. This architecture is particularly effective for capturing complex patterns while maintaining stable training dynamics.
Complete the implementation of the GRN layer:
@keras.utils.register_keras_serializable(name="GRN")class GRN(Layer):def__init__(self, hidden_layer_size, output_size=None, **kwargs):"""Initialize the Gated Residual Network layer. Args: hidden_layer_size: Number of units in the hidden layers output_size: Number of output units. If None, matches hidden_layer_size **kwargs: Additional keyword arguments for the Layer base class """# TODO: Call parent class initialization# TODO: Store hidden_layer_size and output_size parameterspassdef build(self, input_shape):"""Create and configure all sublayers. Args: input_shape: Shape of the input tensor This method should initialize: 1. A skip connection (Dense layer) 2. Two hidden layers (Dense with ELU and Dense) 3. A Gate layer 4. An AddAndNorm layer """# TODO: Set output_size if not specified# TODO: Create skip connection Dense layer# TODO: Create first hidden Dense layer with ELU activation# TODO: Create second hidden Dense layer# TODO: Create Gate layer# TODO: Create AddAndNorm layer# TODO: Build all layers with appropriate shapespassdef call(self, inputs):"""Forward pass computation. Args: inputs: Input tensor Returns: Processed tensor combining both paths through addition and normalization """# TODO: Process input through skip connection# TODO: Process input through hidden layers# TODO: Apply gating to hidden layer output# TODO: Combine and normalize skip connection and gated outputpass
When implemented correctly, this layer will:
1. Transform the input through a skip connection to match the desired output size
2. Process the input through two dense layers with nonlinear activation
3. Apply gating to control information flow
4. Combine the gated output with the skip connection using addition and normalization
The GRN architecture allows the network to learn complex transformations while maintaining good gradient flow and stable training characteristics. The combination of gating and skip connections makes it particularly effective for deep architectures.
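An illustrative completion, reusing the GateExample and AddAndNormExample sketches above (sublayers are built lazily on first call to keep the sketch short):

class GRNExample(Layer):
    def __init__(self, hidden_layer_size, output_size=None, **kwargs):
        super().__init__(**kwargs)
        self.hidden_layer_size = hidden_layer_size
        self.output_size = output_size

    def build(self, input_shape):
        if self.output_size is None:
            self.output_size = self.hidden_layer_size
        self.skip_layer = Dense(self.output_size)                              # skip connection
        self.hidden_layer_1 = Dense(self.hidden_layer_size, activation='elu')  # nonlinear path
        self.hidden_layer_2 = Dense(self.hidden_layer_size)
        self.gate_layer = GateExample(self.output_size)
        self.add_and_norm_layer = AddAndNormExample()

    def call(self, inputs):
        skip = self.skip_layer(inputs)
        hidden = self.hidden_layer_1(inputs)
        hidden = self.hidden_layer_2(hidden)
        gated = self.gate_layer(hidden)
        # Residual combination of the skip path and the gated main path
        return self.add_and_norm_layer([skip, gated])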
Exercise 4: Implementing the Input Embedding Layer
This exercise focuses on creating an embedding layer that handles multivariate time series data in a unique way. Unlike traditional embedding layers used in NLP that map discrete tokens to vectors, this layer learns separate transformations for each continuous feature in our time series.
Let’s understand what this layer does through an example: If we have a time series with 3 features (e.g., price, volume, and volatility) over 100 time steps, our input tensor would have shape (batch_size, 100, 3). For each of these features, we want to learn a separate transformation that captures its unique characteristics. If we set num_hidden to 32, each feature will be transformed into a 32-dimensional representation, while preserving its temporal structure.
Complete the implementation below:
@keras.utils.register_keras_serializable(name="EmbeddingLayer")class EmbeddingLayer(Layer):def__init__(self, num_hidden, **kwargs):"""Initialize the Embedding Layer. Args: num_hidden: Dimension of the transformed representation for each feature **kwargs: Additional keyword arguments for the Layer base class """# TODO: Call parent class initialization# TODO: Store the embedding dimension parameterpassdef build(self, input_shape):"""Create independent transformation layers for each input feature. For a multivariate time series with n features, this method creates n independent dense layers. Each layer will transform its respective feature from a scalar to a vector of size num_hidden. Args: input_shape: Tuple (batch_size, time_steps, n_features) """# TODO: Create a dense layer for each feature in the time series# TODO: Configure each dense layer to handle its specific featurepassdef call(self, inputs):"""Transform each feature independently. For each time series feature, this method: 1. Extracts the feature across all time steps 2. Applies its dedicated transformation 3. Stacks all transformed features together Args: inputs: Tensor of shape (batch_size, time_steps, n_features) Returns: Tensor of shape (batch_size, time_steps, num_hidden, n_features) """# TODO: Process each feature through its dedicated dense layer# TODO: Stack transformed features into final output tensorpass
The key insight of this layer is that it preserves feature independence while enriching their representations. Unlike a standard dense layer that would mix all features together, this layer ensures that each feature’s temporal patterns are processed separately before any interaction between features occurs in subsequent layers. This design is particularly valuable for time series data where different features may have very different statistical properties and temporal dynamics.
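An illustrative completion (named EmbeddingLayerExample to keep it separate from your answer) that projects each feature independently and stacks the results with keras.ops:

class EmbeddingLayerExample(Layer):
    def __init__(self, num_hidden, **kwargs):
        super().__init__(**kwargs)
        self.num_hidden = num_hidden

    def build(self, input_shape):
        # One independent Dense projection per input feature
        self.dense_layers = [Dense(self.num_hidden) for _ in range(input_shape[-1])]

    def call(self, inputs):
        # inputs: (batch, time, n_features)
        embedded = [
            dense(inputs[:, :, i:i + 1])  # (batch, time, num_hidden)
            for i, dense in enumerate(self.dense_layers)
        ]
        # Stack on a trailing axis -> (batch, time, num_hidden, n_features)
        return ops.stack(embedded, axis=-1)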
Part 2: Building the Complete Model
Now that we have our building blocks, let’s create the complete model using Keras’s functional API. We’ll provide you with the VariableSelectionNetwork implementation, and you’ll need to assemble everything together.
Exercise 5: Building a Transformer-Based Model for Time Series Forecasting
This exercise focuses on creating a transformer architecture specifically designed for time series forecasting. A key characteristic of the model is that it treats “known” and “unknown” features differently when making predictions.
Let’s understand the key concepts:
Known vs Unknown Features
The model processes two types of features:
- Unknown features: values that are only available up to the present (like stock prices or trading volumes)
- Known features: values that are available for future time points (like hour of day or day of week)
Time Windows
The model works with two temporal segments:
- Historical window: past observations of length sequence_length
- Future window: future period of length n_ahead where we want to make predictions
Encoder-Decoder Architecture
The architecture uses an encoder-decoder pattern with LSTM networks:
- Encoder: processes the historical data to create a representation of past patterns
- Decoder: uses this representation along with known future features to generate predictions
Complete this implementation that assembles all the components we’ve built:
def create_transformer_model(sequence_length, num_unknow_features, num_know_features,
                             num_embedding, num_hidden, num_heads, n_ahead):
    """Build a transformer-based forecasting model.

    The model processes the input through several stages:
    1. Feature embedding and separation
    2. Variable selection for past and future sequences
    3. LSTM-based encoding and decoding
    4. Attention mechanism for temporal patterns
    5. Final prediction generation

    Args:
        sequence_length: Number of past time steps to use
        num_unknow_features: Number of observed features (e.g., trading volumes)
        num_know_features: Number of known features (e.g., hour, day of week)
        num_embedding: Size of the embedding dimension
        num_hidden: Number of hidden units in layers
        num_heads: Number of attention heads
        n_ahead: Number of future time steps to predict

    Returns:
        keras.Model: Assembled transformer model
    """
    # TODO: Create input layer that accepts both past and future time steps
    # TODO: Apply embedding to transform all features
    # TODO: Split embedded features into past and future sequences
    # Note: Future sequence only includes known features
    # TODO: Apply variable selection to both sequences independently
    # TODO: Create and apply LSTM encoder to past sequence
    # Important: Capture encoder states for decoder initialization
    # TODO: Create and apply LSTM decoder to future sequence
    # Important: Initialize with encoder states
    # TODO: Combine encoder and decoder outputs to maintain full history
    # TODO: Add gated residual connection between LSTM and variable selection paths
    # TODO: Apply GRN transformation
    # TODO: Apply multi-head attention to capture temporal dependencies
    # TODO: Create final prediction layer
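For orientation, here is one possible wiring of the pieces. This is a sketch only, not the reference solution: it reuses the illustrative *Example layers from the earlier sketches, and it assumes the VariableSelectionNetwork provided with the assignment maps the embedded tensor to shape (batch, time, num_hidden). Your own assembly may differ, for instance in how the outputs are sliced and projected.

def create_transformer_model_sketch(sequence_length, num_unknow_features, num_know_features,
                                    num_embedding, num_hidden, num_heads, n_ahead):
    inputs = Input(shape=(sequence_length + n_ahead, num_unknow_features + num_know_features))

    # Embed every feature independently: (batch, time, num_embedding, n_features)
    embedded = EmbeddingLayerExample(num_embedding)(inputs)

    # Past window keeps all features; future window keeps only the known ones (last columns)
    past = Lambda(lambda x: x[:, :sequence_length, :, :])(embedded)
    future = Lambda(lambda x: x[:, sequence_length:, :, -num_know_features:])(embedded)

    # Variable selection (assumed interface: returns (batch, time, num_hidden))
    past_features = VariableSelectionNetwork(num_hidden)(past)
    future_features = VariableSelectionNetwork(num_hidden)(future)

    # LSTM encoder over the past; its final states initialise the decoder
    encoder_out, state_h, state_c = LSTM(num_hidden, return_sequences=True, return_state=True)(past_features)
    decoder_out = LSTM(num_hidden, return_sequences=True)(future_features, initial_state=[state_h, state_c])

    # Re-assemble the full timeline and add a gated residual connection to the selection path
    lstm_out = Concatenate(axis=1)([encoder_out, decoder_out])
    selected = Concatenate(axis=1)([past_features, future_features])
    x = AddAndNormExample()([GateExample()(lstm_out), selected])

    # GRN enrichment followed by temporal self-attention with another gated residual
    enriched = GRNExample(num_hidden)(x)
    attended = MultiHeadAttention(num_heads=num_heads, key_dim=num_hidden)(enriched, enriched)
    attended = AddAndNormExample()([GateExample()(attended), enriched])

    # Final prediction: one value per future step
    outputs = Dense(n_ahead)(Flatten()(attended))
    return Model(inputs=inputs, outputs=outputs)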
Exercise 6: Training and Evaluation
Now let’s put everything together and train the model. Complete this code to:
1. Create the model with appropriate parameters
2. Compile it with suitable loss and optimizer
3. Train it on our cryptocurrency data
4. Evaluate its performance
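A minimal sketch of what this cell could look like, assuming your create_transformer_model from Exercise 5 and the callbacks defined in the setup cell; the optimizer, loss, and validation split are illustrative choices rather than required ones:

model = create_transformer_model(
    sequence_length=sequence_length,
    num_unknow_features=df.shape[1],            # 19 observed volume series
    num_know_features=known_input_df.shape[1],  # hour and day of week
    num_embedding=num_embedding,
    num_hidden=num_hidden,
    num_heads=num_heads,
    n_ahead=n_ahead,
)
model.compile(optimizer='adam', loss='mse')

history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=N_MAX_EPOCHS,
    batch_size=BATCH_SIZE,
    callbacks=callbacks(),
    shuffle=True,
    verbose=1,
)

# Evaluate on the scaled test targets
preds = model.predict(X_test)
print('R2 (scaled):', r2_score(y_test, preds))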