
Practical Assignment: Building a Transformer-Based Architecture for Time Series Forecasting

Author: Remi Genet

Published: 2025-04-03

!pip install pandas numpy matplotlib jax keras scikit-learn pyarrow

Introduction

In this practical assignment, you will implement a transformer-based architecture for multivariate time series forecasting. We will use cryptocurrency trading volume data to predict future values. The architecture is inspired by the Temporal Fusion Transformer but adapted to the task.

We’ll break down the implementation into manageable components, focusing on the core concepts we’ve covered: residual connections, gating mechanisms, and attention.

Dataset: The data is available at https://github.com/remigenet/TKAN/blob/main/examples/data.parquet and contains hourly trading volumes for various cryptocurrencies.
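If you want to fetch the file programmatically, here is a minimal sketch; note that the raw-file URL is inferred from the GitHub link above and is an assumption, not part of the assignment:

import urllib.request

# Hypothetical raw-file URL derived from the repository link above
DATA_URL = "https://raw.githubusercontent.com/remigenet/TKAN/main/examples/data.parquet"
urllib.request.urlretrieve(DATA_URL, "data.parquet")  # saves next to the notebook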

Setup

First, let’s import our required libraries and load the data:

import os
BACKEND = 'jax'
os.environ['KERAS_BACKEND'] = BACKEND

import time
import numpy as np
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt

import keras
from keras import ops, random
from keras.models import Sequential, Model
from keras.layers import (
    Layer, Input, Dense, Flatten, GRU, LSTM, Add, LayerNormalization,
    Multiply, Reshape, Activation, TimeDistributed, Lambda,
    MultiHeadAttention, Concatenate,
)

from sklearn.metrics import r2_score

# TKAN is used at the end of the assignment; install it first with `pip install tkan`
from tkan import TKAN

N_MAX_EPOCHS = 100
BATCH_SIZE = 128
# Note: use keras.callbacks directly (not tf.keras) since we run on the jax backend
early_stopping_callback = lambda : keras.callbacks.EarlyStopping(
    monitor="val_loss",
    min_delta=0.00001,
    patience=6,
    mode="min",
    restore_best_weights=True,
    start_from_epoch=6,
)
lr_callback = lambda : keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",
    factor=0.25,
    patience=3,
    mode="min",
    min_delta=0.00001,
    min_lr=0.000025,
    verbose=0,
)
callbacks = lambda : [early_stopping_callback(), lr_callback(), keras.callbacks.TerminateOnNaN()]
df = pd.read_parquet('data.parquet')  # adjust the path to wherever you saved the dataset
df = df[(df.index >= pd.Timestamp('2020-01-01')) & (df.index < pd.Timestamp('2023-01-01'))]
assets = ['BTC', 'ETH', 'ADA', 'XMR', 'EOS', 'MATIC', 'TRX', 'FTM', 'BNB', 'XLM', 'ENJ', 'CHZ', 'BUSD', 'ATOM', 'LINK', 'ETC', 'XRP', 'BCH', 'LTC']
df = df[[c for c in df.columns if 'quote asset volume' in c and any(asset in c for asset in assets)]]
df.columns = [c.replace(' quote asset volume', '') for c in df.columns]
known_input_df = pd.DataFrame(
    index=df.index,
    data={'hour': df.index.hour, 'dayofweek': df.index.dayofweek},
)
display(df)
display(known_input_df)
BTC ADA XMR EOS CHZ MATIC TRX ENJ FTM BNB XLM BUSD ATOM LTC LINK ETC ETH XRP BCH
group
2020-01-01 00:00:00 3.675857e+06 38189.176211 4.539598e+04 94778.577031 817.146319 31003.791035 481993.354990 15241.945783 1165.788613 8.498617e+05 9460.819556 1.352376e+04 31986.972694 1.165827e+05 24281.170262 56488.402352 1.000930e+06 2.579254e+05 178258.749391
2020-01-01 01:00:00 6.365953e+06 51357.010954 3.348395e+04 593292.135445 886.460339 84465.335718 533668.554562 11896.843688 413.844612 7.405759e+05 37141.909518 2.531605e+04 81777.666046 2.830715e+05 51190.975142 182102.074213 1.474278e+06 4.520609e+05 615321.025242
2020-01-01 02:00:00 4.736719e+06 36164.263914 1.573255e+04 266732.556000 1819.795050 113379.718506 387049.986770 30109.770521 3559.965968 1.039091e+06 16878.822627 1.390886e+04 195731.175551 2.402871e+05 28721.756184 134063.422732 9.940256e+05 4.414948e+05 221535.645771
2020-01-01 03:00:00 5.667367e+06 24449.953815 2.575105e+04 124516.579473 2979.655803 41771.707995 450772.139235 6732.833578 4076.415482 4.975018e+05 9049.223394 2.251969e+04 120113.343316 1.613043e+05 29596.222534 131094.172168 6.473610e+05 1.886061e+05 397185.950571
2020-01-01 04:00:00 3.379094e+06 44502.669843 6.295563e+04 421819.671410 1023.388675 22254.756114 284788.973752 846.938455 633.367505 4.751285e+05 7254.260203 1.122460e+04 19989.169106 2.214516e+05 54514.370016 134937.122201 4.430067e+05 2.279373e+05 316499.137509
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2022-12-31 19:00:00 6.704605e+07 581680.400510 3.873989e+05 48359.865300 199491.822800 890911.573610 225136.420055 40281.859330 159553.944500 9.889098e+05 39230.588600 6.560756e+06 180809.784710 9.964355e+05 190664.976300 181340.756100 7.738029e+06 1.413563e+06 35409.149500
2022-12-31 20:00:00 4.344849e+07 323561.762270 1.379392e+05 37858.704700 173057.240300 333511.762200 157069.026827 42228.830930 270251.374500 6.032059e+05 52964.531800 7.255324e+06 276013.421720 1.173164e+06 265727.950340 90513.087600 4.278879e+06 1.113527e+06 42674.516600
2022-12-31 21:00:00 5.992803e+07 455185.698060 2.445869e+05 79538.050600 107544.609700 525037.759990 180404.744820 27446.620810 198885.610000 1.386864e+06 44485.594800 8.712142e+06 476151.071190 6.820723e+05 265687.852060 85399.066100 4.643401e+06 1.373231e+06 38027.858800
2022-12-31 22:00:00 1.106669e+08 763590.393960 1.486058e+06 119816.048000 227360.873900 940456.693720 378767.904610 37877.840280 179819.382700 1.387985e+06 43947.636100 5.517835e+06 558172.963150 8.838422e+05 678706.662170 319377.553800 1.143952e+07 3.006036e+06 50730.511300
2022-12-31 23:00:00 8.217183e+07 916995.764890 1.292763e+05 101125.121600 181794.887300 615235.478960 201939.176223 27857.744420 305403.731200 1.069728e+06 80571.059200 6.251872e+06 320158.642350 9.486568e+05 546064.757390 135623.176400 7.086714e+06 8.907798e+06 85765.409200

26304 rows × 19 columns

hour dayofweek
group
2020-01-01 00:00:00 0 2
2020-01-01 01:00:00 1 2
2020-01-01 02:00:00 2 2
2020-01-01 03:00:00 3 2
2020-01-01 04:00:00 4 2
... ... ...
2022-12-31 19:00:00 19 5
2022-12-31 20:00:00 20 5
2022-12-31 21:00:00 21 5
2022-12-31 22:00:00 22 5
2022-12-31 23:00:00 23 5

26304 rows × 2 columns

class MinMaxScaler:
    def __init__(self, feature_axis=None, minmax_range=(0, 1)):
        """
        Initialize the MinMaxScaler.
        Args:
        feature_axis (int, optional): The axis that represents the feature dimension if applicable.
                                      Use only for 3D data to specify which axis is the feature axis.
                                      Default is None, automatically managed based on data dimensions.
        """
        self.feature_axis = feature_axis
        self.min_ = None
        self.max_ = None
        self.scale_ = None
        self.minmax_range = minmax_range # Default range for scaling (min, max)

    def fit(self, X):
        """
        Fit the scaler to the data based on its dimensionality.
        Args:
        X (np.array): The data to fit the scaler on.
        """
        if X.ndim == 3 and self.feature_axis is not None:  # 3D data
            axis = tuple(i for i in range(X.ndim) if i != self.feature_axis)
            self.min_ = np.min(X, axis=axis)
            self.max_ = np.max(X, axis=axis)
        elif X.ndim == 2:  # 2D data
            self.min_ = np.min(X, axis=0)
            self.max_ = np.max(X, axis=0)
        elif X.ndim == 1:  # 1D data
            self.min_ = np.min(X)
            self.max_ = np.max(X)
        else:
            raise ValueError("Data must be 1D, 2D, or 3D.")

        self.scale_ = self.max_ - self.min_
        return self

    def transform(self, X):
        """
        Transform the data using the fitted scaler.
        Args:
        X (np.array): The data to transform.
        Returns:
        np.array: The scaled data.
        """
        X_scaled = (X - self.min_) / self.scale_
        X_scaled = X_scaled * (self.minmax_range[1] - self.minmax_range[0]) + self.minmax_range[0]
        return X_scaled

    def fit_transform(self, X):
        """
        Fit to data, then transform it.
        Args:
        X (np.array): The data to fit and transform.
        Returns:
        np.array: The scaled data.
        """
        return self.fit(X).transform(X)

    def inverse_transform(self, X_scaled):
        """
        Inverse transform the scaled data to original data.
        Args:
        X_scaled (np.array): The scaled data to inverse transform.
        Returns:
        np.array: The original data scale.
        """
        X = (X_scaled - self.minmax_range[0]) / (self.minmax_range[1] - self.minmax_range[0])
        X = X * self.scale_ + self.min_
        return X
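To make the scaler’s behaviour concrete, here is a small illustrative round-trip on toy 2D data (not part of the assignment):

scaler = MinMaxScaler()
toy = np.array([[1., 10.], [2., 20.], [3., 30.]])
scaled = scaler.fit_transform(toy)        # each column mapped to [0, 1]
assert np.allclose(scaler.inverse_transform(scaled), toy)
print(scaled)  # [[0. 0.], [0.5 0.5], [1. 1.]]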

def generate_data(df, sequence_length, n_ahead):
    #Case without known inputs
    scaler_df = df.copy().shift(n_ahead).rolling(24 * 14).median()
    tmp_df = df.copy() / scaler_df
    tmp_df = tmp_df.iloc[24 * 14 + n_ahead:].fillna(0.)
    scaler_df = scaler_df.iloc[24 * 14 + n_ahead:].fillna(0.)
    def prepare_sequences(df, scaler_df, n_history, n_future):
        X, y, y_scaler = [], [], []
        num_features = df.shape[1]
        
        # Iterate through the DataFrame to create sequences
        for i in range(n_history, len(df) - n_future + 1):
            # Extract the sequence of past observations
            X.append(df.iloc[i - n_history:i].values)
            # Extract the future values of the first column
            y.append(df.iloc[i:i + n_future,0:1].values)
            y_scaler.append(scaler_df.iloc[i:i + n_future,0:1].values)
        
        X, y, y_scaler = np.array(X), np.array(y), np.array(y_scaler)
        return X, y, y_scaler
    
    # Prepare sequences
    X, y, y_scaler = prepare_sequences(tmp_df, scaler_df, sequence_length, n_ahead)
    
    # Split the dataset into training and testing sets
    train_test_separation = int(len(X) * 0.8)
    X_train_unscaled, X_test_unscaled = X[:train_test_separation], X[train_test_separation:]
    y_train_unscaled, y_test_unscaled = y[:train_test_separation], y[train_test_separation:]
    y_scaler_train, y_scaler_test = y_scaler[:train_test_separation], y_scaler[train_test_separation:]
    
    # Generate the data
    X_scaler = MinMaxScaler(feature_axis=2)
    X_train = X_scaler.fit_transform(X_train_unscaled)
    X_test = X_scaler.transform(X_test_unscaled)
    
    y_scaler = MinMaxScaler(feature_axis=2)
    y_train = y_scaler.fit_transform(y_train_unscaled)
    y_test = y_scaler.transform(y_test_unscaled)
    
    y_train = y_train.reshape(y_train.shape[0], -1) 
    y_test = y_test.reshape(y_test.shape[0], -1)
    return X_scaler, X_train, X_test, X_train_unscaled, X_test_unscaled, y_scaler, y_train, y_test, y_train_unscaled, y_test_unscaled, y_scaler_train, y_scaler_test

def generate_data_w_known_inputs(df, known_input_df, sequence_length, n_ahead):
    # Case with known inputs - fill with 0 the future values of the unknown features in X
    scaler_df = df.copy().shift(n_ahead).rolling(24 * 14).median()
    tmp_df = df.copy() / scaler_df
    tmp_df = tmp_df.iloc[24 * 14 + n_ahead:].fillna(0.)
    scaler_df = scaler_df.iloc[24 * 14 + n_ahead:].fillna(0.)
    tmp_known_input_df = known_input_df.iloc[24 * 14 + n_ahead:].copy()
    def prepare_sequences(df, known_input_df, scaler_df, n_history, n_future):
        Xu, Xk, y, y_scaler = [], [], [], []
        num_features = df.shape[1]
        
        # Iterate through the DataFrame to create sequences
        for i in range(n_history, len(df) - n_future + 1):
            # Extract the sequence of past observations
            Xu.append(np.concatenate((df.iloc[i - n_history:i].values, np.zeros((n_future, df.shape[1]))), axis=0))
            Xk.append(known_input_df.iloc[i - n_history:i+n_future].values)
            # Extract the future values of the first column
            y.append(df.iloc[i:i + n_future,0:1].values)
            y_scaler.append(scaler_df.iloc[i:i + n_future,0:1].values)
        
        Xu, Xk, y, y_scaler = np.array(Xu), np.array(Xk), np.array(y), np.array(y_scaler)
        return Xu, Xk, y, y_scaler
    
    # Prepare sequences
    Xu, Xk, y, y_scaler = prepare_sequences(tmp_df, tmp_known_input_df, scaler_df, sequence_length, n_ahead)

    X = np.concatenate((Xu, Xk), axis=-1)
    
    # Split the dataset into training and testing sets
    train_test_separation = int(len(X) * 0.8)
    X_train_unscaled, X_test_unscaled = X[:train_test_separation], X[train_test_separation:]
    y_train_unscaled, y_test_unscaled = y[:train_test_separation], y[train_test_separation:]
    y_scaler_train, y_scaler_test = y_scaler[:train_test_separation], y_scaler[train_test_separation:]
    
    # Generate the data
    X_scaler = MinMaxScaler(feature_axis=2)
    X_train = X_scaler.fit_transform(X_train_unscaled)
    X_test = X_scaler.transform(X_test_unscaled)
    
    y_scaler = MinMaxScaler(feature_axis=2)
    y_train = y_scaler.fit_transform(y_train_unscaled)
    y_test = y_scaler.transform(y_test_unscaled)
    
    y_train = y_train.reshape(y_train.shape[0], -1) 
    y_test = y_test.reshape(y_test.shape[0], -1)
    return X_scaler, X_train, X_test, X_train_unscaled, X_test_unscaled, y_scaler, y_train, y_test, y_train_unscaled, y_test_unscaled, y_scaler_train, y_scaler_test

num_hidden = 100
num_heads = 4
num_embedding = 1
n_ahead = 15
sequence_length = 5 * n_ahead

X_scaler, X_train, X_test, X_train_unscaled, X_test_unscaled, y_scaler, y_train, y_test, y_train_unscaled, y_test_unscaled, y_scaler_train, y_scaler_test = generate_data_w_known_inputs(df, known_input_df, sequence_length, n_ahead)

X_train.shape, X_test.shape, y_train.shape, y_test.shape
((20691, 90, 21), (5173, 90, 21), (20691, 15), (5173, 15))
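These shapes decompose as 90 = sequence_length (75 past steps) + n_ahead (15 future steps) and 21 = 19 unknown asset features + 2 known calendar features, which a quick check confirms:

assert X_train.shape[1] == sequence_length + n_ahead   # 75 + 15 = 90 time steps
assert X_train.shape[2] == len(assets) + 2             # 19 volumes + hour + dayofweek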

Exercise 1: Implementing a Normalization Layer with Addition

In this exercise, you will implement an AddAndNorm layer that combines addition and normalization operations. This layer represents a fundamental building block used in many neural network architectures to combine and normalize signals.

The layer should perform two operations in sequence:

1. Sum its inputs using an addition operation
2. Normalize the result using layer normalization

You need to implement a layer with three key methods:

- __init__: Initialize the layer and its parent class
- build: Set up the Add and LayerNormalization sublayers
- call: Define the forward pass computation

Here is the class structure to complete:

@keras.utils.register_keras_serializable(name="AddAndNorm")
class AddAndNorm(Layer):
    def __init__(self, **kwargs):
        """Initialize the AddAndNorm layer.
        
        Args:
            **kwargs: Additional keyword arguments for the Layer base class
        """
        # TODO: Initialize the parent class using super()
        pass
    
    def build(self, input_shape):
        """Create and configure the Add and LayerNormalization sublayers.
        
        Args:
            input_shape: Shape of the input tensors
        """
        # TODO: Initialize an Add layer
        # TODO: Initialize a LayerNormalization layer
        # Note: The LayerNormalization layer should be built using the output shape from Add
        pass
    
    def call(self, inputs):
        """Perform the forward computation: addition followed by normalization.
        
        Args:
            inputs: Input tensors to be combined
            
        Returns:
            Normalized sum of the inputs
        """
        # TODO: Sum the inputs using the Add layer
        # TODO: Normalize the sum using LayerNormalization
        pass

This layer serves as a basic building block for combining and normalizing signals in neural networks. The normalization helps stabilize the network’s behavior during training. A reference implementation:

@keras.utils.register_keras_serializable(name="AddAndNorm")
class AddAndNorm(Layer):
    def __init__(self, **kwargs):
        super(AddAndNorm, self).__init__(**kwargs)

    def build(self, input_shape):
        self.add_layer = Add()
        self.add_layer.build(input_shape)
        self.norm_layer = LayerNormalization()
        self.norm_layer.build(self.add_layer.compute_output_shape(input_shape))
    
    def call(self, inputs):
        tmp = self.add_layer(inputs)
        tmp = self.norm_layer(tmp)
        return tmp

    def compute_output_shape(self, input_shape):
        return input_shape[0]  # Assuming all input shapes are the same

    def get_config(self):
        config = super().get_config()
        return config
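A quick shape check of the layer (illustrative; the random tensors are arbitrary):

x1 = random.normal((4, 10, 8))
x2 = random.normal((4, 10, 8))
out = AddAndNorm()([x1, x2])
print(out.shape)  # (4, 10, 8): same shape as each input, summed then normalized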

Exercise 2: Implementing a Gated Linear Unit (GLU)

In this exercise, you will implement a Gated Linear Unit (GLU), which is a neural network component that controls information flow through learnable gates. The GLU performs element-wise multiplication between a linear transformation of the input and a sigmoid-gated version of the same input.

The core concept is to let the network learn which information should pass through by creating two parallel paths:

1. A linear transformation that produces candidate values
2. A sigmoid-activated transformation that produces gates between 0 and 1

The final output is the element-wise product of these two paths, allowing the network to selectively pass or block information.

Complete the implementation of the Gate layer:

@keras.utils.register_keras_serializable(name="Gate")
class Gate(Layer):
    def __init__(self, hidden_layer_size=None, **kwargs):
        """Initialize the Gate layer.
        
        Args:
            hidden_layer_size: Dimension of the output. If None, uses input dimension
            **kwargs: Additional keyword arguments for the Layer base class
        """
        # TODO: Call parent class initialization
        # TODO: Store hidden_layer_size parameter
        pass
    
    def build(self, input_shape):
        """Create the layer's weights.
        
        Args:
            input_shape: Shape of the input tensor
            
        This method should create:
        1. A Dense layer for the linear transformation (no activation)
        2. A Dense layer with sigmoid activation for the gates
        Both layers should output tensors with hidden_layer_size units
        """
        # TODO: Set hidden_layer_size to input dimension if not specified
        # TODO: Create dense layer for linear transformation
        # TODO: Create dense layer for gates with sigmoid activation
        # TODO: Build both layers with proper input shape
        pass
    
    def call(self, inputs):
        """Forward pass computation.
        
        Args:
            inputs: Input tensor
            
        Returns:
            Element-wise product of the linear transformation and sigmoid gates
        """
        # TODO: Apply linear transformation to create candidate values
        # TODO: Apply sigmoid-activated transformation to create gates
        # TODO: Return element-wise multiplication of candidates and gates
        pass

@keras.utils.register_keras_serializable(name="GRN")
class Gate(Layer):
    def __init__(self, hidden_layer_size = None, **kwargs):
        super(Gate, self).__init__(**kwargs)
        self.hidden_layer_size = hidden_layer_size
        

    def build(self, input_shape):
        if self.hidden_layer_size is None:
            self.hidden_layer_size = input_shape[-1]
        self.dense_layer = Dense(self.hidden_layer_size)
        self.gated_layer = Dense(self.hidden_layer_size, activation='sigmoid')
        self.dense_layer.build(input_shape)
        self.gated_layer.build(input_shape)

    def call(self, inputs):
        dense_output = self.dense_layer(inputs)
        gated_output = self.gated_layer(inputs)
        return ops.multiply(dense_output, gated_output)

    def compute_output_shape(self, input_shape):
        return input_shape[:-1] + (self.hidden_layer_size,)

    def get_config(self):
        config = super().get_config()
        config.update({
            'hidden_layer_size': self.hidden_layer_size,
        })
        return config

When implemented correctly, this layer will:

1. Transform the input through two parallel dense layers
2. One path creates candidate values through a linear transformation
3. The other path creates gates through a sigmoid transformation
4. The final output modulates the candidate values by the computed gates

The gating mechanism allows the network to learn which features are important for the task at hand, effectively acting as a learned feature selector.
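A quick illustrative check: the Gate keeps the batch and time axes and maps the feature axis to hidden_layer_size.

x = random.normal((4, 10, 8))
gate = Gate(hidden_layer_size=16)
print(gate(x).shape)  # (4, 10, 16)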

Exercise 3: Implementing a Gated Residual Network (GRN)

In this exercise, you will implement a Gated Residual Network layer. The GRN combines several key concepts we’ve covered to create a powerful neural network building block. This architecture processes input through two paths:

  1. A direct path that applies a linear transformation (skip connection)
  2. A main path that applies multiple transformations and gating

The GRN processes its input through these paths and combines them using addition and normalization. This architecture is particularly effective for capturing complex patterns while maintaining stable training dynamics.

Complete the implementation of the GRN layer:

@keras.utils.register_keras_serializable(name="GRN")
class GRN(Layer):
    def __init__(self, hidden_layer_size, output_size=None, **kwargs):
        """Initialize the Gated Residual Network layer.
        
        Args:
            hidden_layer_size: Number of units in the hidden layers
            output_size: Number of output units. If None, matches hidden_layer_size
            **kwargs: Additional keyword arguments for the Layer base class
        """
        # TODO: Call parent class initialization
        # TODO: Store hidden_layer_size and output_size parameters
        pass
    
    def build(self, input_shape):
        """Create and configure all sublayers.
        
        Args:
            input_shape: Shape of the input tensor
        
        This method should initialize:
        1. A skip connection (Dense layer)
        2. Two hidden layers (Dense with ELU and Dense)
        3. A Gate layer
        4. An AddAndNorm layer
        """
        # TODO: Set output_size if not specified
        # TODO: Create skip connection Dense layer
        # TODO: Create first hidden Dense layer with ELU activation
        # TODO: Create second hidden Dense layer
        # TODO: Create Gate layer
        # TODO: Create AddAndNorm layer
        # TODO: Build all layers with appropriate shapes
        pass
    
    def call(self, inputs):
        """Forward pass computation.
        
        Args:
            inputs: Input tensor
            
        Returns:
            Processed tensor combining both paths through addition and normalization
        """
        # TODO: Process input through skip connection
        # TODO: Process input through hidden layers
        # TODO: Apply gating to hidden layer output
        # TODO: Combine and normalize skip connection and gated output
        pass


@keras.utils.register_keras_serializable(name="GRN")
class GRN(Layer):
    def __init__(self, hidden_layer_size, output_size=None, **kwargs):
        super(GRN, self).__init__(**kwargs)
        self.hidden_layer_size = hidden_layer_size
        self.output_size = output_size

    def build(self, input_shape):
        if self.output_size is None:
            self.output_size = self.hidden_layer_size
        self.skip_layer = Dense(self.output_size)
        self.skip_layer.build(input_shape)
        
        self.hidden_layer_1 = Dense(self.hidden_layer_size, activation='elu')
        self.hidden_layer_1.build(input_shape)
        self.hidden_layer_2 = Dense(self.hidden_layer_size)
        self.hidden_layer_2.build((*input_shape[:2], self.hidden_layer_size))
        self.gate_layer = Gate(self.output_size)
        self.gate_layer.build((*input_shape[:2], self.hidden_layer_size))
        self.add_and_norm_layer = AddAndNorm()
        self.add_and_norm_layer.build([(*input_shape[:2], self.output_size),(*input_shape[:2], self.output_size)])

    def call(self, inputs):
        skip = self.skip_layer(inputs)
        hidden = self.hidden_layer_1(inputs)
        hidden = self.hidden_layer_2(hidden)
        gating_output = self.gate_layer(hidden)
        return self.add_and_norm_layer([skip, gating_output])

    def compute_output_shape(self, input_shape):
        return input_shape[:-1] + (self.output_size,)

    def get_config(self):
        config = super().get_config()
        config.update({
            'hidden_layer_size': self.hidden_layer_size,
            'output_size': self.output_size,
        })
        return config

When implemented correctly, this layer will:

1. Transform the input through a skip connection to match the desired output size
2. Process the input through two dense layers with nonlinear activation
3. Apply gating to control information flow
4. Combine the gated output with the skip connection using addition and normalization

The GRN architecture allows the network to learn complex transformations while maintaining good gradient flow and stable training characteristics. The combination of gating and skip connections makes it particularly effective for deep architectures.
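A quick illustrative check: with output_size left as None, the GRN’s output feature dimension equals hidden_layer_size.

x = random.normal((4, 10, 8))
grn = GRN(hidden_layer_size=32)
print(grn(x).shape)  # (4, 10, 32)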

Exercise 4: Implementing the Input Embedding Layer

This exercise focuses on creating an embedding layer that handles multivariate time series data in a unique way. Unlike traditional embedding layers used in NLP that map discrete tokens to vectors, this layer learns separate transformations for each continuous feature in our time series.

Let’s understand what this layer does through an example: If we have a time series with 3 features (e.g., price, volume, and volatility) over 100 time steps, our input tensor would have shape (batch_size, 100, 3). For each of these features, we want to learn a separate transformation that captures its unique characteristics. If we set num_hidden to 32, each feature will be transformed into a 32-dimensional representation, while preserving its temporal structure.

Complete the implementation below:

@keras.utils.register_keras_serializable(name="EmbeddingLayer")
class EmbeddingLayer(Layer):
    def __init__(self, num_hidden, **kwargs):
        """Initialize the Embedding Layer.
        
        Args:
            num_hidden: Dimension of the transformed representation for each feature
            **kwargs: Additional keyword arguments for the Layer base class
        """
        # TODO: Call parent class initialization
        # TODO: Store the embedding dimension parameter
        pass
    
    def build(self, input_shape):
        """Create independent transformation layers for each input feature.
        
        For a multivariate time series with n features, this method creates
        n independent dense layers. Each layer will transform its respective
        feature from a scalar to a vector of size num_hidden.
        
        Args:
            input_shape: Tuple (batch_size, time_steps, n_features)
        """
        # TODO: Create a dense layer for each feature in the time series
        # TODO: Configure each dense layer to handle its specific feature
        pass
    
    def call(self, inputs):
        """Transform each feature independently.
        
        For each time series feature, this method:
        1. Extracts the feature across all time steps
        2. Applies its dedicated transformation
        3. Stacks all transformed features together
        
        Args:
            inputs: Tensor of shape (batch_size, time_steps, n_features)
            
        Returns:
            Tensor of shape (batch_size, time_steps, num_hidden, n_features)
        """
        # TODO: Process each feature through its dedicated dense layer
        # TODO: Stack transformed features into final output tensor
        pass

@keras.utils.register_keras_serializable(name="EmbeddingLayer")
class EmbeddingLayer(Layer):
    def __init__(self, num_hidden, **kwargs):
        super(EmbeddingLayer, self).__init__(**kwargs)
        self.num_hidden = num_hidden

    def build(self, input_shape):
        self.dense_layers = [
            Dense(self.num_hidden) for _ in range(input_shape[-1])
        ]
        for i in range(input_shape[-1]):
            self.dense_layers[i].build((*input_shape[:2], 1))
        super(EmbeddingLayer, self).build(input_shape)

    def call(self, inputs):
        embeddings = [dense_layer(inputs[:, :, i:i+1]) for i, dense_layer in enumerate(self.dense_layers)]
        return ops.stack(embeddings, axis=-1)

    def compute_output_shape(self, input_shape):
        return input_shape[:-1] + (self.num_hidden, input_shape[-1])

    def get_config(self):
        config = super().get_config()
        config.update({
            'num_hidden': self.num_hidden,
        })
        return config

The key insight of this layer is that it preserves feature independence while enriching their representations. Unlike a standard dense layer that would mix all features together, this layer ensures that each feature’s temporal patterns are processed separately before any interaction between features occurs in subsequent layers. This design is particularly valuable for time series data where different features may have very different statistical properties and temporal dynamics.
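A quick illustrative check of the resulting shapes: the layer appends a feature axis rather than mixing features together.

x = random.normal((4, 10, 3))          # (batch, time_steps, n_features)
emb = EmbeddingLayer(num_hidden=32)
print(emb(x).shape)                    # (4, 10, 32, 3): one 32-d vector per feature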

Part 2: Building the Complete Model

Now that we have our building blocks, let’s create the complete model using Keras’s functional API. We’ll provide you with the VariableSelectionNetwork implementation, and you’ll need to assemble everything together.

Here’s the variable selection network we’ll use:

@keras.utils.register_keras_serializable(name="VariableSelectionNetwork")
class VariableSelectionNetwork(Layer):
    def __init__(self, num_hidden, **kwargs):
        super(VariableSelectionNetwork, self).__init__(**kwargs)
        self.num_hidden = num_hidden

    def build(self, input_shape):
        batch_size, time_steps, embedding_dim, num_inputs = input_shape
        self.softmax = Activation('softmax')
        self.num_inputs = num_inputs
        self.flatten_dim = time_steps * embedding_dim * num_inputs
        self.reshape_layer = Reshape(target_shape=[time_steps, embedding_dim * num_inputs])
        self.reshape_layer.build(input_shape)
        self.mlp_dense = GRN(hidden_layer_size = self.num_hidden, output_size=num_inputs)
        self.mlp_dense.build((batch_size, time_steps, embedding_dim * num_inputs))
        self.grn_layers = [GRN(self.num_hidden) for _ in range(num_inputs)]
        for i in range(num_inputs):
            self.grn_layers[i].build(input_shape[:3])
        super(VariableSelectionNetwork, self).build(input_shape)

    def call(self, inputs):
        _, time_steps, embedding_dim, num_inputs = inputs.shape
        flatten = self.reshape_layer(inputs)
        # Variable selection weights
        mlp_outputs = self.mlp_dense(flatten)
        sparse_weights = ops.softmax(mlp_outputs)
        sparse_weights = ops.expand_dims(sparse_weights, axis=2)
        
        # Non-linear Processing & weight application
        trans_emb_list = []
        for i in range(num_inputs):
            grn_output = self.grn_layers[i](inputs[:, :, :, i])
            trans_emb_list.append(grn_output)
        
        transformed_embedding = ops.stack(trans_emb_list, axis=-1)
        combined = ops.multiply(sparse_weights, transformed_embedding)
        temporal_ctx = ops.sum(combined, axis=-1)
        
        return temporal_ctx

    def get_config(self):
        config = super().get_config()
        config.update({
            'num_hidden': self.num_hidden,
        })
        return config

Exercise 5: Building a Transformer-Based Model for Time Series Forecasting

This exercise focuses on creating a transformer architecture specifically designed for time series forecasting. The model has a unique characteristic: it handles both “known” and “unknown” features differently in its predictions.

Let’s understand the key concepts:

Known vs Unknown Features

The model processes two types of features:

- Unknown features: values that are only available up to the present (like stock prices or trading volumes)
- Known features: values that are available for future time points (like hour of day or day of week)

Time Windows

The model works with two temporal segments:

- Historical window: past observations of length sequence_length
- Future window: future period of length n_ahead where we want to make predictions

Encoder-Decoder Architecture

The architecture uses an encoder-decoder pattern with LSTM networks:

- Encoder: processes the historical data to create a representation of past patterns
- Decoder: uses this representation along with known future features to generate predictions
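Concretely, generate_data_w_known_inputs stacks the two windows along the time axis and zero-fills the future values of the unknown features; a quick check of one sample:

# Each sample has shape (sequence_length + n_ahead, 19 unknown + 2 known) = (90, 21):
#   rows 0..74  -> observed volumes and calendar features
#   rows 75..89 -> zeros for the 19 unknown features, real values for hour/dayofweek
sample = X_train_unscaled[0]
assert (sample[sequence_length:, :len(assets)] == 0).all()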

Complete this implementation that assembles all the components we’ve built:

def create_transformer_model(sequence_length, num_unknow_features, num_know_features, 
                           num_embedding, num_hidden, num_heads, n_ahead):
    """
    Build a transformer-based forecasting model.
    
    The model processes the input through several stages:
    1. Feature embedding and separation
    2. Variable selection for past and future sequences
    3. LSTM-based encoding and decoding
    4. Attention mechanism for temporal patterns
    5. Final prediction generation
    
    Args:
        sequence_length: Number of past time steps to use
        num_unknow_features: Number of observed features (e.g., trading volumes)
        num_know_features: Number of known features (e.g., hour, day of week)
        num_embedding: Size of the embedding dimension
        num_hidden: Number of hidden units in layers
        num_heads: Number of attention heads
        n_ahead: Number of future time steps to predict
    
    Returns:
        keras.Model: Assembled transformer model
    """
    # TODO: Create input layer that accepts both past and future time steps
    
    # TODO: Apply embedding to transform all features
    
    # TODO: Split embedded features into past and future sequences
    # Note: Future sequence only includes known features
    
    # TODO: Apply variable selection to both sequences independently
    
    # TODO: Create and apply LSTM encoder to past sequence
    # Important: Capture encoder states for decoder initialization
    
    # TODO: Create and apply LSTM decoder to future sequence
    # Important: Initialize with encoder states
    
    # TODO: Combine encoder and decoder outputs to maintain full history
    
    # TODO: Add gated residual connection between LSTM and variable selection paths
    
    # TODO: Apply GRN transformation
    
    # TODO: Apply multi-head attention to capture temporal dependencies
    
    # TODO: Create final prediction layer


A reference implementation:

def create_transformer_model(sequence_length, num_unknow_features, num_know_features,
                           num_embedding, num_hidden, num_heads, n_ahead):
    """
    Creates a transformer-based model for time series forecasting.
    
    Args:
        sequence_length: Number of past time steps to use
        num_unknow_features: Number of observed features (e.g., trading volumes)
        num_know_features: Number of known features (e.g., hour, day of week)
        num_embedding: Size of the embedding dimension
        num_hidden: Number of hidden units in layers
        num_heads: Number of attention heads
        n_ahead: Number of future time steps to predict
    
    Returns:
        keras.Model: Assembled transformer model
    """
    

    inputs = Input(shape=(sequence_length+n_ahead, num_unknow_features + num_know_features))

    embedded_inputs = EmbeddingLayer(num_embedding, name = 'embedding_layer')(inputs)

    past_features = Lambda(lambda x: x[:, :sequence_length, :, :], name='past_observed_and_known')(embedded_inputs)
    variable_selection_past = VariableSelectionNetwork(num_hidden, name='vsn_past_features')(past_features)

    future_features = Lambda(lambda x: x[:,sequence_length:,:,-num_know_features:], name='future_known')(embedded_inputs)
    variable_selection_future = VariableSelectionNetwork(num_hidden, name='vsn_future_features')(future_features)

    
    # Encoder-Decoder with LSTM
    encoder = LSTM(num_hidden, return_sequences=True, return_state=True, name='encoder')
    encode_out, *encode_states = encoder(variable_selection_past)
    
    decoder = LSTM(num_hidden, return_sequences=True, name='decoder')
    decode_out = decoder(variable_selection_future, 
                        initial_state=encode_states)
    
    # Combine encoder and decoder outputs
    history = Concatenate(axis=1)([encode_out, decode_out])
    
    # Add residual connection with gating
    selected = Concatenate(axis=1)([variable_selection_past, 
                                   variable_selection_future])
    all_context = AddAndNorm()([Gate()(history), selected])
    
    # Apply GRN before attention
    enriched = GRN(num_hidden)(all_context)
    
    # Multi-head attention
    attention_output = MultiHeadAttention(
        num_heads=num_heads, 
        key_dim=enriched.shape[-1]
    )(enriched, enriched, enriched)
    
    # Final prediction layers
    flattened_output = Flatten()(attention_output)
    outputs = Dense(n_ahead)(flattened_output)
    
    return Model(inputs=inputs, outputs=outputs)

Exercise 6: Training and Evaluation

Now let’s put everything together and train the model. Complete this code to:

1. Create the model with appropriate parameters
2. Compile it with suitable loss and optimizer
3. Train it on our cryptocurrency data
4. Evaluate its performance

# TODO: Create and compile model

# TODO: Train model

# TODO: Evaluate and visualize results


A reference implementation:

# Create model
model = create_transformer_model(
    sequence_length=sequence_length,
    num_unknow_features=len(assets),  # Number of crypto assets
    num_know_features=2,              # Hour and day of week
    num_embedding=num_embedding,
    num_hidden=num_hidden,
    num_heads=num_heads,
    n_ahead=n_ahead
)

# Compile model
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss='mse',
    metrics=['mae'],
    jit_compile=True
)

model.summary()

# Create callbacks
callbacks = [
    keras.callbacks.EarlyStopping(
        monitor='val_loss',
        patience=10,
        restore_best_weights=True
    ),
    keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=5,
        min_lr=1e-6
    )
]

# Train model
history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=256,
    callbacks=callbacks,
    verbose=1
)

# Evaluate model
test_loss = model.evaluate(X_test, y_test)
print(f"Test Loss: {test_loss[0]:.4f}")
print(f"Test MAE: {test_loss[1]:.4f}")
Model: "functional"
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)        ┃ Output Shape      ┃    Param # ┃ Connected to      ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ input_layer         │ (None, 90, 21)    │          0 │ -                 │
│ (InputLayer)        │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ embedding_layer     │ (None, 90, 1, 21) │         42 │ input_layer[0][0] │
│ (EmbeddingLayer)    │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ past_observed_and_… │ (None, 75, 1, 21) │          0 │ embedding_layer[… │
│ (Lambda)            │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ vsn_past_features   │ (None, 75, 100)   │    665,946 │ past_observed_an… │
│ (VariableSelection… │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ future_known        │ (None, 15, 1, 2)  │          0 │ embedding_layer[… │
│ (Lambda)            │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ encoder (LSTM)      │ [(None, 75, 100), │     80,400 │ vsn_past_feature… │
│                     │ (None, 100),      │            │                   │
│                     │ (None, 100)]      │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ vsn_future_features │ (None, 15, 100)   │     72,614 │ future_known[0][… │
│ (VariableSelection… │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ decoder (LSTM)      │ (None, 15, 100)   │     80,400 │ vsn_future_featu… │
│                     │                   │            │ encoder[0][1],    │
│                     │                   │            │ encoder[0][2]     │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ concatenate         │ (None, 90, 100)   │          0 │ encoder[0][0],    │
│ (Concatenate)       │                   │            │ decoder[0][0]     │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ gate_25 (Gate)      │ (None, 90, 100)   │     20,200 │ concatenate[0][0] │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ concatenate_1       │ (None, 90, 100)   │          0 │ vsn_past_feature… │
│ (Concatenate)       │                   │            │ vsn_future_featu… │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ add_and_norm_25     │ (None, 90, 100)   │        200 │ gate_25[0][0],    │
│ (AddAndNorm)        │                   │            │ concatenate_1[0]… │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ grn_25 (GRN)        │ (None, 90, 100)   │     50,700 │ add_and_norm_25[… │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ multi_head_attenti… │ (None, 90, 100)   │    161,300 │ grn_25[0][0],     │
│ (MultiHeadAttentio… │                   │            │ grn_25[0][0],     │
│                     │                   │            │ grn_25[0][0]      │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ flatten (Flatten)   │ (None, 9000)      │          0 │ multi_head_atten… │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ dense_153 (Dense)   │ (None, 15)        │    135,015 │ flatten[0][0]     │
└─────────────────────┴───────────────────┴────────────┴───────────────────┘
 Total params: 1,266,817 (4.83 MB)
 Trainable params: 1,266,817 (4.83 MB)
 Non-trainable params: 0 (0.00 B)
Epoch 1/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 115s 2s/step - loss: 0.1638 - mae: 0.2396 - val_loss: 0.0040 - val_mae: 0.0408 - learning_rate: 0.0010
Epoch 2/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 100s 1s/step - loss: 0.0033 - mae: 0.0355 - val_loss: 0.0039 - val_mae: 0.0375 - learning_rate: 0.0010
Epoch 3/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 95s 1s/step - loss: 0.0031 - mae: 0.0339 - val_loss: 0.0037 - val_mae: 0.0392 - learning_rate: 0.0010
Epoch 4/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 93s 1s/step - loss: 0.0030 - mae: 0.0336 - val_loss: 0.0038 - val_mae: 0.0365 - learning_rate: 0.0010
Epoch 5/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 94s 1s/step - loss: 0.0032 - mae: 0.0340 - val_loss: 0.0036 - val_mae: 0.0381 - learning_rate: 0.0010
Epoch 6/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 93s 1s/step - loss: 0.0031 - mae: 0.0335 - val_loss: 0.0037 - val_mae: 0.0368 - learning_rate: 0.0010
Epoch 7/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 93s 1s/step - loss: 0.0031 - mae: 0.0332 - val_loss: 0.0036 - val_mae: 0.0350 - learning_rate: 0.0010
Epoch 8/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 91s 1s/step - loss: 0.0031 - mae: 0.0330 - val_loss: 0.0035 - val_mae: 0.0359 - learning_rate: 0.0010
Epoch 9/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 93s 1s/step - loss: 0.0030 - mae: 0.0329 - val_loss: 0.0036 - val_mae: 0.0362 - learning_rate: 0.0010
Epoch 10/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 95s 1s/step - loss: 0.0029 - mae: 0.0321 - val_loss: 0.0035 - val_mae: 0.0401 - learning_rate: 0.0010
Epoch 11/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 95s 1s/step - loss: 0.0031 - mae: 0.0329 - val_loss: 0.0035 - val_mae: 0.0383 - learning_rate: 0.0010
Epoch 12/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 96s 1s/step - loss: 0.0029 - mae: 0.0323 - val_loss: 0.0034 - val_mae: 0.0392 - learning_rate: 0.0010
Epoch 13/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 96s 1s/step - loss: 0.0028 - mae: 0.0321 - val_loss: 0.0035 - val_mae: 0.0330 - learning_rate: 0.0010
Epoch 14/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 97s 1s/step - loss: 0.0027 - mae: 0.0308 - val_loss: 0.0033 - val_mae: 0.0361 - learning_rate: 0.0010
Epoch 15/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 96s 1s/step - loss: 0.0029 - mae: 0.0315 - val_loss: 0.0035 - val_mae: 0.0331 - learning_rate: 0.0010
Epoch 16/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 95s 1s/step - loss: 0.0028 - mae: 0.0314 - val_loss: 0.0033 - val_mae: 0.0376 - learning_rate: 0.0010
Epoch 17/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 94s 1s/step - loss: 0.0026 - mae: 0.0305 - val_loss: 0.0033 - val_mae: 0.0342 - learning_rate: 0.0010
Epoch 18/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 95s 1s/step - loss: 0.0028 - mae: 0.0309 - val_loss: 0.0033 - val_mae: 0.0340 - learning_rate: 0.0010
Epoch 19/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 95s 1s/step - loss: 0.0027 - mae: 0.0302 - val_loss: 0.0033 - val_mae: 0.0391 - learning_rate: 0.0010
Epoch 20/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 96s 1s/step - loss: 0.0027 - mae: 0.0309 - val_loss: 0.0032 - val_mae: 0.0357 - learning_rate: 0.0010
Epoch 21/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 97s 1s/step - loss: 0.0027 - mae: 0.0302 - val_loss: 0.0032 - val_mae: 0.0364 - learning_rate: 0.0010
Epoch 22/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 101s 1s/step - loss: 0.0026 - mae: 0.0303 - val_loss: 0.0032 - val_mae: 0.0367 - learning_rate: 5.0000e-04
Epoch 23/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 96s 1s/step - loss: 0.0027 - mae: 0.0304 - val_loss: 0.0033 - val_mae: 0.0332 - learning_rate: 5.0000e-04
Epoch 24/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 96s 1s/step - loss: 0.0026 - mae: 0.0301 - val_loss: 0.0032 - val_mae: 0.0339 - learning_rate: 5.0000e-04
Epoch 25/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 95s 1s/step - loss: 0.0026 - mae: 0.0298 - val_loss: 0.0033 - val_mae: 0.0328 - learning_rate: 5.0000e-04
Epoch 26/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 96s 1s/step - loss: 0.0026 - mae: 0.0299 - val_loss: 0.0032 - val_mae: 0.0350 - learning_rate: 5.0000e-04
Epoch 27/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 93s 1s/step - loss: 0.0026 - mae: 0.0299 - val_loss: 0.0032 - val_mae: 0.0343 - learning_rate: 2.5000e-04
Epoch 28/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 91s 1s/step - loss: 0.0026 - mae: 0.0297 - val_loss: 0.0032 - val_mae: 0.0363 - learning_rate: 2.5000e-04
Epoch 29/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 93s 1s/step - loss: 0.0027 - mae: 0.0301 - val_loss: 0.0032 - val_mae: 0.0356 - learning_rate: 2.5000e-04
Epoch 30/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 92s 1s/step - loss: 0.0026 - mae: 0.0298 - val_loss: 0.0032 - val_mae: 0.0336 - learning_rate: 2.5000e-04
Epoch 31/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 92s 1s/step - loss: 0.0026 - mae: 0.0296 - val_loss: 0.0032 - val_mae: 0.0333 - learning_rate: 2.5000e-04
Epoch 32/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 95s 1s/step - loss: 0.0026 - mae: 0.0298 - val_loss: 0.0032 - val_mae: 0.0350 - learning_rate: 1.2500e-04
Epoch 33/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 95s 1s/step - loss: 0.0026 - mae: 0.0295 - val_loss: 0.0032 - val_mae: 0.0345 - learning_rate: 1.2500e-04
Epoch 34/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 96s 1s/step - loss: 0.0026 - mae: 0.0297 - val_loss: 0.0032 - val_mae: 0.0340 - learning_rate: 1.2500e-04
Epoch 35/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 94s 1s/step - loss: 0.0025 - mae: 0.0294 - val_loss: 0.0033 - val_mae: 0.0335 - learning_rate: 1.2500e-04
Epoch 36/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 93s 1s/step - loss: 0.0026 - mae: 0.0298 - val_loss: 0.0032 - val_mae: 0.0350 - learning_rate: 1.2500e-04
162/162 ━━━━━━━━━━━━━━━━━━━━ 11s 65ms/step - loss: 0.0075 - mae: 0.0365 
Test Loss: 0.0044
Test MAE: 0.0289
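Since r2_score was imported in the setup, we can also sketch an evaluation on the original volume scale by undoing both scaling steps (the min-max scaling, then the rolling-median normalization); this is illustrative and assumes the variables produced by generate_data_w_known_inputs:

# Illustrative evaluation on the original volume scale (a sketch)
preds = model.predict(X_test, verbose=0)                 # (n_samples, n_ahead), min-max scaled
preds_ratio = y_scaler.inverse_transform(preds)          # undo min-max -> volume ratios
preds_volume = preds_ratio * y_scaler_test.squeeze(-1)   # undo rolling-median -> raw volumes
true_volume = y_test_unscaled.squeeze(-1) * y_scaler_test.squeeze(-1)
print("R2 (scaled space):", r2_score(y_test, preds))
print("R2 (volume space):", r2_score(true_volume, preds_volume))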

Now compare this to a simple two-layer LSTM model

# Create model
model = Sequential([
    Input(shape=X_train.shape[1:]),
    LSTM(100, return_sequences=True),
    LSTM(100, return_sequences=False),
    Dense(units=n_ahead, activation='linear')
])

# Compile model
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss='mse',
    metrics=['mae'],
    jit_compile=True
)

model.summary()

# Create callbacks
callbacks = [
    keras.callbacks.EarlyStopping(
        monitor='val_loss',
        patience=10,
        restore_best_weights=True
    ),
    keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=5,
        min_lr=1e-6
    )
]

# Train model
history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=256,
    callbacks=callbacks,
    verbose=1
)

# Evaluate model
test_loss = model.evaluate(X_test, y_test)
print(f"Test Loss: {test_loss[0]:.4f}")
print(f"Test MAE: {test_loss[1]:.4f}")
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ lstm (LSTM)                     │ (None, 90, 100)        │        48,800 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ lstm_1 (LSTM)                   │ (None, 100)            │        80,400 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_154 (Dense)               │ (None, 15)             │         1,515 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 130,715 (510.61 KB)
 Trainable params: 130,715 (510.61 KB)
 Non-trainable params: 0 (0.00 B)
Epoch 1/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 17s 257ms/step - loss: 0.0039 - mae: 0.0392 - val_loss: 0.0034 - val_mae: 0.0403 - learning_rate: 0.0010
Epoch 2/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 16s 242ms/step - loss: 0.0029 - mae: 0.0318 - val_loss: 0.0033 - val_mae: 0.0353 - learning_rate: 0.0010
Epoch 3/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 239ms/step - loss: 0.0028 - mae: 0.0312 - val_loss: 0.0033 - val_mae: 0.0363 - learning_rate: 0.0010
Epoch 4/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 17s 259ms/step - loss: 0.0028 - mae: 0.0311 - val_loss: 0.0033 - val_mae: 0.0345 - learning_rate: 0.0010
Epoch 5/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 17s 259ms/step - loss: 0.0027 - mae: 0.0309 - val_loss: 0.0033 - val_mae: 0.0362 - learning_rate: 0.0010
Epoch 6/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 17s 258ms/step - loss: 0.0028 - mae: 0.0308 - val_loss: 0.0033 - val_mae: 0.0340 - learning_rate: 0.0010
Epoch 7/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 17s 256ms/step - loss: 0.0028 - mae: 0.0308 - val_loss: 0.0033 - val_mae: 0.0362 - learning_rate: 0.0010
Epoch 8/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 16s 244ms/step - loss: 0.0026 - mae: 0.0301 - val_loss: 0.0033 - val_mae: 0.0363 - learning_rate: 5.0000e-04
Epoch 9/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 237ms/step - loss: 0.0026 - mae: 0.0301 - val_loss: 0.0034 - val_mae: 0.0380 - learning_rate: 5.0000e-04
Epoch 10/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 234ms/step - loss: 0.0027 - mae: 0.0305 - val_loss: 0.0033 - val_mae: 0.0362 - learning_rate: 5.0000e-04
Epoch 11/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 16s 245ms/step - loss: 0.0026 - mae: 0.0301 - val_loss: 0.0033 - val_mae: 0.0371 - learning_rate: 5.0000e-04
Epoch 12/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 16s 248ms/step - loss: 0.0026 - mae: 0.0301 - val_loss: 0.0033 - val_mae: 0.0330 - learning_rate: 5.0000e-04
Epoch 13/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 16s 243ms/step - loss: 0.0026 - mae: 0.0299 - val_loss: 0.0033 - val_mae: 0.0362 - learning_rate: 2.5000e-04
Epoch 14/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 16s 249ms/step - loss: 0.0025 - mae: 0.0297 - val_loss: 0.0033 - val_mae: 0.0348 - learning_rate: 2.5000e-04
Epoch 15/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 16s 239ms/step - loss: 0.0025 - mae: 0.0295 - val_loss: 0.0033 - val_mae: 0.0336 - learning_rate: 2.5000e-04
Epoch 16/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 17s 257ms/step - loss: 0.0025 - mae: 0.0292 - val_loss: 0.0032 - val_mae: 0.0358 - learning_rate: 2.5000e-04
Epoch 17/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 17s 257ms/step - loss: 0.0025 - mae: 0.0298 - val_loss: 0.0033 - val_mae: 0.0339 - learning_rate: 2.5000e-04
Epoch 18/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 228ms/step - loss: 0.0025 - mae: 0.0295 - val_loss: 0.0032 - val_mae: 0.0350 - learning_rate: 1.2500e-04
Epoch 19/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 236ms/step - loss: 0.0026 - mae: 0.0297 - val_loss: 0.0033 - val_mae: 0.0338 - learning_rate: 1.2500e-04
Epoch 20/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 16s 247ms/step - loss: 0.0025 - mae: 0.0293 - val_loss: 0.0033 - val_mae: 0.0357 - learning_rate: 1.2500e-04
Epoch 21/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 227ms/step - loss: 0.0025 - mae: 0.0295 - val_loss: 0.0033 - val_mae: 0.0356 - learning_rate: 1.2500e-04
Epoch 22/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 229ms/step - loss: 0.0025 - mae: 0.0296 - val_loss: 0.0033 - val_mae: 0.0346 - learning_rate: 1.2500e-04
Epoch 23/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 225ms/step - loss: 0.0024 - mae: 0.0291 - val_loss: 0.0033 - val_mae: 0.0348 - learning_rate: 6.2500e-05
Epoch 24/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 14s 219ms/step - loss: 0.0024 - mae: 0.0290 - val_loss: 0.0033 - val_mae: 0.0349 - learning_rate: 6.2500e-05
Epoch 25/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 14s 218ms/step - loss: 0.0025 - mae: 0.0293 - val_loss: 0.0033 - val_mae: 0.0352 - learning_rate: 6.2500e-05
Epoch 26/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 14s 222ms/step - loss: 0.0024 - mae: 0.0292 - val_loss: 0.0033 - val_mae: 0.0351 - learning_rate: 6.2500e-05
Epoch 27/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 14s 220ms/step - loss: 0.0024 - mae: 0.0289 - val_loss: 0.0033 - val_mae: 0.0347 - learning_rate: 6.2500e-05
Epoch 28/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 230ms/step - loss: 0.0024 - mae: 0.0292 - val_loss: 0.0034 - val_mae: 0.0352 - learning_rate: 3.1250e-05
162/162 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - loss: 0.0088 - mae: 0.0389
Test Loss: 0.0051
Test MAE: 0.0298
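
A second variant keeps return_sequences=True on both LSTM layers and flattens the full sequence output before the dense head, giving the final layer direct access to every time step at the cost of roughly twice as many parameters: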
# Create model
model = Sequential([
    Input(shape=X_train.shape[1:]),
    LSTM(100, return_sequences=True),
    LSTM(100, return_sequences=True),
    keras.layers.Flatten(),
    Dense(units=n_ahead, activation='linear')
])

# Compile model
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss='mse',
    metrics=['mae'],
    jit_compile=True
)

model.summary()

# Create callbacks
callbacks = [
    keras.callbacks.EarlyStopping(
        monitor='val_loss',
        patience=10,
        restore_best_weights=True
    ),
    keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=5,
        min_lr=1e-6
    )
]

# Train model
history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=256,
    callbacks=callbacks,
    verbose=1
)

# Evaluate model
test_loss = model.evaluate(X_test, y_test)
print(f"Test Loss: {test_loss[0]:.4f}")
print(f"Test MAE: {test_loss[1]:.4f}")
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ lstm_2 (LSTM)                   │ (None, 90, 100)        │        48,800 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ lstm_3 (LSTM)                   │ (None, 90, 100)        │        80,400 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ flatten_1 (Flatten)             │ (None, 9000)           │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_155 (Dense)               │ (None, 15)             │       135,015 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 264,215 (1.01 MB)
 Trainable params: 264,215 (1.01 MB)
 Non-trainable params: 0 (0.00 B)
Epoch 1/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 17s 263ms/step - loss: 0.0066 - mae: 0.0542 - val_loss: 0.0034 - val_mae: 0.0366 - learning_rate: 0.0010
Epoch 2/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 234ms/step - loss: 0.0028 - mae: 0.0313 - val_loss: 0.0033 - val_mae: 0.0346 - learning_rate: 0.0010
Epoch 3/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 232ms/step - loss: 0.0027 - mae: 0.0309 - val_loss: 0.0033 - val_mae: 0.0367 - learning_rate: 0.0010
Epoch 4/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 237ms/step - loss: 0.0027 - mae: 0.0306 - val_loss: 0.0033 - val_mae: 0.0345 - learning_rate: 0.0010
Epoch 5/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 16s 250ms/step - loss: 0.0027 - mae: 0.0303 - val_loss: 0.0033 - val_mae: 0.0365 - learning_rate: 0.0010
Epoch 6/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 238ms/step - loss: 0.0027 - mae: 0.0302 - val_loss: 0.0033 - val_mae: 0.0357 - learning_rate: 0.0010
Epoch 7/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 16s 241ms/step - loss: 0.0026 - mae: 0.0299 - val_loss: 0.0033 - val_mae: 0.0366 - learning_rate: 0.0010
Epoch 8/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 16s 244ms/step - loss: 0.0026 - mae: 0.0301 - val_loss: 0.0033 - val_mae: 0.0349 - learning_rate: 0.0010
Epoch 9/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 16s 239ms/step - loss: 0.0027 - mae: 0.0302 - val_loss: 0.0033 - val_mae: 0.0345 - learning_rate: 0.0010
Epoch 10/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 16s 249ms/step - loss: 0.0026 - mae: 0.0298 - val_loss: 0.0032 - val_mae: 0.0347 - learning_rate: 5.0000e-04
Epoch 11/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 16s 242ms/step - loss: 0.0027 - mae: 0.0300 - val_loss: 0.0033 - val_mae: 0.0364 - learning_rate: 5.0000e-04
Epoch 12/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 227ms/step - loss: 0.0026 - mae: 0.0298 - val_loss: 0.0032 - val_mae: 0.0350 - learning_rate: 5.0000e-04
Epoch 13/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 231ms/step - loss: 0.0026 - mae: 0.0297 - val_loss: 0.0032 - val_mae: 0.0362 - learning_rate: 5.0000e-04
Epoch 14/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 233ms/step - loss: 0.0026 - mae: 0.0298 - val_loss: 0.0032 - val_mae: 0.0347 - learning_rate: 5.0000e-04
Epoch 15/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 231ms/step - loss: 0.0025 - mae: 0.0293 - val_loss: 0.0032 - val_mae: 0.0343 - learning_rate: 2.5000e-04
Epoch 16/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 234ms/step - loss: 0.0025 - mae: 0.0291 - val_loss: 0.0032 - val_mae: 0.0347 - learning_rate: 2.5000e-04
Epoch 17/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 237ms/step - loss: 0.0025 - mae: 0.0293 - val_loss: 0.0032 - val_mae: 0.0346 - learning_rate: 2.5000e-04
Epoch 18/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 230ms/step - loss: 0.0025 - mae: 0.0292 - val_loss: 0.0032 - val_mae: 0.0335 - learning_rate: 2.5000e-04
Epoch 19/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 235ms/step - loss: 0.0026 - mae: 0.0294 - val_loss: 0.0032 - val_mae: 0.0360 - learning_rate: 2.5000e-04
Epoch 20/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 231ms/step - loss: 0.0025 - mae: 0.0290 - val_loss: 0.0032 - val_mae: 0.0348 - learning_rate: 1.2500e-04
Epoch 21/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 235ms/step - loss: 0.0024 - mae: 0.0289 - val_loss: 0.0032 - val_mae: 0.0344 - learning_rate: 1.2500e-04
Epoch 22/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 16s 240ms/step - loss: 0.0025 - mae: 0.0293 - val_loss: 0.0032 - val_mae: 0.0346 - learning_rate: 1.2500e-04
Epoch 23/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 16s 245ms/step - loss: 0.0025 - mae: 0.0292 - val_loss: 0.0032 - val_mae: 0.0341 - learning_rate: 1.2500e-04
Epoch 24/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 235ms/step - loss: 0.0024 - mae: 0.0291 - val_loss: 0.0033 - val_mae: 0.0340 - learning_rate: 1.2500e-04
Epoch 25/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 235ms/step - loss: 0.0025 - mae: 0.0290 - val_loss: 0.0032 - val_mae: 0.0350 - learning_rate: 6.2500e-05
Epoch 26/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 239ms/step - loss: 0.0025 - mae: 0.0291 - val_loss: 0.0032 - val_mae: 0.0342 - learning_rate: 6.2500e-05
Epoch 27/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 235ms/step - loss: 0.0025 - mae: 0.0289 - val_loss: 0.0032 - val_mae: 0.0342 - learning_rate: 6.2500e-05
Epoch 28/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 236ms/step - loss: 0.0024 - mae: 0.0288 - val_loss: 0.0032 - val_mae: 0.0347 - learning_rate: 6.2500e-05
Epoch 29/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 232ms/step - loss: 0.0024 - mae: 0.0286 - val_loss: 0.0032 - val_mae: 0.0350 - learning_rate: 6.2500e-05
Epoch 30/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 16s 244ms/step - loss: 0.0024 - mae: 0.0287 - val_loss: 0.0032 - val_mae: 0.0352 - learning_rate: 3.1250e-05
162/162 ━━━━━━━━━━━━━━━━━━━━ 4s 23ms/step - loss: 0.0093 - mae: 0.0395
Test Loss: 0.0055
Test MAE: 0.0304

Finally, install the tkan package and replace the LSTM layers with TKAN ones
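
Before running the cell below, the tkan package must be available in the environment. A minimal setup sketch, assuming the package is published on PyPI as tkan and exposes the TKAN layer at its top level:

# Install once, e.g. from a notebook cell:
# !pip install tkan

# TKAN is used below as a drop-in replacement for LSTM: it accepts an
# LSTM-like interface (return_sequences, return_state, initial_state)
from tkan import TKAN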

def create_transformer_model(sequence_length, num_unknow_features, num_know_features, 
                           num_embedding, num_hidden, num_heads, n_ahead):
    """
    Creates a transformer-based model for time series forecasting.
    
    Args:
        sequence_length: Number of past time steps to use
        num_unknow_features: Number of observed features (e.g., trading volumes)
        num_know_features: Number of known features (e.g., hour, day of week)
        num_embedding: Size of the embedding dimension
        num_hidden: Number of hidden units in layers
        num_heads: Number of attention heads
        n_ahead: Number of future time steps to predict
    
    Returns:
        keras.Model: Assembled transformer model
    """
    

    inputs = Input(shape=(sequence_length+n_ahead, num_unknow_features + num_know_features))

    embedded_inputs = EmbeddingLayer(num_embedding, name = 'embedding_layer')(inputs)

    # Past window: embedded observed + known features over the first `sequence_length` steps
    past_features = Lambda(lambda x: x[:, :sequence_length, :, :], name='past_observed_and_known')(embedded_inputs)
    variable_selection_past = VariableSelectionNetwork(num_hidden, name='vsn_past_features')(past_features)

    # Future window: only the known features (e.g. calendar variables) over the last `n_ahead` steps
    future_features = Lambda(lambda x: x[:, sequence_length:, :, -num_know_features:], name='future_known')(embedded_inputs)
    variable_selection_future = VariableSelectionNetwork(num_hidden, name='vsn_future_features')(future_features)

    
    # Encoder-decoder: the recurrent LSTM layers are replaced by TKAN layers
    encoder = TKAN(num_hidden, return_sequences=True, return_state=True, name='encoder')
    encode_out, *encode_states = encoder(variable_selection_past)
    
    decoder = TKAN(num_hidden, return_sequences=True, name='decoder')
    decode_out = decoder(variable_selection_future, 
                        initial_state=encode_states)
    
    # Combine encoder and decoder outputs
    history = Concatenate(axis=1)([encode_out, decode_out])
    
    # Add residual connection with gating
    selected = Concatenate(axis=1)([variable_selection_past, 
                                   variable_selection_future])
    all_context = AddAndNorm()([Gate()(history), selected])
    
    # Apply GRN before attention
    enriched = GRN(num_hidden)(all_context)
    
    # Multi-head attention
    attention_output = MultiHeadAttention(
        num_heads=num_heads, 
        key_dim=enriched.shape[-1]
    )(enriched, enriched, enriched)
    
    # Final prediction layers
    flattened_output = Flatten()(attention_output)
    outputs = Dense(n_ahead)(flattened_output)
    
    return Model(inputs=inputs, outputs=outputs)

model = create_transformer_model(
    sequence_length=sequence_length,
    num_unknow_features=len(assets),  # Number of crypto assets
    num_know_features=2,              # Hour and day of week
    num_embedding=num_embedding,
    num_hidden=num_hidden,
    num_heads=num_heads,
    n_ahead=n_ahead
)

# Compile model
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss='mse',
    metrics=['mae'],
    jit_compile=True
)

model.summary()

# Create callbacks
callbacks = [
    keras.callbacks.EarlyStopping(
        monitor='val_loss',
        patience=10,
        restore_best_weights=True
    ),
    keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=5,
        min_lr=1e-6
    )
]

# Train model
history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=256,
    callbacks=callbacks,
    verbose=1
)

# Evaluate model
test_loss = model.evaluate(X_test, y_test)
print(f"Test Loss: {test_loss[0]:.4f}")
print(f"Test MAE: {test_loss[1]:.4f}")
Model: "functional_3"
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)        ┃ Output Shape      ┃    Param # ┃ Connected to      ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ input_layer_3       │ (None, 90, 21)    │          0 │ -                 │
│ (InputLayer)        │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ embedding_layer     │ (None, 90, 1, 21) │         42 │ input_layer_3[0]… │
│ (EmbeddingLayer)    │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ past_observed_and_… │ (None, 75, 1, 21) │          0 │ embedding_layer[… │
│ (Lambda)            │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ vsn_past_features   │ (None, 75, 100)   │    665,946 │ past_observed_an… │
│ (VariableSelection… │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ future_known        │ (None, 15, 1, 2)  │          0 │ embedding_layer[… │
│ (Lambda)            │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ encoder (TKAN)      │ [(None, 75, 100), │    161,900 │ vsn_past_feature… │
│                     │ (None, 100),      │            │                   │
│                     │ (None, 100),      │            │                   │
│                     │ (None, 1)]        │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ vsn_future_features │ (None, 15, 100)   │     72,614 │ future_known[0][… │
│ (VariableSelection… │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ decoder (TKAN)      │ (None, 15, 100)   │    161,900 │ vsn_future_featu… │
│                     │                   │            │ encoder[0][1],    │
│                     │                   │            │ encoder[0][2],    │
│                     │                   │            │ encoder[0][3]     │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ concatenate_2       │ (None, 90, 100)   │          0 │ encoder[0][0],    │
│ (Concatenate)       │                   │            │ decoder[0][0]     │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ gate_52 (Gate)      │ (None, 90, 100)   │     20,200 │ concatenate_2[0]… │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ concatenate_3       │ (None, 90, 100)   │          0 │ vsn_past_feature… │
│ (Concatenate)       │                   │            │ vsn_future_featu… │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ add_and_norm_52     │ (None, 90, 100)   │        200 │ gate_52[0][0],    │
│ (AddAndNorm)        │                   │            │ concatenate_3[0]… │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ grn_51 (GRN)        │ (None, 90, 100)   │     50,700 │ add_and_norm_52[… │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ multi_head_attenti… │ (None, 90, 100)   │    161,300 │ grn_51[0][0],     │
│ (MultiHeadAttentio… │                   │            │ grn_51[0][0],     │
│                     │                   │            │ grn_51[0][0]      │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ flatten_2 (Flatten) │ (None, 9000)      │          0 │ multi_head_atten… │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ dense_309 (Dense)   │ (None, 15)        │    135,015 │ flatten_2[0][0]   │
└─────────────────────┴───────────────────┴────────────┴───────────────────┘
 Total params: 1,429,817 (5.45 MB)
 Trainable params: 1,427,817 (5.45 MB)
 Non-trainable params: 2,000 (7.81 KB)
Epoch 1/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 154s 2s/step - loss: 0.1629 - mae: 0.2282 - val_loss: 0.0040 - val_mae: 0.0412 - learning_rate: 0.0010
Epoch 2/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 134s 2s/step - loss: 0.0034 - mae: 0.0361 - val_loss: 0.0039 - val_mae: 0.0399 - learning_rate: 0.0010
Epoch 3/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 128s 2s/step - loss: 0.0034 - mae: 0.0356 - val_loss: 0.0041 - val_mae: 0.0401 - learning_rate: 0.0010
Epoch 4/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 129s 2s/step - loss: 0.0034 - mae: 0.0362 - val_loss: 0.0039 - val_mae: 0.0389 - learning_rate: 0.0010
Epoch 5/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 131s 2s/step - loss: 0.0032 - mae: 0.0346 - val_loss: 0.0038 - val_mae: 0.0403 - learning_rate: 0.0010
Epoch 6/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 129s 2s/step - loss: 0.0032 - mae: 0.0349 - val_loss: 0.0038 - val_mae: 0.0385 - learning_rate: 0.0010
Epoch 7/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 128s 2s/step - loss: 0.0032 - mae: 0.0345 - val_loss: 0.0038 - val_mae: 0.0374 - learning_rate: 0.0010
Epoch 8/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 129s 2s/step - loss: 0.0032 - mae: 0.0346 - val_loss: 0.0037 - val_mae: 0.0380 - learning_rate: 0.0010
Epoch 9/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 126s 2s/step - loss: 0.0032 - mae: 0.0343 - val_loss: 0.0037 - val_mae: 0.0377 - learning_rate: 0.0010
Epoch 10/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 129s 2s/step - loss: 0.0032 - mae: 0.0342 - val_loss: 0.0037 - val_mae: 0.0386 - learning_rate: 0.0010
Epoch 11/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 132s 2s/step - loss: 0.0031 - mae: 0.0339 - val_loss: 0.0036 - val_mae: 0.0371 - learning_rate: 5.0000e-04
Epoch 12/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 127s 2s/step - loss: 0.0031 - mae: 0.0335 - val_loss: 0.0035 - val_mae: 0.0366 - learning_rate: 5.0000e-04
Epoch 13/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 128s 2s/step - loss: 0.0029 - mae: 0.0325 - val_loss: 0.0035 - val_mae: 0.0365 - learning_rate: 5.0000e-04
Epoch 14/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 131s 2s/step - loss: 0.0031 - mae: 0.0336 - val_loss: 0.0036 - val_mae: 0.0360 - learning_rate: 5.0000e-04
Epoch 15/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 129s 2s/step - loss: 0.0030 - mae: 0.0323 - val_loss: 0.0035 - val_mae: 0.0396 - learning_rate: 5.0000e-04
Epoch 16/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 126s 2s/step - loss: 0.0030 - mae: 0.0328 - val_loss: 0.0035 - val_mae: 0.0370 - learning_rate: 5.0000e-04
Epoch 17/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 131s 2s/step - loss: 0.0029 - mae: 0.0320 - val_loss: 0.0034 - val_mae: 0.0359 - learning_rate: 5.0000e-04
Epoch 18/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 130s 2s/step - loss: 0.0029 - mae: 0.0319 - val_loss: 0.0035 - val_mae: 0.0348 - learning_rate: 5.0000e-04
Epoch 19/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 132s 2s/step - loss: 0.0028 - mae: 0.0314 - val_loss: 0.0034 - val_mae: 0.0351 - learning_rate: 5.0000e-04
Epoch 20/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 134s 2s/step - loss: 0.0028 - mae: 0.0316 - val_loss: 0.0034 - val_mae: 0.0352 - learning_rate: 5.0000e-04
Epoch 21/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 133s 2s/step - loss: 0.0028 - mae: 0.0317 - val_loss: 0.0034 - val_mae: 0.0338 - learning_rate: 5.0000e-04
Epoch 22/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 129s 2s/step - loss: 0.0027 - mae: 0.0309 - val_loss: 0.0034 - val_mae: 0.0339 - learning_rate: 2.5000e-04
Epoch 23/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 130s 2s/step - loss: 0.0028 - mae: 0.0313 - val_loss: 0.0033 - val_mae: 0.0356 - learning_rate: 2.5000e-04
Epoch 24/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 128s 2s/step - loss: 0.0028 - mae: 0.0313 - val_loss: 0.0033 - val_mae: 0.0344 - learning_rate: 2.5000e-04
Epoch 25/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 128s 2s/step - loss: 0.0028 - mae: 0.0311 - val_loss: 0.0033 - val_mae: 0.0345 - learning_rate: 2.5000e-04
Epoch 26/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 132s 2s/step - loss: 0.0027 - mae: 0.0306 - val_loss: 0.0033 - val_mae: 0.0342 - learning_rate: 2.5000e-04
Epoch 27/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 129s 2s/step - loss: 0.0027 - mae: 0.0308 - val_loss: 0.0033 - val_mae: 0.0349 - learning_rate: 2.5000e-04
Epoch 28/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 127s 2s/step - loss: 0.0027 - mae: 0.0304 - val_loss: 0.0033 - val_mae: 0.0360 - learning_rate: 2.5000e-04
Epoch 29/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 129s 2s/step - loss: 0.0027 - mae: 0.0307 - val_loss: 0.0033 - val_mae: 0.0349 - learning_rate: 1.2500e-04
Epoch 30/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 132s 2s/step - loss: 0.0027 - mae: 0.0307 - val_loss: 0.0033 - val_mae: 0.0349 - learning_rate: 1.2500e-04
Epoch 31/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 130s 2s/step - loss: 0.0027 - mae: 0.0304 - val_loss: 0.0033 - val_mae: 0.0349 - learning_rate: 1.2500e-04
Epoch 32/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 133s 2s/step - loss: 0.0026 - mae: 0.0303 - val_loss: 0.0033 - val_mae: 0.0337 - learning_rate: 1.2500e-04
Epoch 33/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 132s 2s/step - loss: 0.0027 - mae: 0.0300 - val_loss: 0.0033 - val_mae: 0.0342 - learning_rate: 1.2500e-04
Epoch 34/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 133s 2s/step - loss: 0.0027 - mae: 0.0303 - val_loss: 0.0033 - val_mae: 0.0346 - learning_rate: 6.2500e-05
Epoch 35/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 134s 2s/step - loss: 0.0027 - mae: 0.0302 - val_loss: 0.0033 - val_mae: 0.0343 - learning_rate: 6.2500e-05
Epoch 36/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 133s 2s/step - loss: 0.0026 - mae: 0.0301 - val_loss: 0.0033 - val_mae: 0.0350 - learning_rate: 6.2500e-05
Epoch 37/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 132s 2s/step - loss: 0.0027 - mae: 0.0305 - val_loss: 0.0032 - val_mae: 0.0345 - learning_rate: 6.2500e-05
Epoch 38/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 130s 2s/step - loss: 0.0026 - mae: 0.0301 - val_loss: 0.0032 - val_mae: 0.0352 - learning_rate: 6.2500e-05
Epoch 39/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 133s 2s/step - loss: 0.0026 - mae: 0.0301 - val_loss: 0.0032 - val_mae: 0.0348 - learning_rate: 3.1250e-05
Epoch 40/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 130s 2s/step - loss: 0.0026 - mae: 0.0302 - val_loss: 0.0032 - val_mae: 0.0348 - learning_rate: 3.1250e-05
Epoch 41/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 134s 2s/step - loss: 0.0027 - mae: 0.0303 - val_loss: 0.0032 - val_mae: 0.0347 - learning_rate: 3.1250e-05
Epoch 42/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 130s 2s/step - loss: 0.0026 - mae: 0.0298 - val_loss: 0.0032 - val_mae: 0.0350 - learning_rate: 3.1250e-05
Epoch 43/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 130s 2s/step - loss: 0.0026 - mae: 0.0300 - val_loss: 0.0032 - val_mae: 0.0360 - learning_rate: 3.1250e-05
Epoch 44/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 131s 2s/step - loss: 0.0026 - mae: 0.0302 - val_loss: 0.0032 - val_mae: 0.0351 - learning_rate: 1.5625e-05
Epoch 45/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 129s 2s/step - loss: 0.0027 - mae: 0.0302 - val_loss: 0.0032 - val_mae: 0.0343 - learning_rate: 1.5625e-05
Epoch 46/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 131s 2s/step - loss: 0.0027 - mae: 0.0305 - val_loss: 0.0032 - val_mae: 0.0350 - learning_rate: 1.5625e-05
Epoch 47/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 131s 2s/step - loss: 0.0027 - mae: 0.0301 - val_loss: 0.0032 - val_mae: 0.0343 - learning_rate: 1.5625e-05
Epoch 48/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 134s 2s/step - loss: 0.0026 - mae: 0.0298 - val_loss: 0.0032 - val_mae: 0.0349 - learning_rate: 1.5625e-05
Epoch 49/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 133s 2s/step - loss: 0.0026 - mae: 0.0301 - val_loss: 0.0032 - val_mae: 0.0347 - learning_rate: 7.8125e-06
Epoch 50/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 134s 2s/step - loss: 0.0026 - mae: 0.0299 - val_loss: 0.0032 - val_mae: 0.0346 - learning_rate: 7.8125e-06
Epoch 51/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 133s 2s/step - loss: 0.0026 - mae: 0.0299 - val_loss: 0.0032 - val_mae: 0.0345 - learning_rate: 7.8125e-06
Epoch 52/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 133s 2s/step - loss: 0.0026 - mae: 0.0299 - val_loss: 0.0032 - val_mae: 0.0347 - learning_rate: 7.8125e-06
Epoch 53/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 130s 2s/step - loss: 0.0027 - mae: 0.0301 - val_loss: 0.0032 - val_mae: 0.0346 - learning_rate: 7.8125e-06
Epoch 54/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 129s 2s/step - loss: 0.0026 - mae: 0.0301 - val_loss: 0.0032 - val_mae: 0.0346 - learning_rate: 3.9063e-06
Epoch 55/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 131s 2s/step - loss: 0.0026 - mae: 0.0299 - val_loss: 0.0032 - val_mae: 0.0346 - learning_rate: 3.9063e-06
Epoch 56/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 132s 2s/step - loss: 0.0026 - mae: 0.0300 - val_loss: 0.0032 - val_mae: 0.0345 - learning_rate: 3.9063e-06
Epoch 57/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 132s 2s/step - loss: 0.0026 - mae: 0.0300 - val_loss: 0.0032 - val_mae: 0.0343 - learning_rate: 3.9063e-06
Epoch 58/100
65/65 ━━━━━━━━━━━━━━━━━━━━ 129s 2s/step - loss: 0.0026 - mae: 0.0300 - val_loss: 0.0032 - val_mae: 0.0347 - learning_rate: 3.9063e-06
162/162 ━━━━━━━━━━━━━━━━━━━━ 14s 83ms/step - loss: 0.0074 - mae: 0.0365 
Test Loss: 0.0044
Test MAE: 0.0287
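
To summarize the out-of-sample results: the transformer-based model with an LSTM encoder-decoder reaches a test MAE of 0.0289, the plain two-layer LSTM 0.0298, the Flatten variant 0.0304, and the TKAN-based model 0.0287, the best of the four, although each of its epochs takes nearly an order of magnitude longer than an LSTM epoch.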