Yield curve prediction
The goal of this case study is to use supervised learning-based models to predict the yield curve. This case study is inspired by the paper “Artificial Neural Networks in Fixed Income Markets for Yield Curve Forecasting” by Nunes, Gerding, McGroarty and Niranj
Content
# 1. Problem Definition
In the supervised regression framework used for this case study, three tenors (i.e. 1M, 5Y and 30Y) of the yield curve are the predicted variable. These tenors represent short term, medium term and long-term tenors of the yield curve.
Features
In order to make predictions, we use the following features:
1. Previous Changes in the Treasury Curve at the following tenors: 1 Month, 3 Month, 1 Year, 2 Year, 5 Year, 7 Year, 10 Year, 30 Year
2. Changes in % of Federal Debt held by -
a. Public,
b. Foreign Goverments
c. Federal Reserve
3. The Coporate Spread on Baa rated Debt Relative to the 10 Year
# 2. Getting Started- Loading the data and python packages
## 2.1. Loading the python packages
Feature Variables
# Load libraries
import numpy as np
import pandas as pd
import pandas_datareader.data as web
from matplotlib import pyplot
from pandas.plotting import scatter_matrix
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.neural_network import MLPRegressor
#Libraries for Deep Learning Models
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.layers import LSTM
from keras.wrappers.scikit_learn import KerasRegressor
#Libraries for Statistical Models
import statsmodels.api as sm
#Libraries for Saving the Model
from pickle import dump
from pickle import load
# Time series Models
from statsmodels.tsa.arima_model import ARIMA
#from statsmodels.tsa.statespace.sarimax import SARIMAX
# Error Metrics
from sklearn.metrics import mean_squared_error
# Feature Selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_regression
#Plotting
from pandas.plotting import scatter_matrix
from statsmodels.graphics.tsaplots import plot_acf
#Diable the warnings
import warnings
warnings.filterwarnings('ignore')
## 2.2. Loading the Data
# Get the data by webscapping using pandas datareader
tsy_tickers = ['DGS1MO', 'DGS3MO', 'DGS1', 'DGS2', 'DGS5', 'DGS7', 'DGS10', 'DGS30',
'TREAST', # -- U.S. Treasury securities held by the Federal Reserve ( Millions of Dollars )
'FYGFDPUN', # -- Federal Debt Held by the Public ( Millions of Dollars )
'FDHBFIN', # -- Federal Debt Held by Foreign and International Investors ( Billions of Dollars )
'GFDEBTN', # -- Federal Debt: Total Public Debt ( Millions of Dollars )
'BAA10Y', # -- Baa Corporate Bond Yield Relative to Yield on 10-Year
]
tsy_data = web.DataReader(tsy_tickers, 'fred').dropna(how='all').ffill()
tsy_data['FDHBFIN'] = tsy_data['FDHBFIN'] * 1000
tsy_data['GOV_PCT'] = tsy_data['TREAST'] / tsy_data['GFDEBTN']
tsy_data['HOM_PCT'] = tsy_data['FYGFDPUN'] / tsy_data['GFDEBTN']
tsy_data['FOR_PCT'] = tsy_data['FDHBFIN'] / tsy_data['GFDEBTN']
return_period = 5
#Y = tsy_data.loc[:, ['DGS1MO', 'DGS5', 'DGS30']].diff(return_period).shift(-return_period)
#return_period = 5
Y = tsy_data.loc[:, ['DGS1MO', 'DGS5', 'DGS30']].shift(-return_period)
Y.columns = [col+'_pred' for col in Y.columns]
#X = tsy_data.loc[:, ['DGS1MO', 'DGS3MO', 'DGS1', 'DGS2', 'DGS5', 'DGS7', 'DGS10', 'DGS30', 'GOV_PCT', 'HOM_PCT', 'FOR_PCT', 'BAA10Y']].diff(return_period)
X = tsy_data.loc[:, ['DGS1MO', 'DGS3MO', 'DGS1', 'DGS2', 'DGS5', 'DGS7', 'DGS10', 'DGS30', 'GOV_PCT', 'HOM_PCT', 'FOR_PCT', 'BAA10Y']]
dataset = pd.concat([Y, X], axis=1).dropna().iloc[::return_period, :]
Y = dataset.loc[:, Y.columns]
X = dataset.loc[:, X.columns]
dataset.head()
| DGS1MO_pred | DGS5_pred | DGS30_pred | DGS1MO | DGS3MO | DGS1 | DGS2 | DGS5 | DGS7 | DGS10 | DGS30 | GOV_PCT | HOM_PCT | FOR_PCT | BAA10Y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DATE | |||||||||||||||
| 2010-01-06 | 0.02 | 2.55 | 4.71 | 0.03 | 0.06 | 0.40 | 1.01 | 2.60 | 3.33 | 3.85 | 4.70 | 0.061 | 0.649 | 0.304 | 2.49 |
| 2010-01-13 | 0.02 | 2.38 | 4.50 | 0.02 | 0.06 | 0.37 | 0.97 | 2.55 | 3.28 | 3.80 | 4.71 | 0.061 | 0.649 | 0.304 | 2.50 |
| 2010-01-21 | 0.01 | 2.41 | 4.57 | 0.02 | 0.06 | 0.31 | 0.87 | 2.38 | 3.09 | 3.62 | 4.50 | 0.061 | 0.649 | 0.304 | 2.51 |
| 2010-01-28 | 0.04 | 2.29 | 4.53 | 0.01 | 0.08 | 0.31 | 0.87 | 2.41 | 3.15 | 3.68 | 4.57 | 0.061 | 0.649 | 0.304 | 2.57 |
| 2010-02-04 | 0.05 | 2.39 | 4.69 | 0.04 | 0.09 | 0.32 | 0.80 | 2.29 | 3.06 | 3.62 | 4.53 | 0.061 | 0.649 | 0.304 | 2.62 |
dataset
| DGS1MO_pred | DGS5_pred | DGS30_pred | DGS1MO | DGS3MO | DGS1 | DGS2 | DGS5 | DGS7 | DGS10 | DGS30 | GOV_PCT | HOM_PCT | FOR_PCT | BAA10Y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DATE | |||||||||||||||
| 2010-01-06 | 0.02 | 2.55 | 4.71 | 0.03 | 0.06 | 0.40 | 1.01 | 2.60 | 3.33 | 3.85 | 4.70 | 0.061 | 0.649 | 0.304 | 2.49 |
| 2010-01-13 | 0.02 | 2.38 | 4.50 | 0.02 | 0.06 | 0.37 | 0.97 | 2.55 | 3.28 | 3.80 | 4.71 | 0.061 | 0.649 | 0.304 | 2.50 |
| 2010-01-21 | 0.01 | 2.41 | 4.57 | 0.02 | 0.06 | 0.31 | 0.87 | 2.38 | 3.09 | 3.62 | 4.50 | 0.061 | 0.649 | 0.304 | 2.51 |
| 2010-01-28 | 0.04 | 2.29 | 4.53 | 0.01 | 0.08 | 0.31 | 0.87 | 2.41 | 3.15 | 3.68 | 4.57 | 0.061 | 0.649 | 0.304 | 2.57 |
| 2010-02-04 | 0.05 | 2.39 | 4.69 | 0.04 | 0.09 | 0.32 | 0.80 | 2.29 | 3.06 | 3.62 | 4.53 | 0.061 | 0.649 | 0.304 | 2.62 |
| 2010-02-11 | 0.06 | 2.48 | 4.71 | 0.05 | 0.11 | 0.38 | 0.91 | 2.39 | 3.15 | 3.73 | 4.69 | 0.061 | 0.649 | 0.304 | 2.70 |
| 2010-02-19 | 0.09 | 2.30 | 4.55 | 0.06 | 0.11 | 0.39 | 0.95 | 2.48 | 3.24 | 3.78 | 4.71 | 0.061 | 0.649 | 0.304 | 2.65 |
| 2010-02-26 | 0.11 | 2.35 | 4.64 | 0.09 | 0.13 | 0.32 | 0.81 | 2.30 | 3.05 | 3.61 | 4.55 | 0.061 | 0.649 | 0.304 | 2.62 |
| 2010-03-05 | 0.10 | 2.42 | 4.62 | 0.11 | 0.15 | 0.38 | 0.91 | 2.35 | 3.10 | 3.69 | 4.64 | 0.061 | 0.649 | 0.304 | 2.61 |
| 2010-03-12 | 0.13 | 2.48 | 4.58 | 0.10 | 0.15 | 0.41 | 0.97 | 2.42 | 3.15 | 3.71 | 4.62 | 0.061 | 0.649 | 0.304 | 2.55 |
| 2010-03-19 | 0.11 | 2.59 | 4.75 | 0.13 | 0.16 | 0.42 | 1.02 | 2.48 | 3.16 | 3.70 | 4.58 | 0.061 | 0.649 | 0.304 | 2.49 |
| 2010-03-26 | 0.15 | 2.67 | 4.81 | 0.11 | 0.14 | 0.43 | 1.04 | 2.59 | 3.31 | 3.86 | 4.75 | 0.061 | 0.649 | 0.304 | 2.49 |
| 2010-04-02 | 0.16 | 2.65 | 4.74 | 0.15 | 0.16 | 0.46 | 1.11 | 2.67 | 3.40 | 3.96 | 4.81 | 0.059 | 0.654 | 0.308 | 2.44 |
| 2010-04-09 | 0.15 | 2.49 | 4.67 | 0.16 | 0.16 | 0.46 | 1.08 | 2.65 | 3.34 | 3.90 | 4.74 | 0.059 | 0.654 | 0.308 | 2.43 |
| 2010-04-16 | 0.14 | 2.61 | 4.67 | 0.15 | 0.16 | 0.41 | 0.98 | 2.49 | 3.20 | 3.79 | 4.67 | 0.059 | 0.654 | 0.308 | 2.43 |
| 2010-04-23 | 0.14 | 2.43 | 4.53 | 0.14 | 0.16 | 0.46 | 1.10 | 2.61 | 3.30 | 3.84 | 4.67 | 0.059 | 0.654 | 0.308 | 2.37 |
| 2010-04-30 | 0.08 | 2.17 | 4.28 | 0.14 | 0.16 | 0.41 | 0.97 | 2.43 | 3.12 | 3.69 | 4.53 | 0.059 | 0.654 | 0.308 | 2.38 |
| 2010-05-07 | 0.15 | 2.16 | 4.32 | 0.08 | 0.13 | 0.38 | 0.83 | 2.17 | 2.87 | 3.45 | 4.28 | 0.059 | 0.654 | 0.308 | 2.55 |
| 2010-05-14 | 0.17 | 2.02 | 4.07 | 0.15 | 0.16 | 0.34 | 0.79 | 2.16 | 2.87 | 3.44 | 4.32 | 0.059 | 0.654 | 0.308 | 2.59 |
| 2010-05-21 | 0.15 | 2.10 | 4.22 | 0.17 | 0.17 | 0.35 | 0.76 | 2.02 | 2.67 | 3.20 | 4.07 | 0.059 | 0.654 | 0.308 | 2.76 |
| 2010-05-28 | 0.10 | 1.95 | 4.11 | 0.15 | 0.16 | 0.34 | 0.76 | 2.10 | 2.75 | 3.31 | 4.22 | 0.059 | 0.654 | 0.308 | 2.89 |
| 2010-06-07 | 0.02 | 2.07 | 4.20 | 0.10 | 0.12 | 0.35 | 0.74 | 1.95 | 2.62 | 3.17 | 4.11 | 0.059 | 0.654 | 0.308 | 3.00 |
| 2010-06-14 | 0.05 | 2.05 | 4.17 | 0.02 | 0.07 | 0.31 | 0.77 | 2.07 | 2.73 | 3.28 | 4.20 | 0.059 | 0.654 | 0.308 | 3.06 |
| 2010-06-21 | 0.07 | 1.83 | 4.01 | 0.05 | 0.12 | 0.29 | 0.74 | 2.05 | 2.72 | 3.26 | 4.17 | 0.059 | 0.654 | 0.308 | 3.02 |
| 2010-06-28 | 0.17 | 1.76 | 3.89 | 0.07 | 0.17 | 0.30 | 0.62 | 1.83 | 2.49 | 3.05 | 4.01 | 0.059 | 0.654 | 0.308 | 3.10 |
| 2010-07-06 | 0.16 | 1.90 | 4.10 | 0.17 | 0.17 | 0.32 | 0.62 | 1.76 | 2.40 | 2.95 | 3.89 | 0.057 | 0.666 | 0.319 | 3.05 |
| 2010-07-13 | 0.16 | 1.71 | 3.99 | 0.16 | 0.15 | 0.30 | 0.67 | 1.90 | 2.57 | 3.15 | 4.10 | 0.057 | 0.666 | 0.319 | 3.03 |
| 2010-07-20 | 0.16 | 1.82 | 4.08 | 0.16 | 0.16 | 0.27 | 0.61 | 1.71 | 2.39 | 2.98 | 3.99 | 0.057 | 0.666 | 0.319 | 2.99 |
| 2010-07-27 | 0.15 | 1.55 | 4.04 | 0.16 | 0.15 | 0.30 | 0.65 | 1.82 | 2.51 | 3.08 | 4.08 | 0.057 | 0.666 | 0.319 | 2.90 |
| 2010-08-03 | 0.15 | 1.46 | 4.00 | 0.15 | 0.16 | 0.27 | 0.53 | 1.55 | 2.27 | 2.94 | 4.04 | 0.057 | 0.666 | 0.319 | 2.92 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2019-05-31 | 2.30 | 1.85 | 2.57 | 2.35 | 2.35 | 2.21 | 1.95 | 1.93 | 2.03 | 2.14 | 2.58 | 0.096 | 0.736 | 0.302 | 2.37 |
| 2019-06-07 | 2.23 | 1.85 | 2.59 | 2.30 | 2.28 | 1.97 | 1.85 | 1.85 | 1.97 | 2.09 | 2.57 | 0.096 | 0.736 | 0.302 | 2.42 |
| 2019-06-14 | 2.16 | 1.80 | 2.59 | 2.23 | 2.20 | 2.00 | 1.84 | 1.85 | 1.96 | 2.09 | 2.59 | 0.096 | 0.736 | 0.302 | 2.42 |
| 2019-06-21 | 2.18 | 1.76 | 2.52 | 2.16 | 2.11 | 1.95 | 1.77 | 1.80 | 1.93 | 2.07 | 2.59 | 0.096 | 0.736 | 0.302 | 2.35 |
| 2019-06-28 | 2.23 | 1.86 | 2.53 | 2.18 | 2.12 | 1.92 | 1.75 | 1.76 | 1.87 | 2.00 | 2.52 | 0.096 | 0.736 | 0.302 | 2.31 |
| 2019-07-08 | 2.17 | 1.84 | 2.61 | 2.23 | 2.26 | 1.99 | 1.88 | 1.86 | 1.94 | 2.05 | 2.53 | 0.092 | 0.741 | 0.298 | 2.23 |
| 2019-07-15 | 2.13 | 1.80 | 2.58 | 2.17 | 2.16 | 1.95 | 1.83 | 1.84 | 1.96 | 2.09 | 2.61 | 0.092 | 0.741 | 0.298 | 2.25 |
| 2019-07-22 | 2.12 | 1.84 | 2.59 | 2.13 | 2.09 | 1.95 | 1.80 | 1.80 | 1.92 | 2.05 | 2.58 | 0.092 | 0.741 | 0.298 | 2.24 |
| 2019-07-29 | 2.07 | 1.55 | 2.30 | 2.12 | 2.12 | 1.98 | 1.85 | 1.84 | 1.93 | 2.06 | 2.59 | 0.092 | 0.741 | 0.298 | 2.13 |
| 2019-08-05 | 2.09 | 1.49 | 2.14 | 2.07 | 2.05 | 1.78 | 1.59 | 1.55 | 1.63 | 1.75 | 2.30 | 0.092 | 0.741 | 0.298 | 2.24 |
| 2019-08-12 | 2.06 | 1.47 | 2.08 | 2.09 | 2.00 | 1.75 | 1.58 | 1.49 | 1.56 | 1.65 | 2.14 | 0.092 | 0.741 | 0.298 | 2.23 |
| 2019-08-19 | 2.09 | 1.43 | 2.04 | 2.06 | 1.94 | 1.75 | 1.53 | 1.47 | 1.54 | 1.60 | 2.08 | 0.092 | 0.741 | 0.298 | 2.29 |
| 2019-08-26 | 2.06 | 1.35 | 1.95 | 2.09 | 2.01 | 1.75 | 1.54 | 1.43 | 1.49 | 1.54 | 2.04 | 0.092 | 0.741 | 0.298 | 2.27 |
| 2019-09-03 | 2.04 | 1.58 | 2.19 | 2.06 | 1.98 | 1.72 | 1.47 | 1.35 | 1.42 | 1.47 | 1.95 | 0.092 | 0.741 | 0.298 | 2.28 |
| 2019-09-10 | 2.10 | 1.66 | 2.27 | 2.04 | 1.95 | 1.81 | 1.67 | 1.58 | 1.66 | 1.72 | 2.19 | 0.092 | 0.741 | 0.298 | 2.22 |
| 2019-09-17 | 1.90 | 1.52 | 2.09 | 2.10 | 1.99 | 1.87 | 1.72 | 1.66 | 1.75 | 1.81 | 2.27 | 0.092 | 0.741 | 0.298 | 2.21 |
| 2019-09-24 | 1.79 | 1.51 | 2.11 | 1.90 | 1.92 | 1.78 | 1.60 | 1.52 | 1.58 | 1.64 | 2.09 | 0.093 | 0.741 | 0.298 | 2.20 |
| 2019-10-01 | 1.69 | 1.36 | 2.04 | 1.79 | 1.82 | 1.73 | 1.56 | 1.51 | 1.59 | 1.65 | 2.11 | 0.093 | 0.741 | 0.298 | 2.22 |
| 2019-10-08 | 1.71 | 1.57 | 2.23 | 1.69 | 1.72 | 1.62 | 1.42 | 1.36 | 1.45 | 1.54 | 2.04 | 0.093 | 0.741 | 0.298 | 2.30 |
| 2019-10-16 | 1.74 | 1.58 | 2.25 | 1.71 | 1.66 | 1.59 | 1.58 | 1.57 | 1.65 | 1.75 | 2.23 | 0.093 | 0.741 | 0.298 | 2.22 |
| 2019-10-23 | 1.61 | 1.61 | 2.26 | 1.74 | 1.65 | 1.58 | 1.58 | 1.58 | 1.67 | 1.77 | 2.25 | 0.095 | 0.741 | 0.298 | 2.18 |
| 2019-10-30 | 1.55 | 1.63 | 2.30 | 1.61 | 1.62 | 1.59 | 1.61 | 1.61 | 1.69 | 1.78 | 2.26 | 0.096 | 0.741 | 0.298 | 2.17 |
| 2019-11-06 | 1.59 | 1.63 | 2.31 | 1.55 | 1.56 | 1.58 | 1.61 | 1.63 | 1.73 | 1.81 | 2.30 | 0.097 | 0.741 | 0.298 | 2.14 |
| 2019-11-14 | 1.57 | 1.62 | 2.24 | 1.59 | 1.57 | 1.55 | 1.58 | 1.63 | 1.73 | 1.82 | 2.31 | 0.097 | 0.741 | 0.298 | 2.13 |
| 2019-11-21 | 1.62 | 1.62 | 2.21 | 1.57 | 1.58 | 1.55 | 1.60 | 1.62 | 1.71 | 1.77 | 2.24 | 0.098 | 0.741 | 0.298 | 2.15 |
| 2019-11-29 | 1.52 | 1.67 | 2.29 | 1.62 | 1.59 | 1.60 | 1.61 | 1.62 | 1.73 | 1.78 | 2.21 | 0.099 | 0.741 | 0.298 | 2.08 |
| 2019-12-06 | 1.55 | 1.66 | 2.26 | 1.52 | 1.53 | 1.57 | 1.61 | 1.67 | 1.78 | 1.84 | 2.29 | 0.099 | 0.741 | 0.298 | 2.07 |
| 2019-12-13 | 1.57 | 1.73 | 2.34 | 1.55 | 1.57 | 1.54 | 1.61 | 1.66 | 1.76 | 1.82 | 2.26 | 0.100 | 0.741 | 0.298 | 2.02 |
| 2019-12-20 | 1.56 | 1.68 | 2.32 | 1.57 | 1.58 | 1.52 | 1.63 | 1.73 | 1.84 | 1.92 | 2.34 | 0.101 | 0.741 | 0.298 | 1.96 |
| 2019-12-27 | 1.52 | 1.59 | 2.26 | 1.56 | 1.57 | 1.51 | 1.59 | 1.68 | 1.80 | 1.88 | 2.32 | 0.103 | 0.741 | 0.298 | 1.96 |
505 rows × 15 columns
dataset.shape
(505, 15)
pd.set_option('precision', 3)
dataset.describe()
| DGS1MO_pred | DGS5_pred | DGS30_pred | DGS1MO | DGS3MO | DGS1 | DGS2 | DGS5 | DGS7 | DGS10 | DGS30 | GOV_PCT | HOM_PCT | FOR_PCT | BAA10Y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 505.000 | 505.000 | 505.000 | 505.000 | 505.000 | 505.000 | 505.000 | 505.000 | 505.000 | 505.000 | 505.000 | 505.000 | 505.000 | 505.000 | 505.000 |
| mean | 0.544 | 1.644 | 3.180 | 0.541 | 0.578 | 0.746 | 0.961 | 1.646 | 2.041 | 2.400 | 3.185 | 0.110 | 0.711 | 0.320 | 2.588 |
| std | 0.780 | 0.593 | 0.612 | 0.779 | 0.797 | 0.810 | 0.750 | 0.595 | 0.558 | 0.553 | 0.614 | 0.022 | 0.023 | 0.018 | 0.451 |
| min | 0.000 | 0.570 | 1.950 | 0.000 | 0.000 | 0.090 | 0.180 | 0.570 | 0.930 | 1.400 | 1.950 | 0.057 | 0.649 | 0.285 | 1.580 |
| 25% | 0.030 | 1.250 | 2.800 | 0.030 | 0.050 | 0.160 | 0.370 | 1.250 | 1.600 | 1.960 | 2.810 | 0.101 | 0.702 | 0.304 | 2.240 |
| 50% | 0.120 | 1.620 | 3.040 | 0.110 | 0.130 | 0.300 | 0.680 | 1.620 | 2.050 | 2.340 | 3.050 | 0.113 | 0.720 | 0.324 | 2.600 |
| 75% | 0.950 | 1.960 | 3.450 | 0.850 | 1.000 | 1.220 | 1.330 | 1.970 | 2.370 | 2.760 | 3.450 | 0.126 | 0.724 | 0.338 | 2.900 |
| max | 2.450 | 3.070 | 4.810 | 2.450 | 2.480 | 2.740 | 2.960 | 3.070 | 3.400 | 3.960 | 4.810 | 0.137 | 0.741 | 0.341 | 3.600 |
Y.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x25e4c803b70>
# histograms
dataset.hist(sharex=False, sharey=False, xlabelsize=1, ylabelsize=1, figsize=(12,12))
pyplot.show()
# density
dataset.plot(kind='density', subplots=True, layout=(5,4), sharex=False, legend=True, fontsize=1, figsize=(15,15))
pyplot.show()
#Box and Whisker Plots
dataset.plot(kind='box', subplots=True, layout=(5,4), sharex=False, sharey=False, figsize=(15,15))
pyplot.show()
Next We look at the interaction between these variables.
# correlation
correlation = dataset.corr()
pyplot.figure(figsize=(15,15))
pyplot.title('Correlation Matrix')
sns.heatmap(correlation, vmax=1, square=True,annot=True,cmap='cubehelix')
<matplotlib.axes._subplots.AxesSubplot at 0x25e3971e358>
Form the correlation plot, we see that the 1 month and the 30 year yield data points are negatively autocorrelated. The 5 year yield also seems toe be negativly correlated with the changes in foreign goverment purchases.
# Scatterplot Matrix
pyplot.figure(figsize=(15,15))
scatter_matrix(dataset,figsize=(15,16))
pyplot.show()
<Figure size 1080x1080 with 0 Axes>
1 Month
temp_Y = dataset['DGS1MO_pred']
res = sm.tsa.seasonal_decompose(temp_Y,freq=52)
fig = res.plot()
fig.set_figheight(8)
fig.set_figwidth(15)
pyplot.show()
5 Year
temp_Y = dataset['DGS5_pred']
res = sm.tsa.seasonal_decompose(temp_Y,freq=52)
fig = res.plot()
fig.set_figheight(8)
fig.set_figwidth(15)
pyplot.show()
30 Year
temp_Y = dataset['DGS30_pred']
res = sm.tsa.seasonal_decompose(temp_Y,freq=52)
fig = res.plot()
fig.set_figheight(8)
fig.set_figwidth(15)
pyplot.show()
Around Q1 2018, we observe a trend decrease in the 1 Month, 5 year and 30 year. However, the trend is most pronounced in the 1 month series.
## 4. Data Preparation and analysis
## 4.1. Univariate Feature Selection
bestfeatures = SelectKBest(k=5, score_func=f_regression)
for col in Y.columns:
temp_Y = dataset[col]
temp_X = dataset.loc[:, X.columns]
fit = bestfeatures.fit(temp_X,temp_Y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score'] #naming the dataframe columns
print(col)
print(featureScores.nlargest(10,'Score')) #print 10 best features
print('--------------')
DGS1MO_pred
Specs Score
0 DGS1MO 152945.490
1 DGS3MO 100807.006
2 DGS1 11168.249
3 DGS2 3510.465
10 FOR_PCT 1010.264
11 BAA10Y 361.423
4 DGS5 342.936
9 HOM_PCT 243.490
5 DGS7 85.197
7 DGS30 59.793
--------------
DGS5_pred
Specs Score
4 DGS5 16564.639
5 DGS7 2720.883
3 DGS2 970.840
10 FOR_PCT 613.935
11 BAA10Y 586.571
6 DGS10 517.453
2 DGS1 505.191
1 DGS3MO 363.006
0 DGS1MO 326.866
7 DGS30 39.443
--------------
DGS30_pred
Specs Score
7 DGS30 17108.682
6 DGS10 1235.241
9 HOM_PCT 629.480
5 DGS7 252.017
8 GOV_PCT 183.786
0 DGS1MO 62.107
1 DGS3MO 60.412
2 DGS1 55.316
4 DGS5 40.429
3 DGS2 25.822
--------------
As expected, based on the univariate feature selection, all the time series are most dependent on the previous changes.
# 5. Evaluate Algorithms and Models
## 5.1. Train Test Split and evaluation metrics
# split out validation dataset for the end
validation_size = 0.2
#In case the data is not dependent on the time series, then train and test split randomly
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)
We use the prebuilt scikit models to run a K fold analysis on our training data. We then train the model on the full training data and use it for prediction of the test data. The parameters for the K fold analysis are defined as -
# test options for regression
num_folds = 10
scoring = 'neg_mean_squared_error'
## 5.2. Compare Models and Algorithms
# spot check the algorithms
models = []
models.append(('LR', LinearRegression()))
models.append(('LASSO', Lasso()))
models.append(('EN', ElasticNet()))
models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor()))
#Neural Network
models.append(('MLP', MLPRegressor()))
kfold_results = []
names = []
validation_results = []
train_results = []
for name, model in models:
kfold = KFold(n_splits=num_folds, random_state=seed)
#converted mean square error to positive. The lower the beter
cv_results = -1* cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
kfold_results.append(cv_results)
names.append(name)
# Finally we Train on the full period and test against validation
res = model.fit(X_train, Y_train)
validation_result = np.mean(np.square(res.predict(X_validation) - Y_validation))
validation_results.append(validation_result)
train_result = np.mean(np.square(res.predict(X_train) - Y_train))
train_results.append(train_result)
msg = "%s: \nAverage CV error: %s \nStd CV Error: (%s) \nTraining Error:\n%s \nTest Error:\n%s" % \
(name, str(cv_results.mean()), str(cv_results.std()), str(train_result), str(validation_result))
print(msg)
print('----------')
LR:
Average CV error: 0.007942891864184351
Std CV Error: (0.002627539557566172)
Training Error:
DGS1MO_pred 0.002
DGS5_pred 0.010
DGS30_pred 0.010
dtype: float64
Test Error:
DGS1MO_pred 0.001
DGS5_pred 0.009
DGS30_pred 0.010
dtype: float64
----------
LASSO:
Average CV error: 0.44035581436209686
Std CV Error: (0.05366688435468398)
Training Error:
DGS1MO_pred 0.574
DGS5_pred 0.352
DGS30_pred 0.388
dtype: float64
Test Error:
DGS1MO_pred 0.743
DGS5_pred 0.350
DGS30_pred 0.318
dtype: float64
----------
EN:
Average CV error: 0.4019832745740823
Std CV Error: (0.050762635401215755)
Training Error:
DGS1MO_pred 0.455
DGS5_pred 0.352
DGS30_pred 0.388
dtype: float64
Test Error:
DGS1MO_pred 0.592
DGS5_pred 0.350
DGS30_pred 0.318
dtype: float64
----------
KNN:
Average CV error: 0.009184607723577237
Std CV Error: (0.003118097916218511)
Training Error:
DGS1MO_pred 0.002
DGS5_pred 0.008
DGS30_pred 0.007
dtype: float64
Test Error:
DGS1MO_pred 0.002
DGS5_pred 0.011
DGS30_pred 0.011
dtype: float64
----------
CART:
Average CV error: 0.017137121951219508
Std CV Error: (0.003920838626872242)
Training Error:
DGS1MO_pred 0.0
DGS5_pred 0.0
DGS30_pred 0.0
dtype: float64
Test Error:
DGS1MO_pred 0.003
DGS5_pred 0.019
DGS30_pred 0.017
dtype: float64
----------
MLP:
Average CV error: 0.015686185139476356
Std CV Error: (0.004980051035156162)
Training Error:
DGS1MO_pred 0.003
DGS5_pred 0.011
DGS30_pred 0.025
dtype: float64
Test Error:
DGS1MO_pred 0.002
DGS5_pred 0.009
DGS30_pred 0.026
dtype: float64
----------
# compare algorithms
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(kfold_results)
ax.set_xticklabels(names)
fig.set_size_inches(15,8)
pyplot.show()
# compare algorithms
fig = pyplot.figure()
ind = np.arange(len(names)) # the x locations for the groups
width = 0.35 # the width of the bars
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.bar(ind - width/2, [x.mean() for x in train_results], width=width, label='Train Error')
pyplot.bar(ind + width/2, [x.mean() for x in validation_results], width=width, label='Validation Error')
fig.set_size_inches(15,8)
pyplot.legend()
ax.set_xticks(ind)
ax.set_xticklabels(names)
pyplot.show()
# 7. Grid search : MLPRegressor
'''
hidden_layer_sizes : tuple, length = n_layers - 2, default (100,)
The ith element represents the number of neurons in the ith
hidden layer.
'''
param_grid={'hidden_layer_sizes': [(20,), (50,), (20,20), (20, 30, 20)]}
model = MLPRegressor()
kfold = KFold(n_splits=num_folds, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(X_train, Y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
print("%f (%f) with: %r" % (mean, stdev, param))
Best: -0.018006 using {'hidden_layer_sizes': (20, 30, 20)}
-0.036433 (0.019326) with: {'hidden_layer_sizes': (20,)}
-0.020793 (0.007075) with: {'hidden_layer_sizes': (50,)}
-0.026638 (0.010154) with: {'hidden_layer_sizes': (20, 20)}
-0.018006 (0.005637) with: {'hidden_layer_sizes': (20, 30, 20)}
# 7. Finalise the Model
# prepare model
model = MLPRegressor(hidden_layer_sizes= (20, 30, 20))
model.fit(X_train, Y_train)
MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(20, 30, 20), learning_rate='constant',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
random_state=None, shuffle=True, solver='adam', tol=0.0001,
validation_fraction=0.1, verbose=False, warm_start=False)
## 7.1. Results and comparison of Regression and MLP
# estimate accuracy on validation set
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
predictions = model.predict(X_validation)
mse_MLP = mean_squared_error(Y_validation, predictions)
r2_MLP = r2_score(Y_validation, predictions)
# prepare model
model_2 = LinearRegression()
model_2.fit(X_train, Y_train)
predictions_2 = model_2.predict(X_validation)
mse_OLS = mean_squared_error(Y_validation, predictions_2)
r2_OLS = r2_score(Y_validation, predictions_2)
print("MSE Regression = %f, MSE MLP = %f" % (mse_OLS, mse_MLP ))
print("R2 Regression = %f, R2 MLP = %f" % (r2_OLS, r2_MLP ))
MSE Regression = 0.006727, MSE MLP = 0.015661
R2 Regression = 0.979949, R2 MLP = 0.954716
The statistics of MLP and Linear regression are comparable. Let us check the prediction shape on the validation set.
Predictions - 5 Year - MLP
train_size = int(len(X) * (1-validation_size))
X_train, X_validation = X[0:train_size], X[train_size:len(X)]
Y_train, Y_validation = Y[0:train_size], Y[train_size:len(X)]
modelMLP = MLPRegressor(hidden_layer_sizes= (50,))
modelOLS = LinearRegression()
model_MLP = modelMLP.fit(X_train, Y_train)
model_OLS = modelOLS.fit(X_train, Y_train)
Y_predMLP = pd.DataFrame(model_MLP.predict(X_validation), index=Y_validation.index,
columns=Y_validation.columns)
Y_predOLS = pd.DataFrame(model_OLS.predict(X_validation), index=Y_validation.index,
columns=Y_validation.columns)
pd.DataFrame({'Actual : 1m': Y_validation.loc[:, 'DGS1MO_pred'],
'Prediction MLP 1m': Y_predMLP.loc[:, 'DGS1MO_pred'],
'Prediction OLS 1m': Y_predOLS.loc[:, 'DGS1MO_pred']}).plot(figsize=(10,5))
pd.DataFrame({'Actual : 5yr': Y_validation.loc[:, 'DGS5_pred'],
'Prediction MLP 5yr': Y_predMLP.loc[:, 'DGS5_pred'],
'Prediction OLS 5yr': Y_predOLS.loc[:, 'DGS5_pred']}).plot(figsize=(10,5))
pd.DataFrame({'Actual : 30yr': Y_validation.loc[:, 'DGS30_pred'],
'Prediction MLP 30yr': Y_predMLP.loc[:, 'DGS30_pred'],
'Prediction OLS 30yr': Y_predOLS.loc[:, 'DGS30_pred']}).plot(figsize=(10,5))
<matplotlib.axes._subplots.AxesSubplot at 0x25e44811860>
Overall, the regression and MLP are comparable, however, for 1m tenor, the fitting with MLP is slighly poor as compared to the regression. However,the multitask learning with neural network is more intuitive for modeling many time series simultaneousl
Summary
The linear regression model, despite its simplicity, is a tough benchmark to beat for such one step ahead forecasting, given the dominant characteristic of the last available value of the variable to predict. The ANN results in this case study are comparable to the linear regression models.
The good thing about ANN is that it is more flexible to changing market conditions. Also, ANN models can be enhanced by performing grid search on several other hyperparameters and using recurrent neural network such as LSTM.