Portfolio Management-Eigen Portfolio

In this case study we use dimensionality reduction techniques for portfolio management and allocation.

Content

1. Problem Definition
2. Getting Started - Load Libraries and Dataset
- 2.1. Load Libraries
- 2.2. Load Dataset
3. Exploratory Data Analysis
- 3.1 Descriptive Statistics
- 3.2. Data Visualisation
4. Data Preparation
- 4.1 Data Cleaning
- 4.2. Data Transformation
5. Evaluate Algorithms and Models
- 5.1. Train Test Split
- 5.2. Model Evaluation - Applying Principal Component Analysis

# 1. Problem Definition

Our goal in this case study is to maximize risk-adjusted returns by using a dimensionality-reduction-based algorithm on a dataset of stocks to allocate capital into different asset classes.

The dataset used for this case study is the Dow Jones Industrial Average (DJIA) index and its respective 30 stocks from the year 2000 onwards. The dataset can be downloaded from Yahoo Finance.

# 2. Getting Started - Loading the data and Python packages

## 2.1. Loading the Python packages

   # Load libraries
   import numpy as np
   import pandas as pd
   import matplotlib.pyplot as plt
   from pandas import read_csv, set_option
   from pandas.plotting import scatter_matrix
   import seaborn as sns
   from sklearn.preprocessing import StandardScaler

   #Import Model Packages
   from sklearn.decomposition import PCA
   from sklearn.decomposition import TruncatedSVD
   from numpy.linalg import inv, eig, svd

   from sklearn.manifold import TSNE
   from sklearn.decomposition import KernelPCA

## 2.2. Loading the Data

# load dataset
dataset = read_csv('Dow_adjcloses.csv',index_col=0)

# Disable the warnings
import warnings
warnings.filterwarnings('ignore')

type(dataset)

   pandas.core.frame.DataFrame

# 3. Exploratory Data Analysis

## 3.1. Descriptive Statistics

# shape
dataset.shape

(4804, 30)

# peek at data
set_option('display.width', 100)
dataset.head(5)

	MMM	AXP	AAPL	BA	CAT	CVX	CSCO	KO	DIS	DWDP	...	NKE	PFE	PG	TRV	UTX	UNH	VZ	V	WMT	WBA
Date
2000-01-03	29.847043	35.476634	3.530576	26.650218	14.560887	21.582046	43.003876	16.983583	23.522220	NaN	...	4.701180	16.746856	32.227726	20.158885	21.319030	5.841355	22.564221	NaN	47.337599	21.713237
2000-01-04	28.661131	34.134275	3.232839	26.610431	14.372251	21.582046	40.577200	17.040950	24.899860	NaN	...	4.445214	16.121738	31.596399	19.890099	20.445803	5.766368	21.833915	NaN	45.566248	20.907354
2000-01-05	30.122175	33.959430	3.280149	28.473758	14.914205	22.049145	40.895453	17.228147	25.781550	NaN	...	4.702157	16.415912	31.325831	20.085579	20.254784	5.753327	22.564221	NaN	44.503437	21.097421
2000-01-06	31.877325	33.959430	2.996290	28.553331	15.459153	22.903343	39.781569	17.210031	24.899860	NaN	...	4.677733	16.972739	32.438168	20.122232	20.998392	5.964159	22.449405	NaN	45.126952	20.527220
2000-01-07	32.509812	34.433913	3.138219	29.382213	15.962182	23.305926	42.128682	18.342270	24.506249	NaN	...	4.677733	18.123166	35.023602	20.922479	21.830687	6.662948	22.282692	NaN	48.535033	21.051805

5 rows × 30 columns

# types
set_option('display.max_rows', 500)
dataset.dtypes

MMM     float64
AXP     float64
AAPL    float64
BA      float64
CAT     float64
CVX     float64
CSCO    float64
KO      float64
DIS     float64
DWDP    float64
XOM     float64
GS      float64
HD      float64
IBM     float64
INTC    float64
JNJ     float64
JPM     float64
MCD     float64
MRK     float64
MSFT    float64
NKE     float64
PFE     float64
PG      float64
TRV     float64
UTX     float64
UNH     float64
VZ      float64
V       float64
WMT     float64
WBA     float64
dtype: object

# describe data
set_option('display.precision', 3)
dataset.describe()

	MMM	AXP	AAPL	BA	CAT	CVX	CSCO	KO	DIS	DWDP	...	NKE	PFE	PG	TRV	UTX	UNH	VZ	V	WMT	WBA
count	4804.000	4804.000	4804.000	4804.000	4804.000	4804.000	4804.000	4804.000	4804.000	363.000	...	4804.000	4804.000	4804.000	4804.000	4804.000	4804.000	4804.000	2741.000	4804.000	4804.000
mean	86.769	49.659	49.107	85.482	56.697	61.735	21.653	24.984	46.368	64.897	...	23.724	20.737	49.960	55.961	62.209	64.418	27.193	53.323	50.767	41.697
std	53.942	22.564	55.020	79.085	34.663	31.714	10.074	10.611	32.733	5.768	...	20.988	7.630	19.769	34.644	32.627	62.920	11.973	37.647	17.040	19.937
min	25.140	8.713	0.828	17.463	9.247	17.566	6.842	11.699	11.018	49.090	...	2.595	8.041	16.204	13.287	14.521	5.175	11.210	9.846	30.748	17.317
25%	51.192	34.079	3.900	37.407	26.335	31.820	14.910	15.420	22.044	62.250	...	8.037	15.031	35.414	29.907	34.328	23.498	17.434	18.959	38.062	27.704
50%	63.514	42.274	23.316	58.437	53.048	56.942	18.578	20.563	29.521	66.586	...	14.147	18.643	46.735	39.824	55.715	42.924	21.556	45.207	42.782	32.706
75%	122.906	66.816	84.007	112.996	76.488	91.688	24.650	34.927	75.833	69.143	...	36.545	25.403	68.135	80.767	92.557	73.171	38.996	76.966	65.076	58.165
max	251.981	112.421	231.260	411.110	166.832	128.680	63.698	50.400	117.973	75.261	...	85.300	45.841	98.030	146.564	141.280	286.330	60.016	150.525	107.010	90.188

8 rows × 30 columns

## 3.2. Data Visualization

We first take a look at the correlation between the stocks. A more detailed look at the data will be performed after implementing the dimensionality reduction models.

# correlation
correlation = dataset.corr()
plt.figure(figsize=(15,15))
plt.title('Correlation Matrix')
sns.heatmap(correlation, vmax=1, square=True,annot=True,cmap='cubehelix')

<matplotlib.axes._subplots.AxesSubplot at 0x1e1b1d9eeb8>

As the chart above shows, there is a significant positive correlation between the stocks.

# 4. Data Preparation

## 4.1. Data Cleaning

Let us check for the NAs in the rows, and either drop them or fill them with the mean of the column.

# Check whether the dataset contains any null values
print('Null Values =',dataset.isnull().values.any())

Null Values = True

Getting rid of the columns with more than 30% missing values.

missing_fractions = dataset.isnull().mean().sort_values(ascending=False)

missing_fractions.head(10)

drop_list = sorted(list(missing_fractions[missing_fractions > 0.3].index))

dataset.drop(labels=drop_list, axis=1, inplace=True)
dataset.shape

(4804, 28)
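
The column filter above can be illustrated on a small scale. The following is a minimal sketch on a made-up price table (the column names and values are hypothetical, not taken from the DJIA file): it computes the missing fraction per column, drops the columns above the 30% threshold, and forward-fills the remaining gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices: column "C" is mostly missing, "B" has one gap.
prices = pd.DataFrame({
    "A": [10.0, 10.5, 10.2, 10.8, 11.0],
    "B": [20.0, np.nan, 20.4, 20.1, 20.6],
    "C": [np.nan, np.nan, np.nan, 5.0, 5.1],
})

# Fraction of missing values per column (mean of a boolean mask).
missing_fractions = prices.isnull().mean()

# Columns above the 30% threshold are dropped; the rest are kept.
drop_list = missing_fractions[missing_fractions > 0.3].index.tolist()
cleaned = prices.drop(columns=drop_list)

# Remaining gaps are forward-filled, then rows that are still incomplete
# (for example leading NaNs) are dropped.
cleaned = cleaned.ffill().dropna(axis=0)

print(drop_list)
print(cleaned)
```

Forward-filling before dropping rows keeps as many trading days as possible, at the cost of repeating a stale price on the days that were missing.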

Given that there are null values, we fill them with the last available value and then drop the rows that still contain null values.

# Fill the missing values with the last value available in the dataset.
dataset = dataset.ffill()

# Drop the rows containing NA
dataset= dataset.dropna(axis=0)
# Fill na with 0
#dataset.fillna('0')

dataset.head(2)

	MMM	AXP	AAPL	BA	CAT	CVX	CSCO	KO	DIS	XOM	...	MSFT	NKE	PFE	PG	TRV	UTX	UNH	VZ	WMT	WBA
Date
2000-01-03	29.847	35.477	3.531	26.65	14.561	21.582	43.004	16.984	23.522	23.862	...	38.135	4.701	16.747	32.228	20.159	21.319	5.841	22.564	47.338	21.713
2000-01-04	28.661	34.134	3.233	26.61	14.372	21.582	40.577	17.041	24.900	23.405	...	36.846	4.445	16.122	31.596	19.890	20.446	5.766	21.834	45.566	20.907

2 rows × 28 columns

Computing Daily Return

   # Daily log returns (as fractions)
   # datareturns = np.log(dataset / dataset.shift(1))

   # Daily linear returns (as fractions)
   datareturns = dataset.pct_change(1)

   # Remove outliers beyond 3 standard deviations
   datareturns= datareturns[datareturns.apply(lambda x :(x-x.mean()).abs()<(3*x.std()) ).all(1)]
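
To make the outlier rule concrete, here is a minimal sketch on synthetic prices (the tickers `X` and `Y`, the random seed, and the injected 30% jump are all assumptions made for illustration). A row is kept only if every column lies within three standard deviations of its own mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical prices for two tickers following small random daily moves.
steps = rng.normal(0.0, 0.01, size=(250, 2))
steps[100, 0] = 0.30  # inject one extreme move in the first column
prices = pd.DataFrame(100 * np.cumprod(1 + steps, axis=0), columns=["X", "Y"])

# Linear (simple) returns: p_t / p_{t-1} - 1. The first row is NaN.
returns = prices.pct_change(1)

# Keep a row only if every column is within 3 standard deviations of its mean.
# Comparisons with NaN evaluate to False, so the first row is dropped as well.
within = returns.apply(lambda x: (x - x.mean()).abs() < 3 * x.std())
filtered = returns[within.all(axis=1)]

print(len(returns), len(filtered))
```

Note that the filter removes the whole trading day when any single stock has an extreme return, so the resulting index has gaps.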

## 4.2. Data Transformation

All the variables should be on the same scale before applying PCA, otherwise a feature with large values will dominate the result. Below we use StandardScaler in sklearn to standardize the dataset’s features onto unit scale (mean = 0 and variance = 1).

Standardization rescales each attribute to have a mean of 0 and a standard deviation of 1. It does not change the shape of the distribution, so the result follows a standard Normal distribution only if the attribute was Normal to begin with.

from sklearn.preprocessing import StandardScaler

# Drop any remaining NaN rows before fitting the scaler
datareturns.dropna(how='any', inplace=True)
scaler = StandardScaler().fit(datareturns)
rescaledDataset = pd.DataFrame(scaler.transform(datareturns), columns=datareturns.columns, index=datareturns.index)
# summarize transformed data
rescaledDataset.head(2)

	MMM	AXP	AAPL	BA	CAT	CVX	CSCO	KO	DIS	XOM	...	MSFT	NKE	PFE	PG	TRV	UTX	UNH	VZ	WMT	WBA
Date
2000-01-11	-1.713	0.566	-2.708	-1.133	-1.041	-0.787	-1.834	3.569	0.725	0.981	...	-1.936	3.667	-0.173	1.772	-0.936	-1.954	0.076	-0.836	-1.375	2.942
2000-01-20	-3.564	1.077	3.304	-1.670	-2.834	-0.446	0.022	0.987	-2.415	-1.897	...	-0.733	-1.816	-1.421	-2.742	-0.476	-1.916	1.654	0.241	-0.987	-0.036

2 rows × 28 columns
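
The scaling step amounts to a column-wise z-score. A minimal sketch with NumPy only, on synthetic returns (the two volatility levels are assumptions chosen to make the difference in scale visible):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical returns: two columns with very different volatilities.
returns = np.column_stack([
    rng.normal(0.001, 0.01, size=500),
    rng.normal(0.000, 0.05, size=500),
])

# z-score each column: subtract the column mean, divide by the column
# standard deviation (population form, ddof=0, as StandardScaler does).
mean = returns.mean(axis=0)
std = returns.std(axis=0, ddof=0)
standardized = (returns - mean) / std

print(standardized.mean(axis=0).round(6))
print(standardized.std(axis=0).round(6))
```

After this step both columns contribute on the same scale, so the more volatile one no longer dominates the covariance matrix that PCA decomposes.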

# Visualizing the standardized returns for AAPL
plt.figure(figsize=(16, 5))
plt.title("AAPL Return")
plt.ylabel("Return")
rescaledDataset.AAPL.plot()
plt.grid(True);
plt.legend()
plt.show()

# 5. Evaluate Algorithms and Models

## 5.1. Train Test Split

The portfolio is divided into training and test sets to perform the analysis regarding the best portfolio and the backtesting shown later.

   # Dividing the dataset into training and testing sets
   percentage = int(len(rescaledDataset) * 0.8)
   X_train = rescaledDataset[:percentage]
   X_test = rescaledDataset[percentage:]

   X_train_raw = datareturns[:percentage]
   X_test_raw = datareturns[percentage:]


   stock_tickers = rescaledDataset.columns.values
   n_tickers = len(stock_tickers)

## 5.2. Model Evaluation - Applying Principal Component Analysis

As a first step, we fit principal component analysis (PCA) from sklearn on the training set. We then plot an inverted elbow chart of the cumulative explained variance, which shows how many principal components are needed to reach a given variance threshold.

pca = PCA()
PrincipalComponent=pca.fit(X_train)

First Principal Component /Eigenvector

pca.components_[0]

   array([-0.2278224 , -0.22835766, -0.15302828, -0.18969933, -0.20200012,
          -0.17810558, -0.19508121, -0.16845303, -0.20820442, -0.19308548,
          -0.20879404, -0.20231768, -0.19939638, -0.19521427, -0.16686975,
          -0.22806024, -0.15153408, -0.169941  , -0.19367262, -0.17118841,
          -0.18993347, -0.16805969, -0.197612  , -0.22658993, -0.13821257,
          -0.16688803, -0.16897835, -0.16070821])
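
For intuition on where such an eigenvector comes from, the following sketch computes principal components directly from the covariance matrix with NumPy. It is an illustration on synthetic data driven by one common factor (all sizes and coefficients are assumptions), not a replacement for the sklearn call above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical standardized returns driven by one common "market" factor.
n_days, n_stocks = 1000, 5
market = rng.normal(size=(n_days, 1))
X = 0.7 * market + 0.7 * rng.normal(size=(n_days, n_stocks))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Sample covariance matrix of the columns (close to a correlation matrix here).
cov = np.cov(X, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
components = eigenvectors[:, order].T  # one principal component per row

first_pc = components[0]
print(first_pc)
```

The overall sign of an eigenvector is arbitrary, which is why a first component whose loadings are all negative, as in the output above, carries the same information as one whose loadings are all positive.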



## 5.2.1. Explained Variance using PCA

NumEigenvalues=10
fig, axes = plt.subplots(ncols=2, figsize=(14,4))
Series1 = pd.Series(pca.explained_variance_ratio_[:NumEigenvalues]).sort_values()*100
Series2 = pd.Series(pca.explained_variance_ratio_[:NumEigenvalues]).cumsum()*100
Series1.plot.barh(ylim=(0,9), title='Explained Variance Ratio by Top 10 factors',ax=axes[0]);
Series2.plot(ylim=(0,100),xlim=(0,9),ax=axes[1], title='Cumulative Explained Variance by factor');
# explained_variance
pd.Series(np.cumsum(pca.explained_variance_ratio_)).to_frame('Explained Variance').head(NumEigenvalues).style.format('{:,.2%}'.format)

	Explained Variance
0	37.03%
1	42.75%
2	47.10%
3	51.08%
4	54.60%
5	57.74%
6	60.65%
7	63.44%
8	66.18%
9	68.71%

We find that the most important factor explains around 37% of the daily return variation. The dominant factor is usually interpreted as ‘the market’, depending on the results of closer inspection.

The plot on the right shows the cumulative explained variance and indicates that the first 10 factors explain about 69% of the return variance of this cross-section of stocks.
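
The explained variance ratios are simply the eigenvalues divided by their sum. As a small worked example, with made-up eigenvalues and an assumed threshold of 80%, the number of components needed to reach a variance threshold can be read off the cumulative sum:

```python
import numpy as np

# Hypothetical eigenvalues of a covariance matrix, sorted in descending order.
eigenvalues = np.array([4.0, 2.0, 1.5, 1.0, 0.8, 0.7])

# Share of total variance carried by each component, and the running total.
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative share reaches the threshold.
threshold = 0.80
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print(explained_ratio.round(3))
print(cumulative.round(3))
print(n_components)
```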

## 5.2.2. Looking at Portfolio weights

We define a function to determine the portfolio weights implied by each principal component. We then plot the weight of every company for the chosen principal component, sorted in descending order.

def PCWeights():
    '''
    Principal Components (PC) weights for each 28 PCs
    '''
    weights = pd.DataFrame()

    for i in range(len(pca.components_)):
        weights["weights_{}".format(i)] = pca.components_[i] / sum(pca.components_[i])

    weights = weights.values.T
    return weights

weights=PCWeights()

weights[0]

array([0.04341287, 0.04351486, 0.02916042, 0.0361483 , 0.03849228,
       0.03393904, 0.03717385, 0.03209969, 0.03967455, 0.03679355,
       0.0397869 , 0.0385528 , 0.03799613, 0.0371992 , 0.03179799,
       0.04345819, 0.02887569, 0.03238323, 0.03690543, 0.03262094,
       0.03619291, 0.03202474, 0.0376561 , 0.04317801, 0.0263372 ,
       0.03180147, 0.0321998 , 0.03062387])
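
Dividing a component by the sum of its entries turns the loadings into weights that add up to 1. The sketch below uses two made-up loading vectors (the helper `to_weights` is hypothetical and only for illustration) to show two consequences: a component with uniformly negative loadings becomes a long-only portfolio, and a component whose entries nearly cancel produces very large long and short positions.

```python
import numpy as np

# Hypothetical loadings: the first vector has one sign throughout, the second
# mixes signs so that its entries nearly cancel.
pc_market = np.array([-0.52, -0.48, -0.50, -0.50])
pc_mixed = np.array([0.70, -0.69, 0.10, -0.09])

def to_weights(component):
    """Rescale loadings so that they sum to 1 (illustrative helper)."""
    return component / component.sum()

w_market = to_weights(pc_market)
w_mixed = to_weights(pc_mixed)

print(w_market)                 # all positive, despite the negative loadings
print(w_mixed)                  # very large long and short positions
print(np.abs(w_mixed).sum())    # gross exposure far above 1
```

Higher-order components are orthogonal to the first one, so their entries tend to sum to something close to zero. This is one plausible reason why portfolios built from them can show very high leverage and volatility.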

pca.components_[0]

array([-0.2278224 , -0.22835766, -0.15302828, -0.18969933, -0.20200012,
       -0.17810558, -0.19508121, -0.16845303, -0.20820442, -0.19308548,
       -0.20879404, -0.20231768, -0.19939638, -0.19521427, -0.16686975,
       -0.22806024, -0.15153408, -0.169941  , -0.19367262, -0.17118841,
       -0.18993347, -0.16805969, -0.197612  , -0.22658993, -0.13821257,
       -0.16688803, -0.16897835, -0.16070821])


NumComponents=5

topPortfolios = pd.DataFrame(pca.components_[:NumComponents], columns=dataset.columns)
eigen_portfolios = topPortfolios.div(topPortfolios.sum(1), axis=0)
eigen_portfolios.index = [f'Portfolio {i}' for i in range( NumComponents)]
np.sqrt(pca.explained_variance_)
eigen_portfolios.T.plot.bar(subplots=True, layout=(int(NumComponents),1), figsize=(14,10), legend=False, sharey=True, ylim= (-1,1))

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001E1B79E3208>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000001E1B7828048>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000001E1B78FA320>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000001E1B798D668>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000001E1B7983860>]],
      dtype=object)

# plotting heatmap
sns.heatmap(topPortfolios)

<matplotlib.axes._subplots.AxesSubplot at 0x1e1b4410898>

The heatmap and the plot above show the contribution of different stocks in each eigenvector.

## 5.2.3. Finding the Best Eigen Portfolio

In order to find the best eigen portfolios and perform backtesting in the next step, we use the Sharpe ratio, a performance metric that measures the annualized return of a portfolio relative to its annualized volatility. A high Sharpe ratio indicates higher returns and/or lower volatility for the specified portfolio. The annualized Sharpe ratio is computed by dividing the annualized return by the annualized volatility. For the annualized return, we take the geometric average of all the returns with respect to the number of periods per year (the trading days of the exchange in a year). The annualized volatility is computed by taking the standard deviation of the returns and multiplying it by the square root of the periods per year.

# Sharpe Ratio
def sharpe_ratio(ts_returns, periods_per_year=252):
    '''
    Sharpe ratio is the average return earned in excess of the risk-free rate per unit of volatility or total risk.
    This function calculates the annualized return, annualized volatility, and annualized Sharpe ratio,
    taking the risk-free rate to be zero.

    ts_returns are the returns of a single eigen portfolio.
    '''
    n_years = ts_returns.shape[0]/periods_per_year
    annualized_return = np.power(np.prod(1+ts_returns),(1/n_years))-1
    annualized_vol = ts_returns.std() * np.sqrt(periods_per_year)
    annualized_sharpe = annualized_return / annualized_vol

    return annualized_return, annualized_vol, annualized_sharpe
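
As a worked example of the calculation, the following sketch applies the same three formulas to a made-up return series. The function name `sharpe_sketch` and the alternating +1.1% / -0.9% pattern are assumptions for illustration. The risk-free rate is taken as zero, and the sample standard deviation (`ddof=1`) is used, which is what a pandas `Series.std()` computes by default.

```python
import numpy as np

def sharpe_sketch(returns, periods_per_year=252):
    """Annualized return, volatility and Sharpe ratio, risk-free rate taken as 0."""
    returns = np.asarray(returns, dtype=float)
    n_years = returns.shape[0] / periods_per_year
    # Geometric annualization: compound all returns, then take the n-th root.
    annualized_return = np.prod(1 + returns) ** (1 / n_years) - 1
    # Volatility scales with the square root of time (sample std, ddof=1).
    annualized_vol = returns.std(ddof=1) * np.sqrt(periods_per_year)
    return annualized_return, annualized_vol, annualized_return / annualized_vol

# Hypothetical series: one year alternating +1.1% and -0.9% daily returns.
daily = np.tile([0.011, -0.009], 126)
ret, vol, sharpe = sharpe_sketch(daily)
print(round(ret, 4), round(vol, 4), round(sharpe, 2))
```

Because the annualized return is compounded, a portfolio that loses more than 100% on any single day makes the product negative, and the fractional power is then undefined. Heavily leveraged portfolios can run into this.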

We construct a loop to compute the principal component weights for each eigen portfolio, and then use the Sharpe ratio function to look for the portfolio with the highest Sharpe ratio. Once we know which portfolio has the highest Sharpe ratio, we can visualize its performance against the DJIA Index for comparison.

def optimizedPortfolio():
    n_portfolios = len(pca.components_)
    annualized_ret = np.array([0.] * n_portfolios)
    sharpe_metric = np.array([0.] * n_portfolios)
    annualized_vol = np.array([0.] * n_portfolios)
    highest_sharpe = 0
    stock_tickers = rescaledDataset.columns.values
    n_tickers = len(stock_tickers)
    pcs = pca.components_

    for i in range(n_portfolios):

        pc_w = pcs[i] / sum(pcs[i])
        eigen_prtfi = pd.DataFrame(data ={'weights': pc_w.squeeze()*100}, index = stock_tickers)
        eigen_prtfi.sort_values(by=['weights'], ascending=False, inplace=True)
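        # Keep the return columns in the same ticker order as pc_w so that each weight multiplies its own stock
        eigen_prti_returns = np.dot(X_train_raw.loc[:, stock_tickers], pc_w)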
        eigen_prti_returns = pd.Series(eigen_prti_returns.squeeze(), index=X_train_raw.index)
        er, vol, sharpe = sharpe_ratio(eigen_prti_returns)
        annualized_ret[i] = er
        annualized_vol[i] = vol
        sharpe_metric[i] = sharpe

        sharpe_metric= np.nan_to_num(sharpe_metric)

    # find portfolio with the highest Sharpe ratio
    highest_sharpe = np.argmax(sharpe_metric)

    print('Eigen portfolio #%d with the highest Sharpe. Return %.2f%%, vol = %.2f%%, Sharpe = %.2f' %
          (highest_sharpe,
           annualized_ret[highest_sharpe]*100,
           annualized_vol[highest_sharpe]*100,
           sharpe_metric[highest_sharpe]))


    fig, ax = plt.subplots()
    fig.set_size_inches(12, 4)
    ax.plot(sharpe_metric, linewidth=3)
    ax.set_title('Sharpe ratio of eigen-portfolios')
    ax.set_ylabel('Sharpe ratio')
    ax.set_xlabel('Portfolios')

    results = pd.DataFrame(data={'Return': annualized_ret, 'Vol': annualized_vol, 'Sharpe': sharpe_metric})
    results.dropna(inplace=True)
    results.sort_values(by=['Sharpe'], ascending=False, inplace=True)
    print(results.head(20))

    plt.show()

optimizedPortfolio()

Eigen portfolio #0 with the highest Sharpe. Return 11.47%, vol = 13.31%, Sharpe = 0.86
 Return    Vol  Sharpe
  0.115  0.133   0.862
  0.096  0.693   0.138
  0.100  0.845   0.118
  0.057  0.670   0.084
 -0.107  0.859  -0.124
 -1.000  7.228  -0.138
 -0.399  2.070  -0.193
 -1.000  5.009  -0.200
 -1.000  4.955  -0.202
 -0.416  1.967  -0.212
 -0.158  0.738  -0.213
 -0.162  0.738  -0.220
 -1.000  4.535  -0.220
 -0.422  1.397  -0.302
 -0.998  3.277  -0.305
 -0.550  1.729  -0.318
 -0.980  3.038  -0.323
 -0.470  1.420  -0.331
 -0.886  2.571  -0.345
 -0.933  2.606  -0.358
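
The ranking step can also be written without an explicit loop. The sketch below is an illustration on synthetic data: the drifts, the three candidate weight vectors, and the seed are assumptions. It evaluates every candidate portfolio with one matrix product and then picks the one with the highest Sharpe ratio. Each weight must sit in the same position as its return column.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical daily returns for 4 assets over two years; asset 0 is given
# the highest drift.
drift = np.array([0.0010, 0.0002, 0.0002, 0.0002])
returns = drift + rng.normal(0.0, 0.01, size=(504, 4))

# Each row is one candidate portfolio; weights in a row sum to 1 and are
# listed in the same column order as `returns`.
candidates = np.array([
    [0.25, 0.25, 0.25, 0.25],
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.30, 0.30, 0.30],
])

# One matrix product gives the daily return series of every candidate.
portfolio_returns = returns @ candidates.T  # shape (days, portfolios)

n_years = returns.shape[0] / 252
ann_ret = np.prod(1 + portfolio_returns, axis=0) ** (1 / n_years) - 1
ann_vol = portfolio_returns.std(axis=0, ddof=1) * np.sqrt(252)
sharpe = ann_ret / ann_vol

best = int(np.argmax(sharpe))
print(sharpe.round(2), best)
```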

As shown in the results above, portfolio 0 is the best portfolio and has the maximum Sharpe ratio out of all the portfolios. Let us look at the composition of this portfolio.

weights = PCWeights()
portfolio = pd.DataFrame()

def plotEigen(weights, plot=False, portfolio=portfolio):
    portfolio = pd.DataFrame(data ={'weights': weights.squeeze()*100}, index = stock_tickers)
    portfolio.sort_values(by=['weights'], ascending=False, inplace=True)
    if plot:
        print('Sum of weights of current eigen-portfolio: %.2f' % portfolio['weights'].sum())
        portfolio.plot(title='Current Eigen-Portfolio Weights',
            figsize=(12,6),
            xticks=range(0, len(stock_tickers),1),
            rot=45,
            linewidth=3
            )
        plt.show()


    return portfolio

# Weights are stored in arrays, where 0 is the first PC's weights.
plotEigen(weights=weights[0], plot=True)

Sum of weights of current eigen-portfolio: 100.00

	weights
AXP	4.351
JPM	4.346
MMM	4.341
UTX	4.318
GS	3.979
DIS	3.967
HD	3.855
CAT	3.849
IBM	3.800
TRV	3.766
INTC	3.720
CSCO	3.717
MSFT	3.691
XOM	3.679
PFE	3.619
BA	3.615
CVX	3.394
NKE	3.262
MRK	3.238
WMT	3.220
KO	3.210
PG	3.202
VZ	3.180
JNJ	3.180
WBA	3.062
AAPL	2.916
MCD	2.888
UNH	2.634

The chart shows the allocation of the best portfolio. The weights in the chart are in percentages.

## 5.2.4. Backtesting Eigenportfolio

We will now backtest this algorithm on the test set by looking at a few top and bottom portfolios.

def Backtest(eigen):

    '''

    Plots principal component returns against real returns.

    '''

    eigen_prtfi = pd.DataFrame(data ={'weights': eigen.squeeze()}, index = stock_tickers)
    eigen_prtfi.sort_values(by=['weights'], ascending=False, inplace=True)

    # Keep the return columns in the same ticker order as the weights in `eigen`
    eigen_prti_returns = np.dot(X_test_raw.loc[:, stock_tickers], eigen)
    eigen_portfolio_returns = pd.Series(eigen_prti_returns.squeeze(), index=X_test_raw.index)
    returns, vol, sharpe = sharpe_ratio(eigen_portfolio_returns)
    print('Current Eigen-Portfolio:\nReturn = %.2f%%\nVolatility = %.2f%%\nSharpe = %.2f' % (returns*100, vol*100, sharpe))
    equal_weight_return=(X_test_raw * (1/len(pca.components_))).sum(axis=1)
    df_plot = pd.DataFrame({'Eigen-Portfolio Return': eigen_portfolio_returns, 'Equal Weight Index': equal_weight_return}, index=X_test.index)
    np.cumprod(df_plot + 1).plot(title='Returns of the equal weighted index vs. eigen-portfolio' ,
                          figsize=(12,6), linewidth=3)
    plt.show()

Backtest(eigen=weights[5])
Backtest(eigen=weights[1])
Backtest(eigen=weights[14])

Current Eigen-Portfolio:
Return = 32.76%
Volatility = 68.64%
Sharpe = 0.48

Current Eigen-Portfolio:
Return = 99.80%
Volatility = 58.34%
Sharpe = 1.71

Current Eigen-Portfolio:
Return = -79.42%
Volatility = 185.30%
Sharpe = -0.43

As the charts above show, the eigen portfolio returns of the top portfolios outperform the equally weighted portfolio, while the eigen portfolio ranked 19 significantly underperforms the market in the test set.
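
The comparison behind such a backtest can be sketched in a few lines. This is an illustration on synthetic out-of-sample returns (the tickers, weights, and seed are assumptions). Indexing the weights by ticker lets pandas align each weight with its own return column, whatever the column order.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Hypothetical out-of-sample daily returns for three tickers.
test_returns = pd.DataFrame(
    rng.normal(0.0005, 0.01, size=(252, 3)), columns=["A", "B", "C"]
)

# Hypothetical portfolio weights, indexed by ticker so that pandas aligns
# each weight with the matching return column.
weights = pd.Series({"A": 0.5, "B": 0.3, "C": 0.2})

portfolio = test_returns.mul(weights, axis=1).sum(axis=1)
equal_weight = test_returns.mean(axis=1)

# Growth of one unit of capital: cumulative product of (1 + daily return).
growth = (1 + pd.DataFrame({"Portfolio": portfolio,
                            "Equal weight": equal_weight})).cumprod()
print(growth.iloc[-1])
```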

Conclusion

In terms of the intuition behind the eigen portfolios, we demonstrated that the first eigen portfolio represents a systematic risk factor, while the other eigen portfolios may represent sector or industry factors. We also discussed the diversification benefits offered by the eigen portfolios: because they are derived using PCA, they are mutually uncorrelated.
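
The diversification argument rests on the fact that principal component return series have zero sample covariance with one another on the data used to fit them. A minimal numerical check on synthetic correlated returns (the factor structure and seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical correlated returns: a shared factor plus stock-specific noise.
common = rng.normal(size=(2000, 1))
X = 0.8 * common + 0.6 * rng.normal(size=(2000, 4))
X = X - X.mean(axis=0)

# Principal directions are the eigenvectors of the sample covariance matrix.
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Projecting the data on the eigenvectors gives the component return series.
scores = X @ eigenvectors
score_cov = np.cov(scores, rowvar=False)

off_diagonal = score_cov - np.diag(np.diag(score_cov))
print(np.abs(off_diagonal).max())
```

Zero correlation is guaranteed only in-sample, and it is a weaker property than statistical independence. Out of sample, or on unscaled returns, the eigen portfolios can be correlated to some degree.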

Looking at the backtesting results, the portfolio with the best result in the training set leads to the best result in the test set. By using PCA, we get uncorrelated eigen portfolios with a higher return and Sharpe ratio as compared to the market.