Portfolio Management-Eigen Portfolio

In this case study we use dimensionality reduction techniques for portfolio management and allocation.

Content

# 1. Problem Definition

Our goal in this case study is to maximize risk-adjusted returns using dimensionality reduction-based algorithm on a dataset of stocks to allocate capital into different asset classes.

The dataset used for this case study is Dow Jones Industrial Average (DJIA) index and its respective 30 stocks from year 2000 onwards. The dataset can be downloaded from yahoo finance.

# 2. Getting Started- Loading the data and python packages

## 2.1. Loading the python packages

   # Load libraries
   import numpy as np
   import pandas as pd
   import matplotlib.pyplot as plt
   from pandas import read_csv, set_option
   from pandas.plotting import scatter_matrix
   import seaborn as sns
   from sklearn.preprocessing import StandardScaler

   #Import Model Packages
   from sklearn.decomposition import PCA
   from sklearn.decomposition import TruncatedSVD
   from numpy.linalg import inv, eig, svd

   from sklearn.manifold import TSNE
   from sklearn.decomposition import KernelPCA

## 2.2. Loading the Data
# load dataset
dataset = read_csv('Dow_adjcloses.csv',index_col=0)
#Diable the warnings
import warnings
warnings.filterwarnings('ignore')
type(dataset)
   pandas.core.frame.DataFrame



# 3. Exploratory Data Analysis

## 3.1. Descriptive Statistics
# shape
dataset.shape
(4804, 30)
# peek at data
set_option('display.width', 100)
dataset.head(5)
MMM AXP AAPL BA CAT CVX CSCO KO DIS DWDP ... NKE PFE PG TRV UTX UNH VZ V WMT WBA
Date
2000-01-03 29.847043 35.476634 3.530576 26.650218 14.560887 21.582046 43.003876 16.983583 23.522220 NaN ... 4.701180 16.746856 32.227726 20.158885 21.319030 5.841355 22.564221 NaN 47.337599 21.713237
2000-01-04 28.661131 34.134275 3.232839 26.610431 14.372251 21.582046 40.577200 17.040950 24.899860 NaN ... 4.445214 16.121738 31.596399 19.890099 20.445803 5.766368 21.833915 NaN 45.566248 20.907354
2000-01-05 30.122175 33.959430 3.280149 28.473758 14.914205 22.049145 40.895453 17.228147 25.781550 NaN ... 4.702157 16.415912 31.325831 20.085579 20.254784 5.753327 22.564221 NaN 44.503437 21.097421
2000-01-06 31.877325 33.959430 2.996290 28.553331 15.459153 22.903343 39.781569 17.210031 24.899860 NaN ... 4.677733 16.972739 32.438168 20.122232 20.998392 5.964159 22.449405 NaN 45.126952 20.527220
2000-01-07 32.509812 34.433913 3.138219 29.382213 15.962182 23.305926 42.128682 18.342270 24.506249 NaN ... 4.677733 18.123166 35.023602 20.922479 21.830687 6.662948 22.282692 NaN 48.535033 21.051805

5 rows × 30 columns

# types
set_option('display.max_rows', 500)
dataset.dtypes
MMM     float64
AXP     float64
AAPL    float64
BA      float64
CAT     float64
CVX     float64
CSCO    float64
KO      float64
DIS     float64
DWDP    float64
XOM     float64
GS      float64
HD      float64
IBM     float64
INTC    float64
JNJ     float64
JPM     float64
MCD     float64
MRK     float64
MSFT    float64
NKE     float64
PFE     float64
PG      float64
TRV     float64
UTX     float64
UNH     float64
VZ      float64
V       float64
WMT     float64
WBA     float64
dtype: object
# describe data
set_option('precision', 3)
dataset.describe()
MMM AXP AAPL BA CAT CVX CSCO KO DIS DWDP ... NKE PFE PG TRV UTX UNH VZ V WMT WBA
count 4804.000 4804.000 4804.000 4804.000 4804.000 4804.000 4804.000 4804.000 4804.000 363.000 ... 4804.000 4804.000 4804.000 4804.000 4804.000 4804.000 4804.000 2741.000 4804.000 4804.000
mean 86.769 49.659 49.107 85.482 56.697 61.735 21.653 24.984 46.368 64.897 ... 23.724 20.737 49.960 55.961 62.209 64.418 27.193 53.323 50.767 41.697
std 53.942 22.564 55.020 79.085 34.663 31.714 10.074 10.611 32.733 5.768 ... 20.988 7.630 19.769 34.644 32.627 62.920 11.973 37.647 17.040 19.937
min 25.140 8.713 0.828 17.463 9.247 17.566 6.842 11.699 11.018 49.090 ... 2.595 8.041 16.204 13.287 14.521 5.175 11.210 9.846 30.748 17.317
25% 51.192 34.079 3.900 37.407 26.335 31.820 14.910 15.420 22.044 62.250 ... 8.037 15.031 35.414 29.907 34.328 23.498 17.434 18.959 38.062 27.704
50% 63.514 42.274 23.316 58.437 53.048 56.942 18.578 20.563 29.521 66.586 ... 14.147 18.643 46.735 39.824 55.715 42.924 21.556 45.207 42.782 32.706
75% 122.906 66.816 84.007 112.996 76.488 91.688 24.650 34.927 75.833 69.143 ... 36.545 25.403 68.135 80.767 92.557 73.171 38.996 76.966 65.076 58.165
max 251.981 112.421 231.260 411.110 166.832 128.680 63.698 50.400 117.973 75.261 ... 85.300 45.841 98.030 146.564 141.280 286.330 60.016 150.525 107.010 90.188

8 rows × 30 columns

## 3.2. Data Visualization

Taking a look at the correlation. More detailed look at the data will be performed after implementing the Dimensionality Reduction Models.

# correlation
correlation = dataset.corr()
plt.figure(figsize=(15,15))
plt.title('Correlation Matrix')
sns.heatmap(correlation, vmax=1, square=True,annot=True,cmap='cubehelix')
<matplotlib.axes._subplots.AxesSubplot at 0x1e1b1d9eeb8>
CaseStudies/output_20_1.png

As it can be seen by the chart above, there is a significant positive correlation between the stocks.

## 4. Data Preparation

## 4.1. Data Cleaning Let us check for the NAs in the rows, either drop

them or fill them with the mean of the column

#Checking for any null values and removing the null values'''
print('Null Values =',dataset.isnull().values.any())
Null Values = True

Getting rid of the columns with more than 30% missing values.

missing_fractions = dataset.isnull().mean().sort_values(ascending=False)

missing_fractions.head(10)

drop_list = sorted(list(missing_fractions[missing_fractions > 0.3].index))

dataset.drop(labels=drop_list, axis=1, inplace=True)
dataset.shape
(4804, 28)

Given that there are null values drop the rown contianing the null values.

# Fill the missing values with the last value available in the dataset.
dataset=dataset.fillna(method='ffill')

# Drop the rows containing NA
dataset= dataset.dropna(axis=0)
# Fill na with 0
#dataset.fillna('0')

dataset.head(2)
MMM AXP AAPL BA CAT CVX CSCO KO DIS XOM ... MSFT NKE PFE PG TRV UTX UNH VZ WMT WBA
Date
2000-01-03 29.847 35.477 3.531 26.65 14.561 21.582 43.004 16.984 23.522 23.862 ... 38.135 4.701 16.747 32.228 20.159 21.319 5.841 22.564 47.338 21.713
2000-01-04 28.661 34.134 3.233 26.61 14.372 21.582 40.577 17.041 24.900 23.405 ... 36.846 4.445 16.122 31.596 19.890 20.446 5.766 21.834 45.566 20.907

2 rows × 28 columns

Computing Daily Return

   # Daily Log Returns (%)
   # datareturns = np.log(data / data.shift(1))

   # Daily Linear Returns (%)
   datareturns = dataset.pct_change(1)

   #Remove Outliers beyong 3 standard deviation
   datareturns= datareturns[datareturns.apply(lambda x :(x-x.mean()).abs()<(3*x.std()) ).all(1)]

## 4.2. Data Transformation

All the variables should be on the same scale before applying PCA, otherwise a feature with large values will dominate the result. Below we use StandardScaler in sklearn to standardize the dataset’s features onto unit scale (mean = 0 and variance = 1).

Standardization is a useful technique to transform attributes to a standard Normal distribution with a mean of 0 and a standard deviation of 1.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(datareturns)
rescaledDataset = pd.DataFrame(scaler.fit_transform(datareturns),columns = datareturns.columns, index = datareturns.index)
# summarize transformed data
datareturns.dropna(how='any', inplace=True)
rescaledDataset.dropna(how='any', inplace=True)
rescaledDataset.head(2)
MMM AXP AAPL BA CAT CVX CSCO KO DIS XOM ... MSFT NKE PFE PG TRV UTX UNH VZ WMT WBA
Date
2000-01-11 -1.713 0.566 -2.708 -1.133 -1.041 -0.787 -1.834 3.569 0.725 0.981 ... -1.936 3.667 -0.173 1.772 -0.936 -1.954 0.076 -0.836 -1.375 2.942
2000-01-20 -3.564 1.077 3.304 -1.670 -2.834 -0.446 0.022 0.987 -2.415 -1.897 ... -0.733 -1.816 -1.421 -2.742 -0.476 -1.916 1.654 0.241 -0.987 -0.036

2 rows × 28 columns

# Visualizing Log Returns for the DJIA
plt.figure(figsize=(16, 5))
plt.title("AAPL Return")
plt.ylabel("Return")
rescaledDataset.AAPL.plot()
plt.grid(True);
plt.legend()
plt.show()

The portfolio is divided into train and test split to perform the analysis regarding the best porfolio and backtesting shown later.

   # Dividing the dataset into training and testing sets
   percentage = int(len(rescaledDataset) * 0.8)
   X_train = rescaledDataset[:percentage]
   X_test = rescaledDataset[percentage:]

   X_train_raw = datareturns[:percentage]
   X_test_raw = datareturns[percentage:]


   stock_tickers = rescaledDataset.columns.values
   n_tickers = len(stock_tickers)

## 5.2. Model Evaluation- Applying Principle Component Analysis

As this step, we create a function to compute principle component analysis from sklearn. This function computes an inversed elbow chart that shows the amount of principle components and how many of them explain the variance treshold.

pca = PCA()
PrincipalComponent=pca.fit(X_train)

First Principal Component /Eigenvector

pca.components_[0]
   array([-0.2278224 , -0.22835766, -0.15302828, -0.18969933, -0.20200012,
          -0.17810558, -0.19508121, -0.16845303, -0.20820442, -0.19308548,
          -0.20879404, -0.20231768, -0.19939638, -0.19521427, -0.16686975,
          -0.22806024, -0.15153408, -0.169941  , -0.19367262, -0.17118841,
          -0.18993347, -0.16805969, -0.197612  , -0.22658993, -0.13821257,
          -0.16688803, -0.16897835, -0.16070821])



## 5.2.1.Explained Variance using PCA
NumEigenvalues=10
fig, axes = plt.subplots(ncols=2, figsize=(14,4))
Series1 = pd.Series(pca.explained_variance_ratio_[:NumEigenvalues]).sort_values()*100
Series2 = pd.Series(pca.explained_variance_ratio_[:NumEigenvalues]).cumsum()*100
Series1.plot.barh(ylim=(0,9), label="woohoo",title='Explained Variance Ratio by Top 10 factors',ax=axes[0]);
Series2.plot(ylim=(0,100),xlim=(0,9),ax=axes[1], title='Cumulative Explained Variance by factor');
# explained_variance
pd.Series(np.cumsum(pca.explained_variance_ratio_)).to_frame('Explained Variance').head(NumEigenvalues).style.format('{:,.2%}'.format)
Explained Variance
0 37.03%
1 42.75%
2 47.10%
3 51.08%
4 54.60%
5 57.74%
6 60.65%
7 63.44%
8 66.18%
9 68.71%
CaseStudies/output_45_1.png

We find that the most important factor explains around 40% of the daily return variation. The dominant factor is usually interpreted as ‘the market’, depending on the results of closer inspection.

The plot on the right shows the cumulative explained variance and indicates that around 10 factors explain 73% of the returns of this large cross-section of stocks.

## 5.2.2.Looking at Portfolio weights

We compute several functions to determine the weights of each principle component. We then visualize a scatterplot that visualizes an organized descending plot with the respective weight of every company at the current chosen principle component.

def PCWeights():
    '''
    Principal Components (PC) weights for each 28 PCs
    '''
    weights = pd.DataFrame()

    for i in range(len(pca.components_)):
        weights["weights_{}".format(i)] = pca.components_[i] / sum(pca.components_[i])

    weights = weights.values.T
    return weights

weights=PCWeights()
weights[0]
array([0.04341287, 0.04351486, 0.02916042, 0.0361483 , 0.03849228,
       0.03393904, 0.03717385, 0.03209969, 0.03967455, 0.03679355,
       0.0397869 , 0.0385528 , 0.03799613, 0.0371992 , 0.03179799,
       0.04345819, 0.02887569, 0.03238323, 0.03690543, 0.03262094,
       0.03619291, 0.03202474, 0.0376561 , 0.04317801, 0.0263372 ,
       0.03180147, 0.0321998 , 0.03062387])
pca.components_[0]
array([-0.2278224 , -0.22835766, -0.15302828, -0.18969933, -0.20200012,
       -0.17810558, -0.19508121, -0.16845303, -0.20820442, -0.19308548,
       -0.20879404, -0.20231768, -0.19939638, -0.19521427, -0.16686975,
       -0.22806024, -0.15153408, -0.169941  , -0.19367262, -0.17118841,
       -0.18993347, -0.16805969, -0.197612  , -0.22658993, -0.13821257,
       -0.16688803, -0.16897835, -0.16070821])
weights[0]
array([0.04341287, 0.04351486, 0.02916042, 0.0361483 , 0.03849228,
       0.03393904, 0.03717385, 0.03209969, 0.03967455, 0.03679355,
       0.0397869 , 0.0385528 , 0.03799613, 0.0371992 , 0.03179799,
       0.04345819, 0.02887569, 0.03238323, 0.03690543, 0.03262094,
       0.03619291, 0.03202474, 0.0376561 , 0.04317801, 0.0263372 ,
       0.03180147, 0.0321998 , 0.03062387])
NumComponents=5

topPortfolios = pd.DataFrame(pca.components_[:NumComponents], columns=dataset.columns)
eigen_portfolios = topPortfolios.div(topPortfolios.sum(1), axis=0)
eigen_portfolios.index = [f'Portfolio {i}' for i in range( NumComponents)]
np.sqrt(pca.explained_variance_)
eigen_portfolios.T.plot.bar(subplots=True, layout=(int(NumComponents),1), figsize=(14,10), legend=False, sharey=True, ylim= (-1,1))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001E1B79E3208>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000001E1B7828048>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000001E1B78FA320>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000001E1B798D668>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000001E1B7983860>]],
      dtype=object)
CaseStudies/output_53_1.png
# plotting heatmap
sns.heatmap(topPortfolios)
<matplotlib.axes._subplots.AxesSubplot at 0x1e1b4410898>
CaseStudies/output_54_1.png

The heatmap and the plot above shown the contribution of different stocks in each eigenvector.

## 5.2.3. Finding the Best Eigen Portfolio

In order to find the best eigen portfolios and perform backtesting in the next step, we use the sharpe ratio, which is a performance metric that explains the annualized returns against the annualized volatility of each company in a portfolio. A high sharpe ratio explains higher returns and/or lower volatility for the specified portfolio. The annualized sharpe ratio is computed by dividing the annualized returns against the annualized volatility. For annualized return we apply the geometric average of all the returns in respect to the periods per year (days of operations in the exchange in a year). Annualized volatility is computed by taking the standard deviation of the returns and multiplying it by the square root of the peri‐ ods per year.

# Sharpe Ratio
def sharpe_ratio(ts_returns, periods_per_year=252):
    '''
    Sharpe ratio is the average return earned in excess of the risk-free rate per unit of volatility or total risk.
    It calculares the annualized return, annualized volatility, and annualized sharpe ratio.

    ts_returns are  returns of a signle eigen portfolio.
    '''
    n_years = ts_returns.shape[0]/periods_per_year
    annualized_return = np.power(np.prod(1+ts_returns),(1/n_years))-1
    annualized_vol = ts_returns.std() * np.sqrt(periods_per_year)
    annualized_sharpe = annualized_return / annualized_vol

    return annualized_return, annualized_vol, annualized_sharpe

We construct a loop to compute the principle component’s weights for each eigen portfolio, which then uses the sharpe ratio function to look for the portfolio with the highest sharpe ratio. Once we know which portfolio has the highest sharpe ratio, we can visualize its performance against the DJIA Index for comparison.

def optimizedPortfolio():
    n_portfolios = len(pca.components_)
    annualized_ret = np.array([0.] * n_portfolios)
    sharpe_metric = np.array([0.] * n_portfolios)
    annualized_vol = np.array([0.] * n_portfolios)
    highest_sharpe = 0
    stock_tickers = rescaledDataset.columns.values
    n_tickers = len(stock_tickers)
    pcs = pca.components_

    for i in range(n_portfolios):

        pc_w = pcs[i] / sum(pcs[i])
        eigen_prtfi = pd.DataFrame(data ={'weights': pc_w.squeeze()*100}, index = stock_tickers)
        eigen_prtfi.sort_values(by=['weights'], ascending=False, inplace=True)
        eigen_prti_returns = np.dot(X_train_raw.loc[:, eigen_prtfi.index], pc_w)
        eigen_prti_returns = pd.Series(eigen_prti_returns.squeeze(), index=X_train_raw.index)
        er, vol, sharpe = sharpe_ratio(eigen_prti_returns)
        annualized_ret[i] = er
        annualized_vol[i] = vol
        sharpe_metric[i] = sharpe

        sharpe_metric= np.nan_to_num(sharpe_metric)

    # find portfolio with the highest Sharpe ratio
    highest_sharpe = np.argmax(sharpe_metric)

    print('Eigen portfolio #%d with the highest Sharpe. Return %.2f%%, vol = %.2f%%, Sharpe = %.2f' %
          (highest_sharpe,
           annualized_ret[highest_sharpe]*100,
           annualized_vol[highest_sharpe]*100,
           sharpe_metric[highest_sharpe]))


    fig, ax = plt.subplots()
    fig.set_size_inches(12, 4)
    ax.plot(sharpe_metric, linewidth=3)
    ax.set_title('Sharpe ratio of eigen-portfolios')
    ax.set_ylabel('Sharpe ratio')
    ax.set_xlabel('Portfolios')

    results = pd.DataFrame(data={'Return': annualized_ret, 'Vol': annualized_vol, 'Sharpe': sharpe_metric})
    results.dropna(inplace=True)
    results.sort_values(by=['Sharpe'], ascending=False, inplace=True)
    print(results.head(20))

    plt.show()

optimizedPortfolio()
Eigen portfolio #0 with the highest Sharpe. Return 11.47%, vol = 13.31%, Sharpe = 0.86
    Return    Vol  Sharpe
0    0.115  0.133   0.862
7    0.096  0.693   0.138
5    0.100  0.845   0.118
1    0.057  0.670   0.084
2   -0.107  0.859  -0.124
11  -1.000  7.228  -0.138
13  -0.399  2.070  -0.193
25  -1.000  5.009  -0.200
23  -1.000  4.955  -0.202
6   -0.416  1.967  -0.212
10  -0.158  0.738  -0.213
3   -0.162  0.738  -0.220
26  -1.000  4.535  -0.220
8   -0.422  1.397  -0.302
17  -0.998  3.277  -0.305
24  -0.550  1.729  -0.318
16  -0.980  3.038  -0.323
21  -0.470  1.420  -0.331
14  -0.886  2.571  -0.345
27  -0.933  2.606  -0.358
CaseStudies/output_60_1.png

As shown from the results above, the portfolio 12 is the best portfolio and has the maximum sharp ratio out of all the porfolio. Let us look at the composition of this portfolio.

weights = PCWeights()
portfolio = portfolio = pd.DataFrame()

def plotEigen(weights, plot=False, portfolio=portfolio):
    portfolio = pd.DataFrame(data ={'weights': weights.squeeze()*100}, index = stock_tickers)
    portfolio.sort_values(by=['weights'], ascending=False, inplace=True)
    if plot:
        print('Sum of weights of current eigen-portfolio: %.2f' % np.sum(portfolio))
        portfolio.plot(title='Current Eigen-Portfolio Weights',
            figsize=(12,6),
            xticks=range(0, len(stock_tickers),1),
            rot=45,
            linewidth=3
            )
        plt.show()


    return portfolio

# Weights are stored in arrays, where 0 is the first PC's weights.
plotEigen(weights=weights[0], plot=True)
Sum of weights of current eigen-portfolio: 100.00
CaseStudies/output_62_1.png
weights
AXP 4.351
JPM 4.346
MMM 4.341
UTX 4.318
GS 3.979
DIS 3.967
HD 3.855
CAT 3.849
IBM 3.800
TRV 3.766
INTC 3.720
CSCO 3.717
MSFT 3.691
XOM 3.679
PFE 3.619
BA 3.615
CVX 3.394
NKE 3.262
MRK 3.238
WMT 3.220
KO 3.210
PG 3.202
VZ 3.180
JNJ 3.180
WBA 3.062
AAPL 2.916
MCD 2.888
UNH 2.634

The chart shows the allocation of the best portfolio. The weights in the chart are in percentages.

## 5.2.4. Backtesting Eigenportfolio

We will now try to backtest this algorithm on the test set, by looking at few top and bottom portfolios.

def Backtest(eigen):

    '''

    Plots Principle components returns against real returns.

    '''

    eigen_prtfi = pd.DataFrame(data ={'weights': eigen.squeeze()}, index = stock_tickers)
    eigen_prtfi.sort_values(by=['weights'], ascending=False, inplace=True)

    eigen_prti_returns = np.dot(X_test_raw.loc[:, eigen_prtfi.index], eigen)
    eigen_portfolio_returns = pd.Series(eigen_prti_returns.squeeze(), index=X_test_raw.index)
    returns, vol, sharpe = sharpe_ratio(eigen_portfolio_returns)
    print('Current Eigen-Portfolio:\nReturn = %.2f%%\nVolatility = %.2f%%\nSharpe = %.2f' % (returns*100, vol*100, sharpe))
    equal_weight_return=(X_test_raw * (1/len(pca.components_))).sum(axis=1)
    df_plot = pd.DataFrame({'EigenPorfolio Return': eigen_portfolio_returns, 'Equal Weight Index': equal_weight_return}, index=X_test.index)
    np.cumprod(df_plot + 1).plot(title='Returns of the equal weighted index vs. eigen-portfolio' ,
                          figsize=(12,6), linewidth=3)
    plt.show()

Backtest(eigen=weights[5])
Backtest(eigen=weights[1])
Backtest(eigen=weights[14])
Current Eigen-Portfolio:
Return = 32.76%
Volatility = 68.64%
Sharpe = 0.48
CaseStudies/output_66_1.png
Current Eigen-Portfolio:
Return = 99.80%
Volatility = 58.34%
Sharpe = 1.71
CaseStudies/output_66_3.png
Current Eigen-Portfolio:
Return = -79.42%
Volatility = 185.30%
Sharpe = -0.43
CaseStudies/output_66_5.png

As shown in charts above the eigen portfolio return of the top portfolios outperform the equally weighted portfolio and the eigen portfolio ranked 19 underperformed the market significantly in the test set.

Conclusion

In terms of the intuition behind the eigen portfolios, we demonstrated that the first eigen portfolio represents a systematic risk factor and other eigen portfolio may represent sector or industry factor. We discuss diversification benefits offered by the eigen portfolios as they are derived using PCA and are independent.

Looking at the backtesting result, the portfolio with the best result in the training set leads to the best result in the test set. By using PCA, we get independent eigen portfo‐ lios with higher return and sharp ratio as compared to market.