Pairs Trading- finding pairs based on Clustering

In this case study, we will use clustering methods to select pairs for a pairs trading strategy.

Content

# 1. Problem Definition

Our goal in this case study is to perform clustering analysis on the stocks of S&P500 and come up with pairs for a pairs trading strategy.

The data of the stocks of S&P 500, obtained using pandas_datareader from yahoo finance. It includes price data from 2018 onwards.

# 2. Getting Started- Loading the data and python packages

## 2.1. Loading the python packages

   # Load libraries
   import numpy as np
   import pandas as pd
   import matplotlib.pyplot as plt
   from pandas import read_csv, set_option
   from pandas.plotting import scatter_matrix
   import seaborn as sns
   from sklearn.preprocessing import StandardScaler
   import datetime
   import pandas_datareader as dr

   #Import Model Packages
   from sklearn.cluster import KMeans, AgglomerativeClustering,AffinityPropagation, DBSCAN
   from scipy.cluster.hierarchy import fcluster
   from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
   from scipy.spatial.distance import pdist
   from sklearn.metrics import adjusted_mutual_info_score
   from sklearn import cluster, covariance, manifold


   #Other Helper Packages and functions
   import matplotlib.ticker as ticker
   from itertools import cycle

## 2.2. Loading the Data

#The data already obtained from yahoo finance is imported.
dataset = read_csv('SP500Data.csv',index_col=0)

#Diable the warnings
import warnings
warnings.filterwarnings('ignore')

type(dataset)

   pandas.core.frame.DataFrame

# 3. Exploratory Data Analysis

## 3.1. Descriptive Statistics

# shape
dataset.shape

(448, 502)

# peek at data
set_option('display.width', 100)
dataset.head(5)

	ABT	ABBV	ABMD	ACN	ATVI	ADBE	AMD	AAP	AES	AMG	...	WLTW	WYNN	XEL	XRX	XLNX	XYL	YUM	ZBH	ZION	ZTS
Date
2018-01-02	58.790001	98.410004	192.490005	153.839996	64.309998	177.699997	10.98	106.089996	10.88	203.039993	...	146.990005	164.300003	47.810001	29.370001	67.879997	68.070000	81.599998	124.059998	50.700001	71.769997
2018-01-03	58.919998	99.949997	195.820007	154.550003	65.309998	181.039993	11.55	107.050003	10.87	202.119995	...	149.740005	162.520004	47.490002	29.330000	69.239998	68.900002	81.529999	124.919998	50.639999	72.099998
2018-01-04	58.820000	99.379997	199.250000	156.380005	64.660004	183.220001	12.12	111.000000	10.83	198.539993	...	151.259995	163.399994	47.119999	29.690001	70.489998	69.360001	82.360001	124.739998	50.849998	72.529999
2018-01-05	58.990002	101.110001	202.320007	157.669998	66.370003	185.339996	11.88	112.180000	10.87	199.470001	...	152.229996	164.490005	46.790001	29.910000	74.150002	69.230003	82.839996	125.980003	50.869999	73.360001
2018-01-08	58.820000	99.489998	207.800003	158.929993	66.629997	185.039993	12.28	111.389999	10.87	200.529999	...	151.410004	162.300003	47.139999	30.260000	74.639999	69.480003	82.980003	126.220001	50.619999	74.239998

5 rows × 502 columns

# describe data
set_option('precision', 3)
dataset.describe()

	MMM	AXP	AAPL	BA	CAT	CVX	CSCO	KO	DIS	DWDP	...	NKE	PFE	PG	TRV	UTX	UNH	VZ	V	WMT	WBA
count	4804.000	4804.000	4804.000	4804.000	4804.000	4804.000	4804.000	4804.000	4804.000	363.000	...	4804.000	4804.000	4804.000	4804.000	4804.000	4804.000	4804.000	2741.000	4804.000	4804.000
mean	86.769	49.659	49.107	85.482	56.697	61.735	21.653	24.984	46.368	64.897	...	23.724	20.737	49.960	55.961	62.209	64.418	27.193	53.323	50.767	41.697
std	53.942	22.564	55.020	79.085	34.663	31.714	10.074	10.611	32.733	5.768	...	20.988	7.630	19.769	34.644	32.627	62.920	11.973	37.647	17.040	19.937
min	25.140	8.713	0.828	17.463	9.247	17.566	6.842	11.699	11.018	49.090	...	2.595	8.041	16.204	13.287	14.521	5.175	11.210	9.846	30.748	17.317
25%	51.192	34.079	3.900	37.407	26.335	31.820	14.910	15.420	22.044	62.250	...	8.037	15.031	35.414	29.907	34.328	23.498	17.434	18.959	38.062	27.704
50%	63.514	42.274	23.316	58.437	53.048	56.942	18.578	20.563	29.521	66.586	...	14.147	18.643	46.735	39.824	55.715	42.924	21.556	45.207	42.782	32.706
75%	122.906	66.816	84.007	112.996	76.488	91.688	24.650	34.927	75.833	69.143	...	36.545	25.403	68.135	80.767	92.557	73.171	38.996	76.966	65.076	58.165
max	251.981	112.421	231.260	411.110	166.832	128.680	63.698	50.400	117.973	75.261	...	85.300	45.841	98.030	146.564	141.280	286.330	60.016	150.525	107.010	90.188

8 rows × 30 columns

## 3.2. Data Visualization

We will take a detailed look into the visualization post clustering.

## 4. Data Preparation

## 4.1. Data Cleaning We check for the NAs in the rows, either drop

them or fill them with the mean of the column.

#Checking for any null values and removing the null values'''
print('Null Values =',dataset.isnull().values.any())

Null Values = True

Getting rid of the columns with more than 30% missing values.

missing_fractions = dataset.isnull().mean().sort_values(ascending=False)

missing_fractions.head(10)

drop_list = sorted(list(missing_fractions[missing_fractions > 0.3].index))

dataset.drop(labels=drop_list, axis=1, inplace=True)
dataset.shape

(448, 498)

Given that there are null values drop the rown contianing the null values.

# Fill the missing values with the last value available in the dataset.
dataset=dataset.fillna(method='ffill')
dataset.head(2)

	ABT	ABBV	ABMD	ACN	ATVI	ADBE	AMD	AAP	AES	AMG	...	WLTW	WYNN	XEL	XRX	XLNX	XYL	YUM	ZBH	ZION	ZTS
Date
2018-01-02	58.790001	98.410004	192.490005	153.839996	64.309998	177.699997	10.98	106.089996	10.88	203.039993	...	146.990005	164.300003	47.810001	29.370001	67.879997	68.070000	81.599998	124.059998	50.700001	71.769997
2018-01-03	58.919998	99.949997	195.820007	154.550003	65.309998	181.039993	11.55	107.050003	10.87	202.119995	...	149.740005	162.520004	47.490002	29.330000	69.239998	68.900002	81.529999	124.919998	50.639999	72.099998

2 rows × 498 columns

## 4.2. Data Transformation

For the purpose of clustering, we will be using annual returns and variance as the variables as they are the indicators of the stock performance and its volatility. Let us prepare the return and volatility variables from the data.

#Calculate average annual percentage return and volatilities over a theoretical one year period
returns = dataset.pct_change().mean() * 252
returns = pd.DataFrame(returns)
returns.columns = ['Returns']
returns['Volatility'] = dataset.pct_change().std() * np.sqrt(252)
data=returns
#format the data as a numpy array to feed into the K-Means algorithm
#data = np.asarray([np.asarray(returns['Returns']),np.asarray(returns['Volatility'])]).T

All the variables should be on the same scale before applying clustering, otherwise a feature with large values will dominate the result. We use StandardScaler in sklearn to standardize the dataset’s features onto unit scale (mean = 0 and variance = 1).

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(data)
rescaledDataset = pd.DataFrame(scaler.fit_transform(data),columns = data.columns, index = data.index)
# summarize transformed data
rescaledDataset.head(2)
X=rescaledDataset
X.head(2)

	Returns	Volatility
ABT	0.794067	-0.702741
ABBV	-0.927603	0.794867

The parameters to clusters are the indices and the variables used in the clustering are the columns. Hence the data is in the right format to be fed to the clustering algorithms

# 5. Evaluate Algorithms and Models

We will look at the following models:

KMeans
Hierarchical Clustering (Agglomerative Clustering)
Affinity Propagation

## 5.1. K-Means Clustering

### 5.1.1. Finding optimal number of clusters

In this step we look at the following metrices:

Sum of square errors (SSE) within clusters
Silhouette score.

distorsions = []
max_loop=20
for k in range(2, max_loop):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    distorsions.append(kmeans.inertia_)
fig = plt.figure(figsize=(15, 5))
plt.plot(range(2, max_loop), distorsions)
plt.xticks([i for i in range(2, max_loop)], rotation=75)
plt.grid(True)

Inspecting the sum of squared errors chart, it appears the elbow “kink” occurs 5 or 6 clusters for this data. Certainly, we can see that as the number of clusters increase pass 6, the sum of square of errors within clusters plateaus off.

Silhouette score

from sklearn import metrics

silhouette_score = []
for k in range(2, max_loop):
        kmeans = KMeans(n_clusters=k,  random_state=10, n_init=10, n_jobs=-1)
        kmeans.fit(X)
        silhouette_score.append(metrics.silhouette_score(X, kmeans.labels_, random_state=10))
fig = plt.figure(figsize=(15, 5))
plt.plot(range(2, max_loop), silhouette_score)
plt.xticks([i for i in range(2, max_loop)], rotation=75)
plt.grid(True)

From the silhouette score chart, we can see that there are various parts of the graph where a kink can be seen. Since there is not much a difference in SSE after 6 clusters, we would prefer 6 clusters in the K-means model.

### 5.1.2. Clustering and Visualisation

Let us build the k-means model with six clusters and visualize the results.

nclust=6

#Fit with k-means
k_means = cluster.KMeans(n_clusters=nclust)
k_means.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=6, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

#Extracting labels
target_labels = k_means.predict(X)

Visualizing how your clusters are formed is no easy task when the number of variables/dimensions in your dataset is very large. One of the methods of visualising a cluster in two-dimensional space.

centroids = k_means.cluster_centers_
fig = plt.figure(figsize=(16,10))
ax = fig.add_subplot(111)
scatter = ax.scatter(X.iloc[:,0],X.iloc[:,1], c = k_means.labels_, cmap ="rainbow", label = X.index)
ax.set_title('k-Means results')
ax.set_xlabel('Mean Return')
ax.set_ylabel('Volatility')
plt.colorbar(scatter)

plt.plot(centroids[:,0],centroids[:,1],'sg',markersize=11)

[<matplotlib.lines.Line2D at 0x1532b4d6710>]

Let us check the elements of the clusters

# show number of stocks in each cluster
clustered_series = pd.Series(index=X.index, data=k_means.labels_.flatten())
# clustered stock with its cluster label
clustered_series_all = pd.Series(index=X.index, data=k_means.labels_.flatten())
clustered_series = clustered_series[clustered_series != -1]

plt.figure(figsize=(12,7))
plt.barh(
    range(len(clustered_series.value_counts())), # cluster labels, y axis
    clustered_series.value_counts()
)
plt.title('Cluster Member Counts')
plt.xlabel('Stocks in Cluster')
plt.ylabel('Cluster Number')
plt.show()

The number of stocks in a cluster range from around 40 to 120. Although, the distribution is not equal, we have significant number of stocks in each cluster.

## 5.2. Hierarchical Clustering (Agglomerative Clustering)

In the first step we look at the hierarchy graph and check for the number of clusters

### 5.2.1. Building Hierarchy Graph/ Dendogram

The hierarchy class has a dendrogram method which takes the value returned by the linkage method of the same class. The linkage method takes the dataset and the method to minimize distances as parameters. We use ‘ward’ as the method since it minimizes then variants of distances between the clusters.

from scipy.cluster.hierarchy import dendrogram, linkage, ward

#Calulate linkage
Z= linkage(X, method='ward')
Z[0]

array([3.30000000e+01, 3.14000000e+02, 3.62580431e-03, 2.00000000e+00])

The best way to visualize an agglomerate clustering algorithm is through a dendogram, which displays a cluster tree, the leaves being the individual stocks and the root being the final single cluster. The “distance” between each cluster is shown on the y-axis, and thus the longer the branches are, the less correlated two clusters are.

#Plot Dendogram
plt.figure(figsize=(10, 7))
plt.title("Stocks Dendrograms")
dendrogram(Z,labels = X.index)
plt.show()

Once one big cluster is formed, the longest vertical distance without any horizontal line passing through it is selected and a horizontal line is drawn through it. The number of vertical lines this newly created horizontal line passes is equal to number of clusters. Then we select the distance threshold to cut the dendrogram to obtain the selected clustering level. The output is the cluster labelled for each row of data. As expected from the dendrogram, a cut at 13 gives us four clusters.

distance_threshold = 13
clusters = fcluster(Z, distance_threshold, criterion='distance')
chosen_clusters = pd.DataFrame(data=clusters, columns=['cluster'])
chosen_clusters['cluster'].unique()

   array([1, 4, 3, 2], dtype=int64)

### 5.2.2. Clustering and Visualisation

nclust = 4
hc = AgglomerativeClustering(n_clusters=nclust, affinity = 'euclidean', linkage = 'ward')
clust_labels1 = hc.fit_predict(X)

fig = plt.figure(figsize=(16,10))
ax = fig.add_subplot(111)
scatter = ax.scatter(X.iloc[:,0],X.iloc[:,1], c =clust_labels1, cmap ="rainbow")
ax.set_title('Hierarchical Clustering')
ax.set_xlabel('Mean Return')
ax.set_ylabel('Volatility')
plt.colorbar(scatter)

<matplotlib.colorbar.Colorbar at 0x1fa81d717f0>

Similar to the plot of k-means clustering, we see that there are some distinct clusters separated by different colors.

## 5.3. Affinity Propagation

ap = AffinityPropagation()
ap.fit(X)
clust_labels2 = ap.predict(X)

fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
scatter = ax.scatter(X.iloc[:,0],X.iloc[:,1], c =clust_labels2, cmap ="rainbow")
ax.set_title('Affinity')
ax.set_xlabel('Mean Return')
ax.set_ylabel('Volatility')
plt.colorbar(scatter)

<matplotlib.colorbar.Colorbar at 0x1498fae5390>

Similar to the plot of k-means clustering, we see that there are some distinct clusters separated by different colors.

### 5.3.1 Cluster Visualisation

cluster_centers_indices = ap.cluster_centers_indices_
labels = ap.labels_

no_clusters = len(cluster_centers_indices)
print('Estimated number of clusters: %d' % no_clusters)
# Plot exemplars

X_temp=np.asarray(X)
plt.close('all')
plt.figure(1)
plt.clf()

fig = plt.figure(figsize=(8,6))
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(no_clusters), colors):
    class_members = labels == k
    cluster_center = X_temp[cluster_centers_indices[k]]
    plt.plot(X_temp[class_members, 0], X_temp[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col, markeredgecolor='k', markersize=14)
    for x in X_temp[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

plt.show()

Estimated number of clusters: 27

<Figure size 432x288 with 0 Axes>

# show number of stocks in each cluster
clustered_series_ap = pd.Series(index=X.index, data=ap.labels_.flatten())
# clustered stock with its cluster label
clustered_series_all_ap = pd.Series(index=X.index, data=ap.labels_.flatten())
clustered_series_ap = clustered_series_ap[clustered_series != -1]

plt.figure(figsize=(12,7))
plt.barh(
    range(len(clustered_series_ap.value_counts())), # cluster labels, y axis
    clustered_series_ap.value_counts()
)
plt.title('Cluster Member Counts')
plt.xlabel('Stocks in Cluster')
plt.ylabel('Cluster Number')
plt.show()

If the ground truth labels are not known, evaluation must be performed using the model itself. The Silhouette Coefficient (sklearn.metrics.silhouette_score) is an example of such an evaluation, where a higher Silhouette Coefficient score relates to a model with better defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:

from sklearn import metrics
print("km", metrics.silhouette_score(X, k_means.labels_, metric='euclidean'))
print("hc", metrics.silhouette_score(X, hc.fit_predict(X), metric='euclidean'))
print("ap", metrics.silhouette_score(X, ap.labels_, metric='euclidean'))

km 0.3350720873411941
hc 0.3432149515640865
ap 0.3450647315156527

Given the affinity propagation performs the best, we go ahead with the affinity propagation and use 27 clusters as specified by this clustering method

Visualising the return within a cluster

The understand the intuition behind clustering, let us visualize the results of the clusters.

# all stock with its cluster label (including -1)
clustered_series = pd.Series(index=X.index, data=ap.fit_predict(X).flatten())
# clustered stock with its cluster label
clustered_series_all = pd.Series(index=X.index, data=ap.fit_predict(X).flatten())
clustered_series = clustered_series[clustered_series != -1]

# get the number of stocks in each cluster
counts = clustered_series_ap.value_counts()

# let's visualize some clusters
cluster_vis_list = list(counts[(counts<25) & (counts>1)].index)[::-1]
cluster_vis_list

[11, 25, 16, 20, 15, 2, 0, 5, 19, 17, 22, 21, 24, 10, 9, 13]

CLUSTER_SIZE_LIMIT = 9999
counts = clustered_series.value_counts()
ticker_count_reduced = counts[(counts>1) & (counts<=CLUSTER_SIZE_LIMIT)]
print ("Clusters formed: %d" % len(ticker_count_reduced))
print ("Pairs to evaluate: %d" % (ticker_count_reduced*(ticker_count_reduced-1)).sum())

Clusters formed: 26
Pairs to evaluate: 12166

# plot a handful of the smallest clusters
plt.figure(figsize=(12,7))
cluster_vis_list[0:min(len(cluster_vis_list), 4)]

[11, 25, 16, 20]

<Figure size 864x504 with 0 Axes>

for clust in cluster_vis_list[0:min(len(cluster_vis_list), 4)]:
    tickers = list(clustered_series[clustered_series==clust].index)
    means = np.log(dataset.loc[:"2018-02-01", tickers].mean())
    data = np.log(dataset.loc[:"2018-02-01", tickers]).sub(means)
    data.plot(title='Stock Time Series for Cluster %d' % clust)
plt.show()

Looking at the charts above, across all the clusters with small number of stocks, we see similar movement of the stocks under different clusters, which corroborates the effectiveness of the clustering technique.

# 6. Pairs Selection

## 6.1. Cointegration and Pair Selection Function

def find_cointegrated_pairs(data, significance=0.05):
    # This function is from https://www.quantopian.com/lectures/introduction-to-pairs-trading
    n = data.shape[1]
    score_matrix = np.zeros((n, n))
    pvalue_matrix = np.ones((n, n))
    keys = data.keys()
    pairs = []
    for i in range(1):
        for j in range(i+1, n):
            S1 = data[keys[i]]
            S2 = data[keys[j]]
            result = coint(S1, S2)
            score = result[0]
            pvalue = result[1]
            score_matrix[i, j] = score
            pvalue_matrix[i, j] = pvalue
            if pvalue < significance:
                pairs.append((keys[i], keys[j]))
    return score_matrix, pvalue_matrix, pairs

from statsmodels.tsa.stattools import coint
cluster_dict = {}
for i, which_clust in enumerate(ticker_count_reduced.index):
    tickers = clustered_series[clustered_series == which_clust].index
    score_matrix, pvalue_matrix, pairs = find_cointegrated_pairs(
        dataset[tickers]
    )
    cluster_dict[which_clust] = {}
    cluster_dict[which_clust]['score_matrix'] = score_matrix
    cluster_dict[which_clust]['pvalue_matrix'] = pvalue_matrix
    cluster_dict[which_clust]['pairs'] = pairs

pairs = []
for clust in cluster_dict.keys():
    pairs.extend(cluster_dict[clust]['pairs'])

print ("Number of pairs found : %d" % len(pairs))
print ("In those pairs, there are %d unique tickers." % len(np.unique(pairs)))

Number of pairs found : 32
In those pairs, there are 47 unique tickers.

pairs

   [('AOS', 'FITB'),
    ('AOS', 'ZION'),
    ('AIG', 'TEL'),
    ('ABBV', 'BWA'),
    ('AFL', 'ARE'),
    ('AFL', 'ED'),
    ('AFL', 'MMC'),
    ('AFL', 'WM'),
    ('ACN', 'EQIX'),
    ('A', 'WAT'),
    ('ADBE', 'ADI'),
    ('ADBE', 'CDNS'),
    ('ADBE', 'VFC'),
    ('ABT', 'AZO'),
    ('ABT', 'CHD'),
    ('ABT', 'IQV'),
    ('ABT', 'WELL'),
    ('ALL', 'GL'),
    ('MO', 'CCL'),
    ('ALB', 'CTL'),
    ('ALB', 'FANG'),
    ('ALB', 'EOG'),
    ('ALB', 'HP'),
    ('ALB', 'NOV'),
    ('ALB', 'PVH'),
    ('ALB', 'TPR'),
    ('ADSK', 'ULTA'),
    ('ADSK', 'XLNX'),
    ('AAL', 'FCX'),
    ('CMG', 'EW'),
    ('CMG', 'KEYS'),
    ('XEC', 'DXC')]



## 6.2. Pair Visualization

from sklearn.manifold import TSNE
import matplotlib.cm as cm
stocks = np.unique(pairs)
X_df = pd.DataFrame(index=X.index, data=X).T

in_pairs_series = clustered_series.loc[stocks]
stocks = list(np.unique(pairs))
X_pairs = X_df.T.loc[stocks]

X_tsne = TSNE(learning_rate=50, perplexity=3, random_state=1337).fit_transform(X_pairs)

plt.figure(1, facecolor='white',figsize=(16,8))
plt.clf()
plt.axis('off')
for pair in pairs:
    #print(pair[0])
    ticker1 = pair[0]
    loc1 = X_pairs.index.get_loc(pair[0])
    x1, y1 = X_tsne[loc1, :]
    #print(ticker1, loc1)

    ticker2 = pair[0]
    loc2 = X_pairs.index.get_loc(pair[1])
    x2, y2 = X_tsne[loc2, :]

    plt.plot([x1, x2], [y1, y2], 'k-', alpha=0.3, c='gray');

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], s=220, alpha=0.9, c=in_pairs_series.values, cmap=cm.Paired)
plt.title('T-SNE Visualization of Validated Pairs');

# zip joins x and y coordinates in pairs
for x,y,name in zip(X_tsne[:,0],X_tsne[:,1],X_pairs.index):

    label = name

    plt.annotate(label, # this is the text
                 (x,y), # this is the point to label
                 textcoords="offset points", # how to position the text
                 xytext=(0,10), # distance from text to points (x,y)
                 ha='center') # horizontal alignment can be left, right or center

plt.plot(centroids[:,0],centroids[:,1],'sg',markersize=11)

[<matplotlib.lines.Line2D at 0x14901bdf7f0>]

Conclusion

The clustering techniques do not directly help in stock trend prediction. However, they can be effectively used in portfolio construction for finding the right pairs, which eventually help in risk mitigation and one can achieve superior risk adjusted returns.

We showed the approaches to finding the appropriate number of clusters in k-means and built a hierarchy graph in hierarchical clustering. A next step from this case study would be to explore and backtest various long/short trading strategies with pairs of stocks from the groupings of stocks.

Clustering can effectively be used for dividing stocks into groups with “similar characteristics” for many other kinds of trading strategies and can help in portfolio construction to ensure we choose a universe of stocks with sufficient diversification between them.