Dimensionality Reduction-Bitcoin Price Prediction Problem

In this case study, we will use the dimensionality reduction approach to enhance the “bitcoin trading strategy” related case study discussed in Chapter 6.

Content

1. Problem Definition
2. Getting Started - Load Libraries and Dataset
- 2.1. Load Libraries
- 2.2. Load Dataset
3. Exploratory Data Analysis
- 3.1. Descriptive Statistics
4. Data Preparation
5. Evaluate Algorithms and Models

# 1. Problem Definition

The goal is to reduce the number of features used by the bitcoin trading strategy classifier from Chapter 6 and to see how the reduction affects accuracy and run time.

The data and the variables used in this case study are the same as in the classification case study. The data is Bitstamp bitcoin data covering January 2012 to May 2017 (the last timestamp in the file, 1496188800, is 31 May 2017 UTC), with minute-by-minute updates of OHLC (Open, High, Low, Close), volume in BTC and in the indicated currency, and the weighted bitcoin price.

# 2. Getting Started - Loading the Data and Python Packages

## 2.1. Loading the Python Packages

   # Load libraries
   import numpy as np
   import pandas as pd
   import matplotlib.pyplot as plt
   from pandas import read_csv, set_option
   from pandas.plotting import scatter_matrix
   import seaborn as sns
   from sklearn.preprocessing import StandardScaler
   from sklearn.model_selection import train_test_split, KFold, cross_val_score
   from sklearn.ensemble import RandomForestClassifier  # used in section 5.4

   from mpl_toolkits.mplot3d import Axes3D

   import re
   from collections import OrderedDict
   from time import time
   import sqlite3

   from scipy.linalg import svd
   from scipy import stats
   from sklearn.decomposition import TruncatedSVD
   from sklearn.manifold import TSNE

   import warnings
   warnings.filterwarnings('ignore')

   from ipywidgets import interactive, fixed

## 2.2. Loading the Data

dataset = pd.read_csv(r'../../Chapter 6 - Sup. Learning - Classification models/CaseStudy3 - Bitcoin Trading Strategy/BitstampData.csv')

   # Disable the warnings
   import warnings
   warnings.filterwarnings('ignore')

# 3. Exploratory Data Analysis

## 3.1. Descriptive Statistics

# shape
dataset.shape

(2841377, 8)

# peek at data
set_option('display.width', 100)
dataset.tail(5)

	Timestamp	Open	High	Low	Close	Volume_(BTC)	Volume_(Currency)	Weighted_Price
2841372	1496188560	2190.49	2190.49	2181.37	2181.37	1.700166	3723.784755	2190.247337
2841373	1496188620	2190.50	2197.52	2186.17	2195.63	6.561029	14402.811961	2195.206304
2841374	1496188680	2195.62	2197.52	2191.52	2191.83	15.662847	34361.023647	2193.791712
2841375	1496188740	2195.82	2216.00	2195.82	2203.51	27.090309	59913.492565	2211.620837
2841376	1496188800	2201.70	2209.81	2196.98	2208.33	9.961835	21972.308955	2205.648801
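
The `Timestamp` values look like Unix epoch seconds. That is an assumption, since the unit is not stated here. Under that assumption, a minimal sketch using only the standard library converts the last rows of the table above to UTC datetimes and checks the spacing between rows:

```python
from datetime import datetime, timezone

# Last five Timestamp values from the table above (assumed to be Unix epoch seconds)
timestamps = [1496188560, 1496188620, 1496188680, 1496188740, 1496188800]

# Convert each value to a timezone-aware UTC datetime
as_utc = [datetime.fromtimestamp(ts, tz=timezone.utc) for ts in timestamps]

# Spacing between consecutive rows, in seconds (60 means one row per minute)
steps = [b - a for a, b in zip(timestamps, timestamps[1:])]

print(as_utc[-1].isoformat())
print(steps)
```

For the whole column, `pd.to_datetime(dataset['Timestamp'], unit='s')` performs the same conversion in pandas.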

# describe data
set_option('display.precision', 3)
dataset.describe()

	Timestamp	Open	High	Low	Close	Volume_(BTC)	Volume_(Currency)	Weighted_Price
count	2.841e+06	1.651e+06	1.651e+06	1.651e+06	1.651e+06	1.651e+06	1.651e+06	1.651e+06
mean	1.411e+09	4.959e+02	4.962e+02	4.955e+02	4.959e+02	1.188e+01	5.316e+03	4.959e+02
std	4.938e+07	3.642e+02	3.645e+02	3.639e+02	3.643e+02	4.094e+01	1.998e+04	3.642e+02
min	1.325e+09	3.800e+00	3.800e+00	1.500e+00	1.500e+00	0.000e+00	0.000e+00	3.800e+00
25%	1.368e+09	2.399e+02	2.400e+02	2.398e+02	2.399e+02	3.828e-01	1.240e+02	2.399e+02
50%	1.411e+09	4.200e+02	4.200e+02	4.199e+02	4.200e+02	1.823e+00	6.146e+02	4.200e+02
75%	1.454e+09	6.410e+02	6.417e+02	6.402e+02	6.410e+02	8.028e+00	3.108e+03	6.410e+02
max	1.496e+09	2.755e+03	2.760e+03	2.752e+03	2.755e+03	5.854e+03	1.866e+06	2.754e+03

# 4. Data Preparation

## 4.1. Data Cleaning

# Check for null values
print('Null Values =',dataset.isnull().values.any())

Null Values = True

Given that there are null values, we need to clean the data by filling the NaNs with the last available values.

dataset = dataset.ffill()

dataset = dataset.drop(columns=['Timestamp'])
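
To make the effect of forward filling concrete, here is a small sketch on a toy series. The values are hypothetical and are not taken from the dataset. It also shows a caveat: a missing value at the very start has nothing before it to copy, so it stays missing.

```python
import numpy as np
import pandas as pd

# Toy price series with gaps (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({'Close': [np.nan, 10.0, np.nan, np.nan, 12.0]})

# Forward fill copies the last valid value downward into each gap
filled = toy.ffill()

# A leading NaN has no earlier value to copy, so it remains missing
print(filled['Close'].tolist())
```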

## 4.2. Preparing the data for classification

We attach a label to each movement:

* 1 if the short-term moving average of the price is above the long-term moving average (a buy signal).
* 0 otherwise (a sell signal).

# Create short simple moving average over the short window (10 periods)
dataset['short_mavg'] = dataset['Close'].rolling(window=10, min_periods=1, center=False).mean()

# Create long simple moving average over the long window
dataset['long_mavg'] = dataset['Close'].rolling(window=60, min_periods=1, center=False).mean()

# Create signals
dataset['signal'] = np.where(dataset['short_mavg'] > dataset['long_mavg'], 1.0, 0.0)

dataset.tail()

	Open	High	Low	Close	Volume_(BTC)	Volume_(Currency)	Weighted_Price	short_mavg	long_mavg	signal
2841372	2190.49	2190.49	2181.37	2181.37	1.700	3723.785	2190.247	2179.259	2189.616	0.0
2841373	2190.50	2197.52	2186.17	2195.63	6.561	14402.812	2195.206	2181.622	2189.877	0.0
2841374	2195.62	2197.52	2191.52	2191.83	15.663	34361.024	2193.792	2183.605	2189.943	0.0
2841375	2195.82	2216.00	2195.82	2203.51	27.090	59913.493	2211.621	2187.018	2190.204	0.0
2841376	2201.70	2209.81	2196.98	2208.33	9.962	21972.309	2205.649	2190.712	2190.510	1.0
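
The labelling rule can be traced by hand on a short series. The sketch below uses hypothetical prices and shortened windows (2 and 4 instead of 10 and 60) so that the crossover is easy to follow. It is an illustration of the rule, not a reproduction of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices: a decline followed by a rally
close = pd.Series([10.0, 9.0, 8.0, 7.0, 8.0, 9.0, 10.0, 11.0])

# Short and long simple moving averages (windows shortened for the toy data)
short_mavg = close.rolling(window=2, min_periods=1).mean()
long_mavg = close.rolling(window=4, min_periods=1).mean()

# 1.0 where the short average is above the long average, else 0.0
signal = np.where(short_mavg > long_mavg, 1.0, 0.0)
print(signal.tolist())
```

The signal stays at 0 during the decline and switches to 1 only after the rally has lifted the short average above the long one, which shows the lag inherent in moving-average crossovers.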

## 4.3. Feature Engineering

We perform feature engineering to construct technical indicators which will be used to make the predictions, and the output variable.

The bitcoin data at this point consists of open, high, low, close, and volume fields. Using this data, we calculate the following technical indicators:

* Moving Average: A moving average indicates the trend of the price movement by cutting down the amount of “noise” on a price chart.
* Stochastic Oscillator %K and %D: A stochastic oscillator is a momentum indicator that compares a particular closing price of a security to the range of its prices over a certain period of time. %K is the fast indicator and %D, a moving average of %K, is the slow one.
* Relative Strength Index (RSI): A momentum indicator that measures the magnitude of recent price changes to evaluate overbought or oversold conditions in the price of a stock or other asset.
* Rate of Change (ROC): A momentum oscillator that measures the percentage change between the current price and the price a fixed number of periods earlier.
* Momentum (MOM): The change in a security’s price or volume over a fixed number of periods, that is, the speed at which the price is changing.
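
For reference, the indicators can be written as formulas. The notation is an assumption introduced here: $C_t$, $H_t$ and $L_t$ are the close, high and low at period $t$, and $n$ is the window length. The formulas follow the helper functions defined in this section rather than any single textbook convention. In particular, ROC uses a lag of $n-1$ periods, as coded, and %D is a 3-period average of %K.

```latex
\begin{aligned}
\mathrm{MA}_n(t)  &= \frac{1}{n}\sum_{i=0}^{n-1} C_{t-i} \\
\mathrm{MOM}_n(t) &= C_t - C_{t-n} \\
\mathrm{ROC}_n(t) &= 100 \cdot \frac{C_t - C_{t-(n-1)}}{C_{t-(n-1)}} \\
\%K_n(t) &= 100 \cdot \frac{C_t - \min_{0 \le i < n} L_{t-i}}
                          {\max_{0 \le i < n} H_{t-i} - \min_{0 \le i < n} L_{t-i}} \\
\%D_n(t) &= \frac{1}{3}\sum_{i=0}^{2} \%K_n(t-i) \\
\mathrm{RSI}_n(t) &= 100 - \frac{100}{1 + \mathrm{RS}_n(t)}, \qquad
\mathrm{RS}_n(t) = \frac{\text{smoothed average gain over } n \text{ periods}}
                        {\text{smoothed average loss over } n \text{ periods}}
\end{aligned}
```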

#calculation of exponential moving average
def EMA(df, n):
    EMA = pd.Series(df['Close'].ewm(span=n, min_periods=n).mean(), name='EMA_' + str(n))
    return EMA
dataset['EMA10'] = EMA(dataset, 10)
dataset['EMA30'] = EMA(dataset, 30)
dataset['EMA200'] = EMA(dataset, 200)
dataset.head()

# Calculation of rate of change (note: this implementation uses a lag of n - 1 periods)
def ROC(df, n):
    M = df.diff(n - 1)
    N = df.shift(n - 1)
    ROC = pd.Series(((M / N) * 100), name = 'ROC_' + str(n))
    return ROC
dataset['ROC10'] = ROC(dataset['Close'], 10)
dataset['ROC30'] = ROC(dataset['Close'], 30)

#Calculation of price momentum
def MOM(df, n):
    MOM = pd.Series(df.diff(n), name='Momentum_' + str(n))
    return MOM
dataset['MOM10'] = MOM(dataset['Close'], 10)
dataset['MOM30'] = MOM(dataset['Close'], 30)

# Calculation of relative strength index
def RSI(series, period):
    delta = series.diff().dropna()
    u = delta * 0
    d = u.copy()
    u[delta > 0] = delta[delta > 0]      # gains
    d[delta < 0] = -delta[delta < 0]     # losses, stored as positive numbers
    u[u.index[period-1]] = np.mean(u[:period])  # seed value: average gain over the first period
    u = u.drop(u.index[:(period-1)])
    d[d.index[period-1]] = np.mean(d[:period])  # seed value: average loss over the first period
    d = d.drop(d.index[:(period-1)])
    # exponential smoothing with com=period-1, i.e. alpha = 1/period
    rs = u.ewm(com=period-1, adjust=False).mean() / \
         d.ewm(com=period-1, adjust=False).mean()
    return 100 - 100 / (1 + rs)
dataset['RSI10'] = RSI(dataset['Close'], 10)
dataset['RSI30'] = RSI(dataset['Close'], 30)
dataset['RSI200'] = RSI(dataset['Close'], 200)
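
A quick way to build intuition for RSI is to check its limiting cases. The sketch below uses a simplified helper, `rsi_sketch`, a hypothetical name introduced here. It applies the same smoothing factor, alpha = 1/period, but does not seed the first value with an initial average, so its early values differ from those of `RSI` above. The input series are made up.

```python
import pandas as pd

def rsi_sketch(close: pd.Series, period: int) -> pd.Series:
    """Simplified RSI: exponentially smoothed gains divided by smoothed losses."""
    delta = close.diff().dropna()
    gains = delta.clip(lower=0)        # positive moves, zero otherwise
    losses = (-delta).clip(lower=0)    # negative moves as positive numbers
    avg_gain = gains.ewm(alpha=1 / period, adjust=False).mean()
    avg_loss = losses.ewm(alpha=1 / period, adjust=False).mean()
    rs = avg_gain / avg_loss           # becomes inf when there are no losses
    return 100 - 100 / (1 + rs)

# Hypothetical series: strictly rising, strictly falling, and mixed
rising = rsi_sketch(pd.Series(range(1, 21), dtype=float), 5)
falling = rsi_sketch(pd.Series(range(20, 0, -1), dtype=float), 5)
mixed = rsi_sketch(pd.Series([10, 11, 10.5, 11.5, 11, 12, 11.2, 12.5]), 5)

print(rising.iloc[-1], falling.iloc[-1], round(mixed.iloc[-1], 2))
```

A series with only gains pins the indicator at 100 and a series with only losses pins it at 0. Anything else falls in between, which is why the upper and lower ends of the range are read as overbought and oversold.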

# Calculation of stochastic oscillator

def STOK(close, low, high, n):
    # %K: position of the close within the n-period low-high range, in percent
    STOK = ((close - low.rolling(n).min()) / (high.rolling(n).max() - low.rolling(n).min())) * 100
    return STOK

def STOD(close, low, high, n):
    # %D: 3-period simple moving average of %K
    STOD = STOK(close, low, high, n).rolling(3).mean()
    return STOD

dataset['%K10'] = STOK(dataset['Close'], dataset['Low'], dataset['High'], 10)
dataset['%D10'] = STOD(dataset['Close'], dataset['Low'], dataset['High'], 10)
dataset['%K30'] = STOK(dataset['Close'], dataset['Low'], dataset['High'], 30)
dataset['%D30'] = STOD(dataset['Close'], dataset['Low'], dataset['High'], 30)
dataset['%K200'] = STOK(dataset['Close'], dataset['Low'], dataset['High'], 200)
dataset['%D200'] = STOD(dataset['Close'], dataset['Low'], dataset['High'], 200)

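# Calculation of moving average
# Note: the columns are named MA21, MA63 and MA252, but the windows actually used are 10, 30 and 200 periods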
def MA(df, n):
    MA = pd.Series(df['Close'].rolling(n, min_periods=n).mean(), name='MA_' + str(n))
    return MA
dataset['MA21'] = MA(dataset, 10)
dataset['MA63'] = MA(dataset, 30)
dataset['MA252'] = MA(dataset, 200)
dataset.tail()

	Open	High	Low	Close	Volume_(BTC)	Volume_(Currency)	Weighted_Price	short_mavg	long_mavg	signal	...	RSI200	%K10	%D10	%K30	%D30	%K200	%D200	MA21	MA63	MA252
2841372	2190.49	2190.49	2181.37	2181.37	1.700	3723.785	2190.247	2179.259	2189.616	0.0	...	46.613	56.447	73.774	47.883	59.889	16.012	18.930	2179.259	2182.291	2220.727
2841373	2190.50	2197.52	2186.17	2195.63	6.561	14402.812	2195.206	2181.622	2189.877	0.0	...	47.638	93.687	71.712	93.805	65.119	26.697	20.096	2181.622	2182.292	2220.295
2841374	2195.62	2197.52	2191.52	2191.83	15.663	34361.024	2193.792	2183.605	2189.943	0.0	...	47.395	80.995	77.043	81.350	74.346	23.850	22.186	2183.605	2182.120	2219.802
2841375	2195.82	2216.00	2195.82	2203.51	27.090	59913.493	2211.621	2187.018	2190.204	0.0	...	48.213	74.205	82.963	74.505	83.220	32.602	27.716	2187.018	2182.337	2219.396
2841376	2201.70	2209.81	2196.98	2208.33	9.962	21972.309	2205.649	2190.712	2190.510	1.0	...	48.545	82.810	79.337	84.344	80.066	36.440	30.964	2190.712	2182.715	2218.980

5 rows × 29 columns

#excluding columns that are not needed for our prediction.

dataset=dataset.drop(['High','Low','Open', 'Volume_(Currency)','short_mavg','long_mavg'], axis=1)

dataset = dataset.dropna(axis=0)

dataset.tail()

	Close	Volume_(BTC)	Weighted_Price	signal	EMA10	EMA30	EMA200	ROC10	ROC30	MOM10	...	RSI200	%K10	%D10	%K30	%D30	%K200	%D200	MA21	MA63	MA252
2841372	2181.37	1.700	2190.247	0.0	2181.181	2182.376	2211.244	0.431	-0.649	8.42	...	46.613	56.447	73.774	47.883	59.889	16.012	18.930	2179.259	2182.291	2220.727
2841373	2195.63	6.561	2195.206	0.0	2183.808	2183.231	2211.088	1.088	-0.062	23.63	...	47.638	93.687	71.712	93.805	65.119	26.697	20.096	2181.622	2182.292	2220.295
2841374	2191.83	15.663	2193.792	0.0	2185.266	2183.786	2210.897	1.035	-0.235	19.83	...	47.395	80.995	77.043	81.350	74.346	23.850	22.186	2183.605	2182.120	2219.802
2841375	2203.51	27.090	2211.621	0.0	2188.583	2185.058	2210.823	1.479	0.297	34.13	...	48.213	74.205	82.963	74.505	83.220	32.602	27.716	2187.018	2182.337	2219.396
2841376	2208.33	9.962	2205.649	1.0	2192.174	2186.560	2210.798	1.626	0.516	36.94	...	48.545	82.810	79.337	84.344	80.066	36.440	30.964	2190.712	2182.715	2218.980

5 rows × 23 columns

## 4.4. Data Visualization

dataset[['Weighted_Price']].plot(grid=True)
plt.show()

fig = plt.figure()
plot = dataset.groupby(['signal']).size().plot(kind='barh', color='red')
plt.show()

The signal is 1 (upward) for 52.87% of the observations, meaning that the number of buy signals is higher than the number of sell signals.
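
The class share can be read off directly with `value_counts(normalize=True)`. The labels below are hypothetical stand-ins for `dataset['signal']`, so the printed shares are those of the toy data, not the figure quoted above.

```python
import pandas as pd

# Hypothetical labels standing in for dataset['signal']
signal = pd.Series([1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0])

# Share of each class as a fraction of all rows
class_share = signal.value_counts(normalize=True)
print(class_share)
```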

# 5. Evaluate Algorithms and Models

## 5.1. Train Test Split

We use the last 10,000 observations of the dataset and split them into an 80% training set and a 20% validation set.

# use the last 10,000 observations and hold out 20% of them for validation
subset_dataset = dataset.iloc[-10000:]
Y = subset_dataset["signal"]
X = subset_dataset.loc[:, dataset.columns != 'signal']
validation_size = 0.2
seed = 1
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)

Data Standardisation

As a preprocessing step, we standardise the feature values so that each feature has zero mean and unit variance. This makes the features comparable and prepares the data for the singular value decomposition in the next step.

from sklearn.preprocessing import StandardScaler
# fit the scaler on the training data only, then transform it
scaler = StandardScaler().fit(X_train)
rescaledDataset = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
# drop any remaining rows with missing values
X_train.dropna(how='any', inplace=True)
rescaledDataset.dropna(how='any', inplace=True)
# summarize transformed data
rescaledDataset.head(2)

	Close	Volume_(BTC)	Weighted_Price	EMA10	EMA30	EMA200	ROC10	ROC30	MOM10	MOM30	...	RSI200	%K10	%D10	%K30	%D30	%K200	%D200	MA21	MA63	MA252
2834071	1.072	-0.367	1.040	1.064	1.077	1.014	0.005	-0.159	0.009	-0.183	...	-0.325	1.322	0.427	-0.205	-0.412	0.714	0.673	1.061	1.086	0.895
2836517	-1.738	1.126	-1.714	-1.687	-1.653	-1.733	-0.533	-0.597	-0.066	-0.416	...	-0.465	-1.620	-0.511	-1.283	-0.970	-0.988	-0.788	-1.685	-1.643	-1.662

2 rows × 22 columns
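
What the scaler computes can be sketched with NumPy alone: subtract the per-column mean and divide by the per-column population standard deviation (`ddof=0`), both estimated on the training rows. The data below is randomly generated and hypothetical. The point of the sketch is that validation rows reuse the training statistics rather than their own.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical feature matrices: 80 training rows and 20 validation rows, 3 features
train = rng.normal(loc=[100.0, 0.0, 50.0], scale=[30.0, 1.0, 5.0], size=(80, 3))
valid = rng.normal(loc=[100.0, 0.0, 50.0], scale=[30.0, 1.0, 5.0], size=(20, 3))

# Statistics are estimated on the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
valid_scaled = (valid - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

The scaled training columns have mean 0 and standard deviation 1 by construction. The validation columns will only be close to that, which is expected.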

## 5.2. Singular Value Decomposition-(Feature Reduction)

We want to reduce the dimensionality of the problem to make it more manageable, but at the same time we want to preserve as much information as possible.

Hence, we use a technique called singular value decomposition (SVD), which is one of the ways of performing PCA. SVD is a matrix factorization commonly used in signal processing and data compression. We use the TruncatedSVD class from the sklearn package.

from matplotlib.ticker import MaxNLocator
ncomps = 5
svd = TruncatedSVD(n_components=ncomps)
# fit once and project the training data onto the top components
Y_pred = svd.fit_transform(rescaledDataset)
# cumulative percentage of variance explained by the first k components
plt_data = pd.DataFrame(svd.explained_variance_ratio_.cumsum()*100)
plt_data.index = np.arange(1, len(plt_data) + 1)
ax = plt_data.plot(kind='line', figsize=(10, 4), legend=False)
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
ax.set_xlabel("Number of components")
ax.set_ylabel("Percentage Explained")
print('Variance preserved by first 5 components == {:.2%}'.format(svd.explained_variance_ratio_.cumsum()[-1]))

Variance preserved by first 5 components == 92.75%

We can preserve 92.75% of the variance by using just 5 components rather than the full 22 original features.
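
The link between singular values and preserved variance can be sketched with NumPy's full SVD. The data is synthetic: 6 features driven by 2 hidden factors plus noise, all hypothetical. The 90% threshold is an arbitrary choice for illustration. TruncatedSVD computes a truncated version of the same factorization, using a randomized solver by default.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 rows, 6 features driven by only 2 underlying factors
factors = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 6))
X = factors @ loadings + 0.05 * rng.normal(size=(200, 6))

# Standardise the columns, then take the SVD of the standardised matrix
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)

# For centred data, each squared singular value is proportional to the
# variance captured along the corresponding component
explained = s**2 / np.sum(s**2)
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative share reaches 90%
k = int(np.searchsorted(cumulative, 0.90) + 1)
reduced = Xs @ Vt[:k].T  # project the data onto the first k components
print(k, cumulative.round(3))
```

Choosing the number of components from the cumulative curve, rather than fixing it in advance, is one common way to arrive at a value such as the 5 components used here.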

dfsvd = pd.DataFrame(Y_pred, columns=['c{}'.format(c) for c in range(ncomps)], index=rescaledDataset.index)
print(dfsvd.shape)
dfsvd.head()

(8000, 5)

	c0	c1	c2	c3	c4
2834071	-2.252	1.920	0.538	-0.019	-0.967
2836517	5.303	-1.689	-0.678	0.473	0.643
2833945	-2.315	-0.042	1.697	-1.704	1.672
2835048	-0.977	0.782	3.706	-0.697	0.057
2838804	2.115	-1.915	0.475	-0.174	-0.299

### 5.2.1. Basic Visualisation of Reduced Features

Let's attempt to visualise the data using the compressed dataset, represented by the top 5 components of the SVD.

svdcols = [c for c in dfsvd.columns if c[0] == 'c']

Pairs Plots

Pairs-plots are a simple representation using a set of 2D scatterplots, plotting each component against another component, and coloring the datapoints according to their classification (or type of signal).

plotdims = 5
ploteorows = 1  # plot every row; increase to subsample
dfsvdplot = dfsvd[svdcols].iloc[:, :plotdims].copy()
dfsvdplot['signal'] = Y_train
ax = sns.pairplot(dfsvdplot.iloc[::ploteorows, :], hue='signal', height=1.8)

Observation:

In the scatter plots of the principal components, we can clearly see a segregation of the orange and blue dots, which means that data points with the same type of signal tend to cluster together.
However, it is hard to fully appreciate the differences and similarities between data points across all the components, because the reader has to hold the comparisons in their head while viewing the plots.

3D Scatterplot

As an alternative to the pairs plots, we can view a 3D scatterplot, which lets us see three components at once and get an interactive feel for the data.

def scatter_3D(A, elevation=30, azimuth=120):

    maxpts=1000
    fig = plt.figure(1, figsize=(9, 9))
    # create the 3D axes through the figure so that they are attached to it
    ax = fig.add_subplot(111, projection='3d')
    ax.view_init(elev=elevation, azim=azimuth)
    ax.set_xlabel('component 0')
    ax.set_ylabel('component 1')
    ax.set_zlabel('component 2')

    # plot subset of points
    rndpts = np.sort(np.random.choice(A.shape[0], min(maxpts,A.shape[0]), replace=False))
    coloridx = np.unique(A.iloc[rndpts]['signal'], return_inverse=True)
    colors = coloridx[1] / len(coloridx[0])

    sp = ax.scatter(A.iloc[rndpts,0], A.iloc[rndpts,1], A.iloc[rndpts,2]
               ,c=colors, cmap="jet", marker='o', alpha=0.6
               ,s=50, linewidths=0.8, edgecolor='#BBBBBB')

    plt.show()

dfsvd['signal'] = Y_train
interactive(scatter_3D, A=fixed(dfsvd), elevation=30, azimuth=120)

Observation:

The IPython Notebook interactive widgets let us create an interactive plot with controls for elevation and azimuth. We can use these controls to change the view of the top 3 components and investigate their relations. This appears to be more informative than the pairs plots.

However, we still suffer from the same major limitations of the pairs-plots, namely that we lose a lot of the variance and have to hold a lot in our heads when viewing.

## 5.3. t-SNE visualization

In this step, we apply another dimensionality reduction technique, t-SNE, and look at the related visualization. We use the basic implementation available in scikit-learn.

tsne = TSNE(n_components=2, random_state=0)

Z = tsne.fit_transform(dfsvd[svdcols])
dftsne = pd.DataFrame(Z, columns=['x','y'], index=dfsvd.index)

dftsne['signal'] = Y_train

g = sns.lmplot(x='x', y='y', data=dftsne, hue='signal', fit_reg=False, height=8
                ,scatter_kws={'alpha':0.7,'s':60})
g.axes.flat[0].set_title('Scatterplot of a Multiple dimension dataset reduced to 2D using t-SNE')

Text(0.5, 1.0, 'Scatterplot of a Multiple dimension dataset reduced to 2D using t-SNE')

Observation:

This is quite an interesting way of visualizing the trading signal data. The plot above shows a good degree of clustering for the trading signal. Although there is some overlap between the long and short signals, they can be distinguished quite well using the reduced number of features.

In Review:

We have analyzed the bitcoin trading signal dataset in the following steps:

* We prepared the data by cleaning it (forward-filling NaNs and dropping rows that still had missing values) and standardizing it.
* We applied the SVD transformation during the feature reduction stage.
* We then visualized the data in the reduced dimensionality and finally applied the t-SNE algorithm to reduce the data to two dimensions for visualization.

## 5.4. Compare Models With and Without Dimensionality Reduction

   # test options for classification (kfold was undefined in the original; 10 folds is an assumption)
   scoring = 'accuracy'
   kfold = KFold(n_splits=10)

### 5.4.1. Models

import time
start_time = time.time()

# baseline: random forest on the full (unreduced) feature set
models = RandomForestClassifier(n_jobs=-1)
cv_results_XTrain= cross_val_score(models, X_train, Y_train, cv=kfold, scoring=scoring)
print("Time Without Dimensionality Reduction--- %s seconds ---" % (time.time() - start_time))

Time Without Dimensionality Reduction--- 7.781347990036011 seconds ---

start_time = time.time()
X_SVD= dfsvd[svdcols].iloc[:,:5]
cv_results_SVD = cross_val_score(models, X_SVD, Y_train, cv=kfold, scoring=scoring)
print("Time with Dimensionality Reduction--- %s seconds ---" % (time.time() - start_time))

Time with Dimensionality Reduction--- 2.281977653503418 seconds ---

print("Result without dimensionality Reduction: %f (%f)" % (cv_results_XTrain.mean(), cv_results_XTrain.std()))
print("Result with dimensionality Reduction: %f (%f)" % (cv_results_SVD.mean(), cv_results_SVD.std()))

Result without dimensionality Reduction: 0.936375 (0.010774)
Result with dimensionality Reduction: 0.887500 (0.012698)

Looking at the model results, accuracy decreases from 93.64% to 88.75%, a drop of about 4.9 percentage points. In exchange, the run time falls from 7.78 to 2.28 seconds, an improvement of roughly 3.4 times, which is significant.
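
The size of the trade-off can be checked directly from the numbers printed above. The short calculation below takes the reported timings and mean accuracies as inputs and derives the speed-up factor and the accuracy drop.

```python
# Timings (seconds) and mean cross-validated accuracies reported above
time_full, time_svd = 7.781347990036011, 2.281977653503418
acc_full, acc_svd = 0.936375, 0.887500

speedup = time_full / time_svd                 # how many times faster
accuracy_drop_pp = (acc_full - acc_svd) * 100  # drop in percentage points

print(f"speed-up: {speedup:.2f}x, accuracy drop: {accuracy_drop_pp:.2f} percentage points")
```

Timings from a single run are noisy, so the speed-up factor is best read as an order of magnitude rather than a precise figure.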

Conclusion:

With dimensionality reduction, accuracy fell by about five percentage points while run time improved by a factor of roughly 3.4. In trading strategy development, when the datasets are huge and the number of features is large, such an improvement in time can speed up the entire process significantly.

We demonstrated that both SVD and t-SNE provide an interesting way of visualizing the trading signal data and help distinguish the long and short signals of a trading strategy with a reduced number of features.