Dimensionality Reduction-Bitcoin Price Prediction Problem
In this case study, we will use the dimensionality reduction approach to enhance the “bitcoin trading strategy” related case study discussed in Chapter 6.
# 1. Problem Definition
The data and variables used in this case study are the same as in the case study presented in the classification chapter. The data is Bitstamp bitcoin data for the period January 2012 to May 2017, with minute-by-minute updates of OHLC (Open, High, Low, Close), volume in BTC and in the indicated currency, and the weighted bitcoin price.
# 2. Getting Started - Loading the Data and Python Packages
## 2.1. Loading the Python packages
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import read_csv, set_option
from pandas.plotting import scatter_matrix
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from mpl_toolkits.mplot3d import Axes3D
import re
from collections import OrderedDict
from time import time
import sqlite3
from scipy.linalg import svd
from scipy import stats
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
import warnings
warnings.filterwarnings('ignore')
# ipywidgets replaces the removed IPython.html.widgets module
from ipywidgets import interactive, fixed
## 2.2. Loading the Data
dataset = pd.read_csv(r'../../Chapter 6 - Sup. Learning - Classification models/CaseStudy3 - Bitcoin Trading Strategy/BitstampData.csv')
# 3. Exploratory Data Analysis
## 3.1. Descriptive Statistics
# shape
dataset.shape
(2841377, 8)
# peek at data
set_option('display.width', 100)
dataset.tail(5)
| | Timestamp | Open | High | Low | Close | Volume_(BTC) | Volume_(Currency) | Weighted_Price |
|---|---|---|---|---|---|---|---|---|
| 2841372 | 1496188560 | 2190.49 | 2190.49 | 2181.37 | 2181.37 | 1.700166 | 3723.784755 | 2190.247337 |
| 2841373 | 1496188620 | 2190.50 | 2197.52 | 2186.17 | 2195.63 | 6.561029 | 14402.811961 | 2195.206304 |
| 2841374 | 1496188680 | 2195.62 | 2197.52 | 2191.52 | 2191.83 | 15.662847 | 34361.023647 | 2193.791712 |
| 2841375 | 1496188740 | 2195.82 | 2216.00 | 2195.82 | 2203.51 | 27.090309 | 59913.492565 | 2211.620837 |
| 2841376 | 1496188800 | 2201.70 | 2209.81 | 2196.98 | 2208.33 | 9.961835 | 21972.308955 | 2205.648801 |
# describe data
set_option('display.precision', 3)
dataset.describe()
| | Timestamp | Open | High | Low | Close | Volume_(BTC) | Volume_(Currency) | Weighted_Price |
|---|---|---|---|---|---|---|---|---|
| count | 2.841e+06 | 1.651e+06 | 1.651e+06 | 1.651e+06 | 1.651e+06 | 1.651e+06 | 1.651e+06 | 1.651e+06 |
| mean | 1.411e+09 | 4.959e+02 | 4.962e+02 | 4.955e+02 | 4.959e+02 | 1.188e+01 | 5.316e+03 | 4.959e+02 |
| std | 4.938e+07 | 3.642e+02 | 3.645e+02 | 3.639e+02 | 3.643e+02 | 4.094e+01 | 1.998e+04 | 3.642e+02 |
| min | 1.325e+09 | 3.800e+00 | 3.800e+00 | 1.500e+00 | 1.500e+00 | 0.000e+00 | 0.000e+00 | 3.800e+00 |
| 25% | 1.368e+09 | 2.399e+02 | 2.400e+02 | 2.398e+02 | 2.399e+02 | 3.828e-01 | 1.240e+02 | 2.399e+02 |
| 50% | 1.411e+09 | 4.200e+02 | 4.200e+02 | 4.199e+02 | 4.200e+02 | 1.823e+00 | 6.146e+02 | 4.200e+02 |
| 75% | 1.454e+09 | 6.410e+02 | 6.417e+02 | 6.402e+02 | 6.410e+02 | 8.028e+00 | 3.108e+03 | 6.410e+02 |
| max | 1.496e+09 | 2.755e+03 | 2.760e+03 | 2.752e+03 | 2.755e+03 | 5.854e+03 | 1.866e+06 | 2.754e+03 |
# Check for null values
print('Null Values =',dataset.isnull().values.any())
Null Values = True
Given that there are null values, we need to clean the data by filling the NaNs with the last available values.
dataset[dataset.columns.values] = dataset[dataset.columns.values].ffill()
dataset=dataset.drop(columns=['Timestamp'])
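To make the forward-fill step concrete, here is a minimal sketch on a made-up two-column frame. The `toy` frame and its values are illustrative assumptions and are not taken from the Bitstamp file.

```python
import numpy as np
import pandas as pd

# Toy minute bars with gaps (illustrative values only, not real data)
toy = pd.DataFrame({
    "Close": [100.0, np.nan, np.nan, 103.0],
    "Volume_(BTC)": [1.5, np.nan, 2.0, np.nan],
})

# Forward fill carries the last observed value into each gap
filled = toy.ffill()
print(filled)
```

Note that forward filling cannot repair NaNs at the very start of a series, since there is no earlier value to carry forward.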
## 4.2. Preparing the data for classification
We attach a label to each movement:
* 1 if the signal is that the short-term price will go up as compared to the long term.
* 0 if the signal is that the short-term price will go down as compared to the long term.
# Label each observation with a moving-average crossover signal
# Create short simple moving average over the short window
dataset['short_mavg'] = dataset['Close'].rolling(window=10, min_periods=1, center=False).mean()
# Create long simple moving average over the long window
dataset['long_mavg'] = dataset['Close'].rolling(window=60, min_periods=1, center=False).mean()
# Create signals
dataset['signal'] = np.where(dataset['short_mavg'] > dataset['long_mavg'], 1.0, 0.0)
dataset.tail()
| | Open | High | Low | Close | Volume_(BTC) | Volume_(Currency) | Weighted_Price | short_mavg | long_mavg | signal |
|---|---|---|---|---|---|---|---|---|---|---|
| 2841372 | 2190.49 | 2190.49 | 2181.37 | 2181.37 | 1.700 | 3723.785 | 2190.247 | 2179.259 | 2189.616 | 0.0 |
| 2841373 | 2190.50 | 2197.52 | 2186.17 | 2195.63 | 6.561 | 14402.812 | 2195.206 | 2181.622 | 2189.877 | 0.0 |
| 2841374 | 2195.62 | 2197.52 | 2191.52 | 2191.83 | 15.663 | 34361.024 | 2193.792 | 2183.605 | 2189.943 | 0.0 |
| 2841375 | 2195.82 | 2216.00 | 2195.82 | 2203.51 | 27.090 | 59913.493 | 2211.621 | 2187.018 | 2190.204 | 0.0 |
| 2841376 | 2201.70 | 2209.81 | 2196.98 | 2208.33 | 9.962 | 21972.309 | 2205.649 | 2190.712 | 2190.510 | 1.0 |
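The crossover rule can be traced by hand on a short series. The sketch below uses made-up prices and much shorter windows (2 and 4 instead of 10 and 60) purely so that the result is easy to verify; these values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

close = pd.Series([10.0, 11.0, 12.0, 11.0, 9.0, 8.0])  # illustrative prices

# Short and long simple moving averages (windows shortened for the toy data)
short_mavg = close.rolling(window=2, min_periods=1).mean()
long_mavg = close.rolling(window=4, min_periods=1).mean()

# 1.0 when the short average is strictly above the long average, else 0.0
signal = np.where(short_mavg > long_mavg, 1.0, 0.0)
print(signal)
```

The signal is 1 only while the short average sits above the long one, so it switches back to 0 once the price decline pulls the short average down.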
We perform feature engineering to construct technical indicators which will be used to make the predictions, and the output variable.
The current bitcoin data consists of date, open, high, low, close and volume. Using this data, we calculate the following technical indicators:
* Moving Average: A moving average provides an indication of the trend of the price movement by cutting down the amount of “noise” on a price chart.
* Stochastic Oscillator %K and %D: A stochastic oscillator is a momentum indicator comparing a particular closing price of a security to a range of its prices over a certain period of time. %K and %D are the fast and slow indicators, respectively.
* Relative Strength Index (RSI): It is a momentum indicator that measures the magnitude of recent price changes to evaluate overbought or oversold conditions in the price of a stock or other asset.
* Rate of Change (ROC): It is a momentum oscillator, which measures the percentage change between the current price and the price n periods in the past.
* Momentum (MOM): It is the rate of acceleration of a security’s price or volume, that is, the speed at which the price is changing.
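For reference, the indicators can be written as formulas that mirror the code in this section. The symbols are notation introduced here for illustration: $C_t$, $H_t$ and $L_t$ denote the close, high and low at minute $t$, and $n$ is the look-back window. Note that the ROC implementation in this section uses a lag of $n-1$ rather than $n$.

```latex
\begin{aligned}
\mathrm{ROC}_n(t) &= 100 \times \frac{C_t - C_{t-(n-1)}}{C_{t-(n-1)}} \\
\mathrm{MOM}_n(t) &= C_t - C_{t-n} \\
\%K_n(t) &= 100 \times \frac{C_t - \min_{t-n < s \le t} L_s}{\max_{t-n < s \le t} H_s - \min_{t-n < s \le t} L_s} \\
\%D_n(t) &= \frac{1}{3} \sum_{j=0}^{2} \%K_n(t-j) \\
\mathrm{RSI}(t) &= 100 - \frac{100}{1 + RS_t}, \qquad RS_t = \frac{\text{smoothed average gain}}{\text{smoothed average loss}}
\end{aligned}
```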
# Calculation of exponential moving average
def EMA(df, n):
    ema = pd.Series(df['Close'].ewm(span=n, min_periods=n).mean(), name='EMA_' + str(n))
    return ema
dataset['EMA10'] = EMA(dataset, 10)
dataset['EMA30'] = EMA(dataset, 30)
dataset['EMA200'] = EMA(dataset, 200)
dataset.head()
# Calculation of rate of change (this implementation uses a lag of n - 1)
def ROC(df, n):
    M = df.diff(n - 1)
    N = df.shift(n - 1)
    roc = pd.Series(((M / N) * 100), name='ROC_' + str(n))
    return roc
dataset['ROC10'] = ROC(dataset['Close'], 10)
dataset['ROC30'] = ROC(dataset['Close'], 30)
# Calculation of price momentum
def MOM(df, n):
    mom = pd.Series(df.diff(n), name='Momentum_' + str(n))
    return mom
dataset['MOM10'] = MOM(dataset['Close'], 10)
dataset['MOM30'] = MOM(dataset['Close'], 30)
# Calculation of relative strength index
def RSI(series, period):
    delta = series.diff().dropna()
    u = delta * 0
    d = u.copy()
    u[delta > 0] = delta[delta > 0]
    d[delta < 0] = -delta[delta < 0]
    u[u.index[period-1]] = np.mean(u[:period])  # seed with the average gain over the first period
    u = u.drop(u.index[:(period-1)])
    d[d.index[period-1]] = np.mean(d[:period])  # seed with the average loss over the first period
    d = d.drop(d.index[:(period-1)])
    rs = u.ewm(com=period-1, adjust=False).mean() / \
        d.ewm(com=period-1, adjust=False).mean()
    return 100 - 100 / (1 + rs)
dataset['RSI10'] = RSI(dataset['Close'], 10)
dataset['RSI30'] = RSI(dataset['Close'], 30)
dataset['RSI200'] = RSI(dataset['Close'], 200)
# Calculation of stochastic oscillator
def STOK(close, low, high, n):
    stok = ((close - low.rolling(n).min()) / (high.rolling(n).max() - low.rolling(n).min())) * 100
    return stok
def STOD(close, low, high, n):
    stok = ((close - low.rolling(n).min()) / (high.rolling(n).max() - low.rolling(n).min())) * 100
    stod = stok.rolling(3).mean()  # %D is the 3-period moving average of %K
    return stod
dataset['%K10'] = STOK(dataset['Close'], dataset['Low'], dataset['High'], 10)
dataset['%D10'] = STOD(dataset['Close'], dataset['Low'], dataset['High'], 10)
dataset['%K30'] = STOK(dataset['Close'], dataset['Low'], dataset['High'], 30)
dataset['%D30'] = STOD(dataset['Close'], dataset['Low'], dataset['High'], 30)
dataset['%K200'] = STOK(dataset['Close'], dataset['Low'], dataset['High'], 200)
dataset['%D200'] = STOD(dataset['Close'], dataset['Low'], dataset['High'], 200)
# Calculation of moving average
def MA(df, n):
    ma = pd.Series(df['Close'].rolling(n, min_periods=n).mean(), name='MA_' + str(n))
    return ma
# Note: the column names say 21, 63 and 252, but the windows actually used are 10, 30 and 200
dataset['MA21'] = MA(dataset, 10)
dataset['MA63'] = MA(dataset, 30)
dataset['MA252'] = MA(dataset, 200)
dataset.tail()
| | Open | High | Low | Close | Volume_(BTC) | Volume_(Currency) | Weighted_Price | short_mavg | long_mavg | signal | ... | RSI200 | %K10 | %D10 | %K30 | %D30 | %K200 | %D200 | MA21 | MA63 | MA252 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2841372 | 2190.49 | 2190.49 | 2181.37 | 2181.37 | 1.700 | 3723.785 | 2190.247 | 2179.259 | 2189.616 | 0.0 | ... | 46.613 | 56.447 | 73.774 | 47.883 | 59.889 | 16.012 | 18.930 | 2179.259 | 2182.291 | 2220.727 |
| 2841373 | 2190.50 | 2197.52 | 2186.17 | 2195.63 | 6.561 | 14402.812 | 2195.206 | 2181.622 | 2189.877 | 0.0 | ... | 47.638 | 93.687 | 71.712 | 93.805 | 65.119 | 26.697 | 20.096 | 2181.622 | 2182.292 | 2220.295 |
| 2841374 | 2195.62 | 2197.52 | 2191.52 | 2191.83 | 15.663 | 34361.024 | 2193.792 | 2183.605 | 2189.943 | 0.0 | ... | 47.395 | 80.995 | 77.043 | 81.350 | 74.346 | 23.850 | 22.186 | 2183.605 | 2182.120 | 2219.802 |
| 2841375 | 2195.82 | 2216.00 | 2195.82 | 2203.51 | 27.090 | 59913.493 | 2211.621 | 2187.018 | 2190.204 | 0.0 | ... | 48.213 | 74.205 | 82.963 | 74.505 | 83.220 | 32.602 | 27.716 | 2187.018 | 2182.337 | 2219.396 |
| 2841376 | 2201.70 | 2209.81 | 2196.98 | 2208.33 | 9.962 | 21972.309 | 2205.649 | 2190.712 | 2190.510 | 1.0 | ... | 48.545 | 82.810 | 79.337 | 84.344 | 80.066 | 36.440 | 30.964 | 2190.712 | 2182.715 | 2218.980 |
5 rows × 29 columns
#excluding columns that are not needed for our prediction.
dataset=dataset.drop(['High','Low','Open', 'Volume_(Currency)','short_mavg','long_mavg'], axis=1)
dataset = dataset.dropna(axis=0)
dataset.tail()
| | Close | Volume_(BTC) | Weighted_Price | signal | EMA10 | EMA30 | EMA200 | ROC10 | ROC30 | MOM10 | ... | RSI200 | %K10 | %D10 | %K30 | %D30 | %K200 | %D200 | MA21 | MA63 | MA252 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2841372 | 2181.37 | 1.700 | 2190.247 | 0.0 | 2181.181 | 2182.376 | 2211.244 | 0.431 | -0.649 | 8.42 | ... | 46.613 | 56.447 | 73.774 | 47.883 | 59.889 | 16.012 | 18.930 | 2179.259 | 2182.291 | 2220.727 |
| 2841373 | 2195.63 | 6.561 | 2195.206 | 0.0 | 2183.808 | 2183.231 | 2211.088 | 1.088 | -0.062 | 23.63 | ... | 47.638 | 93.687 | 71.712 | 93.805 | 65.119 | 26.697 | 20.096 | 2181.622 | 2182.292 | 2220.295 |
| 2841374 | 2191.83 | 15.663 | 2193.792 | 0.0 | 2185.266 | 2183.786 | 2210.897 | 1.035 | -0.235 | 19.83 | ... | 47.395 | 80.995 | 77.043 | 81.350 | 74.346 | 23.850 | 22.186 | 2183.605 | 2182.120 | 2219.802 |
| 2841375 | 2203.51 | 27.090 | 2211.621 | 0.0 | 2188.583 | 2185.058 | 2210.823 | 1.479 | 0.297 | 34.13 | ... | 48.213 | 74.205 | 82.963 | 74.505 | 83.220 | 32.602 | 27.716 | 2187.018 | 2182.337 | 2219.396 |
| 2841376 | 2208.33 | 9.962 | 2205.649 | 1.0 | 2192.174 | 2186.560 | 2210.798 | 1.626 | 0.516 | 36.94 | ... | 48.545 | 82.810 | 79.337 | 84.344 | 80.066 | 36.440 | 30.964 | 2190.712 | 2182.715 | 2218.980 |
5 rows × 23 columns
dataset[['Weighted_Price']].plot(grid=True)
plt.show()
fig = plt.figure()
plot = dataset.groupby(['signal']).size().plot(kind='barh', color='red')
plt.show()
The signal is 1 (upward) in 52.87% of the observations, meaning that the number of buy signals was higher than the number of sell signals.
# 5. Evaluate Algorithms and Models
## 5.1. Train Test Split
We take the most recent 10,000 observations and split them into an 80% training set and a 20% validation set.
# split out validation dataset for the end
subset_dataset= dataset.iloc[-10000:]
Y= subset_dataset["signal"]
X = subset_dataset.loc[:, dataset.columns != 'signal']
validation_size = 0.2
seed = 1
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=1)
Data Standardisation
As a preprocessing step, let's start by standardising the feature values. This makes comparisons simpler and prepares the data for the Singular Value Decomposition in the next step.
# fit the scaler on the training data once, then apply it
scaler = StandardScaler().fit(X_train)
rescaledDataset = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
# summarize transformed data
X_train.dropna(how='any', inplace=True)
rescaledDataset.dropna(how='any', inplace=True)
rescaledDataset.head(2)
| | Close | Volume_(BTC) | Weighted_Price | EMA10 | EMA30 | EMA200 | ROC10 | ROC30 | MOM10 | MOM30 | ... | RSI200 | %K10 | %D10 | %K30 | %D30 | %K200 | %D200 | MA21 | MA63 | MA252 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2834071 | 1.072 | -0.367 | 1.040 | 1.064 | 1.077 | 1.014 | 0.005 | -0.159 | 0.009 | -0.183 | ... | -0.325 | 1.322 | 0.427 | -0.205 | -0.412 | 0.714 | 0.673 | 1.061 | 1.086 | 0.895 |
| 2836517 | -1.738 | 1.126 | -1.714 | -1.687 | -1.653 | -1.733 | -0.533 | -0.597 | -0.066 | -0.416 | ... | -0.465 | -1.620 | -0.511 | -1.283 | -0.970 | -0.988 | -0.788 | -1.685 | -1.643 | -1.662 |
2 rows × 22 columns
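The idea behind the standardisation step can be sketched with plain NumPy: the mean and standard deviation are learned from the training rows only and then reused for any held-out rows. The arrays `train` and `valid` below are synthetic stand-ins (assumptions for illustration), not the case-study data.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=500.0, scale=300.0, size=(80, 3))  # stand-in for X_train
valid = rng.normal(loc=500.0, scale=300.0, size=(20, 3))  # stand-in for X_validation

# Learn mean and standard deviation from the training rows only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

train_scaled = (train - mu) / sigma
valid_scaled = (valid - mu) / sigma  # reuse the training statistics

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

After scaling, each training column has mean 0 and standard deviation 1. The held-out columns will be close to, but not exactly, 0 and 1, because they are scaled with the training statistics.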
We want to reduce the dimensionality of the problem to make it more manageable, but at the same time we want to preserve as much information as possible.
Hence, we use a technique called singular value decomposition (SVD), which is one of the ways of performing PCA. SVD is a matrix factorization commonly used in signal processing and data compression. We use the TruncatedSVD method from the sklearn package.
from matplotlib.ticker import MaxNLocator
ncomps = 5
svd = TruncatedSVD(n_components=ncomps)
svd_fit = svd.fit(rescaledDataset)
plt_data = pd.DataFrame(svd_fit.explained_variance_ratio_.cumsum()*100)
plt_data.index = np.arange(1, len(plt_data) + 1)
Y_pred = svd.fit_transform(rescaledDataset)
ax = plt_data.plot(kind='line', figsize=(10, 4))
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
ax.set_xlabel("Eigenvalues")
ax.set_ylabel("Percentage Explained")
ax.legend("")
print('Variance preserved by first 5 components == {:.2%}'.format(svd_fit.explained_variance_ratio_.cumsum()[-1]))
Variance preserved by first 5 components == 92.75%
We can preserve 92.75% of the variance by using just 5 components rather than the full 22 original features.
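As a rough illustration of where the cumulative explained-variance curve comes from, the sketch below computes it directly from the singular values of a standardised matrix. The data is synthetic (six features driven by two latent factors) and the 90% target is an arbitrary threshold chosen for the example; neither comes from the case study.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in: 6 correlated features driven by 2 latent factors plus noise
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardise, as in the text

# For centred data, squared singular values are proportional to component variances
s = np.linalg.svd(X, full_matrices=False, compute_uv=False)
explained_ratio = s**2 / np.sum(s**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components reaching the (illustrative) 90% target
k = int(np.searchsorted(cumulative, 0.90) + 1)
print(cumulative.round(3), k)
```

Because the features are driven by only two underlying factors, a small number of components captures most of the variance, which is the same effect the 5-component result above relies on.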
dfsvd = pd.DataFrame(Y_pred, columns=['c{}'.format(c) for c in range(ncomps)], index=rescaledDataset.index)
print(dfsvd.shape)
dfsvd.head()
(8000, 5)
| | c0 | c1 | c2 | c3 | c4 |
|---|---|---|---|---|---|
| 2834071 | -2.252 | 1.920 | 0.538 | -0.019 | -0.967 |
| 2836517 | 5.303 | -1.689 | -0.678 | 0.473 | 0.643 |
| 2833945 | -2.315 | -0.042 | 1.697 | -1.704 | 1.672 |
| 2835048 | -0.977 | 0.782 | 3.706 | -0.697 | 0.057 |
| 2838804 | 2.115 | -1.915 | 0.475 | -0.174 | -0.299 |
Let's attempt to visualise the compressed dataset, represented by the top 5 components of the SVD.
svdcols = [c for c in dfsvd.columns if c[0] == 'c']
Pairs Plots
Pairs-plots are a simple representation using a set of 2D scatterplots, plotting each component against another component, and coloring the datapoints according to their classification (or type of signal).
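plotdims = 5
ploteorows = 1  # plot every n-th row; 1 keeps all rows
dfsvdplot = dfsvd[svdcols].iloc[:, :plotdims].copy()
dfsvdplot['signal'] = Y_train
# `height` replaces the `size` argument in current seaborn versions
ax = sns.pairplot(dfsvdplot.iloc[::ploteorows, :], hue='signal', height=1.8)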
Observation:
In the scatter plots of the principal components, we can see a clear segregation of the orange and blue dots, which means that data points from the same type of signal tend to cluster together.
However, it's hard to get a full appreciation of the differences and similarities between data points across all the components, as the reader has to hold the comparisons in their head while viewing.
3D Scatterplot
As an alternative to the pairs-plots, we could view a 3D scatterplot, which at least lets us see more dimensions at once and possibly get an interactive feel for the data.
def scatter_3D(A, elevation=30, azimuth=120):
    maxpts = 1000
    fig = plt.figure(1, figsize=(9, 9))
    # add_subplot(projection='3d') replaces the direct Axes3D(fig, ...) call,
    # which recent matplotlib versions no longer attach to the figure
    ax = fig.add_subplot(projection='3d')
    ax.view_init(elev=elevation, azim=azimuth)
    ax.set_xlabel('component 0')
    ax.set_ylabel('component 1')
    ax.set_zlabel('component 2')
    # plot a random subset of points
    rndpts = np.sort(np.random.choice(A.shape[0], min(maxpts, A.shape[0]), replace=False))
    coloridx = np.unique(A.iloc[rndpts]['signal'], return_inverse=True)
    colors = coloridx[1] / len(coloridx[0])
    sp = ax.scatter(A.iloc[rndpts, 0], A.iloc[rndpts, 1], A.iloc[rndpts, 2],
                    c=colors, cmap="jet", marker='o', alpha=0.6,
                    s=50, linewidths=0.8, edgecolor='#BBBBBB')
    plt.show()
dfsvd['signal'] = Y_train
interactive(scatter_3D, A=fixed(dfsvd), elevation=30, azimuth=120)
Observation:
The Jupyter notebook interactive widgets let us create an interactive plot with controls for elevation and azimuth. We can use these controls to interactively change the view of the top 3 components and investigate their relations. This certainly appears to be more informative than pairs-plots.
However, we still suffer from the same major limitations of the pairs-plots, namely that we lose a lot of the variance and have to hold a lot in our heads when viewing.
## 5.3. t-SNE visualization
In this step, we implement another dimensionality reduction technique, t-SNE, and look at the related visualization. We will use the basic implementation available in scikit-learn.
tsne = TSNE(n_components=2, random_state=0)
Z = tsne.fit_transform(dfsvd[svdcols])
dftsne = pd.DataFrame(Z, columns=['x','y'], index=dfsvd.index)
dftsne['signal'] = Y_train
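# current seaborn versions require keyword arguments and use `height` instead of `size`
g = sns.lmplot(x='x', y='y', data=dftsne, hue='signal', fit_reg=False, height=8,
               scatter_kws={'alpha': 0.7, 's': 60})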
g.axes.flat[0].set_title('Scatterplot of a Multiple dimension dataset reduced to 2D using t-SNE')
Text(0.5, 1.0, 'Scatterplot of a Multiple dimension dataset reduced to 2D using t-SNE')
Observation:
This is quite an interesting way of visualizing the trading signal data. The plot above shows a good degree of clustering for the trading signal. Although there is some overlap between the long and short signals, they can be distinguished quite well using the reduced number of features.
In Review:
We have analyzed the bitcoin trading signal dataset in the following steps:
* We prepared the data by cleaning it (forward-filling NaNs and dropping rows with undefined indicator values) and standardising it.
* We applied an SVD transformation during the feature reduction stage.
* We then visualized the data in the reduced dimensionality and ultimately applied the t-SNE algorithm to reduce the data to two dimensions and visualize it effectively.
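## 5.4. Compare Models With and Without Dimensionality Reduction
# test options for classification
scoring = 'accuracy'
# NOTE: `kfold` is used below but its definition is missing from this section;
# 10 folds is an assumption, not a value confirmed by the source
kfold = KFold(n_splits=10)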
### 5.4.1. Models
import time
start_time = time.time()
# spot check the algorithm
from sklearn.ensemble import RandomForestClassifier
models = RandomForestClassifier(n_jobs=-1)
cv_results_XTrain= cross_val_score(models, X_train, Y_train, cv=kfold, scoring=scoring)
print("Time Without Dimensionality Reduction--- %s seconds ---" % (time.time() - start_time))
Time Without Dimensionality Reduction--- 7.781347990036011 seconds ---
start_time = time.time()
X_SVD= dfsvd[svdcols].iloc[:,:5]
cv_results_SVD = cross_val_score(models, X_SVD, Y_train, cv=kfold, scoring=scoring)
print("Time with Dimensionality Reduction--- %s seconds ---" % (time.time() - start_time))
Time with Dimensionality Reduction--- 2.281977653503418 seconds ---
print("Result without dimensionality Reduction: %f (%f)" % (cv_results_XTrain.mean(), cv_results_XTrain.std()))
print("Result with dimensionality Reduction: %f (%f)" % (cv_results_SVD.mean(), cv_results_SVD.std()))
Result without dimensionality Reduction: 0.936375 (0.010774)
Result with dimensionality Reduction: 0.887500 (0.012698)
The model results show that accuracy drops by about five percentage points, from 93.6% to 88.8%, while the time taken falls from about 7.8 seconds to about 2.3 seconds, a speed-up of roughly 3.4 times.
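The trade-off can be checked with a few lines of arithmetic on the timings and accuracies printed above. The variable names are introduced here only for this check.

```python
# Values copied from the printed outputs above
t_full = 7.781347990036011   # seconds without dimensionality reduction
t_svd = 2.281977653503418    # seconds with dimensionality reduction
acc_full, acc_svd = 0.936375, 0.887500

speedup = t_full / t_svd
acc_drop_pp = (acc_full - acc_svd) * 100  # drop in percentage points

print(f"speed-up: {speedup:.2f}x, accuracy drop: {acc_drop_pp:.2f} percentage points")
```

Timings from a single run are noisy, so the speed-up figure should be read as indicative rather than exact.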
Conclusion:
With dimensionality reduction, we gave up about five percentage points of accuracy in exchange for a roughly 3.4-times reduction in the time taken. In trading strategy development, when the datasets are huge and the number of features is large, such an improvement in time can significantly speed up the entire process.
We demonstrated that both SVD and t-SNE provide an interesting way of visualizing the trading signal data, and a way to distinguish the long and short signals of a trading strategy with a reduced number of features.