Bitcoin Price Prediction

The goal of this case study is tu use classification based models to predict whether the current signal is buy or sell depending on the short term vs long term price.

Content

# 1. Problem Definition

The problem of predicting a buy or sell signal for a trading strategy is defined in the classification framework, where the predicted variable has a value of 1 for buy and 0 for sell.

The buy or sell signal are decided on the basis on the comparison of short term vs. long term price.

For the purpose of presented case study, we get the data from one of the largest Bit‐ coin exchanges in terms of average daily volume traded—Bitstamp (https:// www.bitstamp.com). Data can be found at: https://www.kaggle.com/mczielinski/bitcoin-historical-data

# 2. Getting Started- Loading the data and python packages

## 2.1. Loading the python packages

# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import read_csv, set_option
from pandas.plotting import scatter_matrix
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

#Libraries for Deep Learning Models
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.optimizers import SGD
   Using TensorFlow backend.


## 2.2. Loading the Data

Note : Due to limit in the github for the data size, a sample of the data has been loaded in the jupyter notebook repository of this book. However, all the subsequent results in this jupyter notebook is with actual data (~150MB) under https://www.kaggle.com/mlfinancebook/bitstamp-bicoin-minutes-data. You should load the full data in case you want to reproduce the results.

# load dataset
dataset = pd.read_csv('BitstampData_sample.csv')
   #Diable the warnings
   import warnings
   warnings.filterwarnings('ignore')

# 3. Exploratory Data Analysis

## 3.1. Descriptive Statistics
# shape
dataset.shape
(2841377, 8)
# peek at data
set_option('display.width', 100)
dataset.tail(5)
Timestamp Open High Low Close Volume_(BTC) Volume_(Currency) Weighted_Price
2841372 1496188560 2190.49 2190.49 2181.37 2181.37 1.700166 3723.784755 2190.247337
2841373 1496188620 2190.50 2197.52 2186.17 2195.63 6.561029 14402.811961 2195.206304
2841374 1496188680 2195.62 2197.52 2191.52 2191.83 15.662847 34361.023647 2193.791712
2841375 1496188740 2195.82 2216.00 2195.82 2203.51 27.090309 59913.492565 2211.620837
2841376 1496188800 2201.70 2209.81 2196.98 2208.33 9.961835 21972.308955 2205.648801
# describe data
set_option('precision', 3)
dataset.describe()
Timestamp Open High Low Close Volume_(BTC) Volume_(Currency) Weighted_Price
count 2.841e+06 1.651e+06 1.651e+06 1.651e+06 1.651e+06 1.651e+06 1.651e+06 1.651e+06
mean 1.411e+09 4.959e+02 4.962e+02 4.955e+02 4.959e+02 1.188e+01 5.316e+03 4.959e+02
std 4.938e+07 3.642e+02 3.645e+02 3.639e+02 3.643e+02 4.094e+01 1.998e+04 3.642e+02
min 1.325e+09 3.800e+00 3.800e+00 1.500e+00 1.500e+00 0.000e+00 0.000e+00 3.800e+00
25% 1.368e+09 2.399e+02 2.400e+02 2.398e+02 2.399e+02 3.828e-01 1.240e+02 2.399e+02
50% 1.411e+09 4.200e+02 4.200e+02 4.199e+02 4.200e+02 1.823e+00 6.146e+02 4.200e+02
75% 1.454e+09 6.410e+02 6.417e+02 6.402e+02 6.410e+02 8.028e+00 3.108e+03 6.410e+02
max 1.496e+09 2.755e+03 2.760e+03 2.752e+03 2.755e+03 5.854e+03 1.866e+06 2.754e+03
# 4. Data Preparation ## 4.1. Data Cleaning
#Checking for any null values and removing the null values'''
print('Null Values =',dataset.isnull().values.any())
Null Values = True

Given that there are null values, we need to clean the data by filling the NaNs with the last available values.

dataset[dataset.columns.values] = dataset[dataset.columns.values].ffill()
   dataset=dataset.drop(columns=['Timestamp'])

## 4.2. Preparing the data for classification

We attach a label to each movement: * 1 if the signal is that short term price will go up as compared to the long term. * 0 if the signal is that short term price will go down as compared to the long term.

# Initialize the `signals` DataFrame with the `signal` column
#datas['PriceMove'] = 0.0

# Create short simple moving average over the short window
dataset['short_mavg'] = dataset['Close'].rolling(window=10, min_periods=1, center=False).mean()

# Create long simple moving average over the long window
dataset['long_mavg'] = dataset['Close'].rolling(window=60, min_periods=1, center=False).mean()

# Create signals
dataset['signal'] = np.where(dataset['short_mavg'] > dataset['long_mavg'], 1.0, 0.0)
dataset.tail()
Open High Low Close Volume_(BTC) Volume_(Currency) Weighted_Price short_mavg long_mavg signal
2841372 2190.49 2190.49 2181.37 2181.37 1.700 3723.785 2190.247 2179.259 2189.616 0.0
2841373 2190.50 2197.52 2186.17 2195.63 6.561 14402.812 2195.206 2181.622 2189.877 0.0
2841374 2195.62 2197.52 2191.52 2191.83 15.663 34361.024 2193.792 2183.605 2189.943 0.0
2841375 2195.82 2216.00 2195.82 2203.51 27.090 59913.493 2211.621 2187.018 2190.204 0.0
2841376 2201.70 2209.81 2196.98 2208.33 9.962 21972.309 2205.649 2190.712 2190.510 1.0
## 4.3. Feature Engineering

We begin by the constructing a dataset that contains the predictors which will be used to make the predictions, and the output variable.

The current Data of the bicoin consists of date, open, high, low, close and Volume. Using this data we calculate the following technical indicators: * Moving Average : A moving average provides an indication of the trend of the price movement by cut down the amount of “noise” on a price chart. * Stochastic Oscillator %K and %D : A stochastic oscillator is a momentum indicator comparing a particular closing price of a security to a range of its prices over a certain period of time. %K and %D are slow and fast indicators. * Relative Strength Index(RSI) :It is a momentum indicator that measures the magnitude of recent price changes to evaluate overbought or oversold conditions in the price of a stock or other asset. * Rate Of Change(ROC): It is a momentum oscillator, which measures the percentage change between the current price and the n period past price. * Momentum (MOM) : It is the rate of acceleration of a security’s price or volume – that is, the speed at which the price is changing.

#calculation of exponential moving average
def EMA(df, n):
    EMA = pd.Series(df['Close'].ewm(span=n, min_periods=n).mean(), name='EMA_' + str(n))
    return EMA
dataset['EMA10'] = EMA(dataset, 10)
dataset['EMA30'] = EMA(dataset, 30)
dataset['EMA200'] = EMA(dataset, 200)
dataset.head()

#calculation of rate of change
def ROC(df, n):
    M = df.diff(n - 1)
    N = df.shift(n - 1)
    ROC = pd.Series(((M / N) * 100), name = 'ROC_' + str(n))
    return ROC
dataset['ROC10'] = ROC(dataset['Close'], 10)
dataset['ROC30'] = ROC(dataset['Close'], 30)

#Calculation of price momentum
def MOM(df, n):
    MOM = pd.Series(df.diff(n), name='Momentum_' + str(n))
    return MOM
dataset['MOM10'] = MOM(dataset['Close'], 10)
dataset['MOM30'] = MOM(dataset['Close'], 30)

#calculation of relative strength index
def RSI(series, period):
 delta = series.diff().dropna()
 u = delta * 0
 d = u.copy()
 u[delta > 0] = delta[delta > 0]
 d[delta < 0] = -delta[delta < 0]
 u[u.index[period-1]] = np.mean( u[:period] ) #first value is sum of avg gains
 u = u.drop(u.index[:(period-1)])
 d[d.index[period-1]] = np.mean( d[:period] ) #first value is sum of avg losses
 d = d.drop(d.index[:(period-1)])
 rs = u.ewm(com=period-1, adjust=False).mean() / \
 d.ewm(com=period-1, adjust=False).mean()
 return 100 - 100 / (1 + rs)
dataset['RSI10'] = RSI(dataset['Close'], 10)
dataset['RSI30'] = RSI(dataset['Close'], 30)
dataset['RSI200'] = RSI(dataset['Close'], 200)

#calculation of stochastic osillator.

def STOK(close, low, high, n):
 STOK = ((close - low.rolling(n).min()) / (high.rolling(n).max() - low.rolling(n).min())) * 100
 return STOK

def STOD(close, low, high, n):
 STOK = ((close - low.rolling(n).min()) / (high.rolling(n).max() - low.rolling(n).min())) * 100
 STOD = STOK.rolling(3).mean()
 return STOD

dataset['%K10'] = STOK(dataset['Close'], dataset['Low'], dataset['High'], 10)
dataset['%D10'] = STOD(dataset['Close'], dataset['Low'], dataset['High'], 10)
dataset['%K30'] = STOK(dataset['Close'], dataset['Low'], dataset['High'], 30)
dataset['%D30'] = STOD(dataset['Close'], dataset['Low'], dataset['High'], 30)
dataset['%K200'] = STOK(dataset['Close'], dataset['Low'], dataset['High'], 200)
dataset['%D200'] = STOD(dataset['Close'], dataset['Low'], dataset['High'], 200)
#Calculation of moving average
def MA(df, n):
    MA = pd.Series(df['Close'].rolling(n, min_periods=n).mean(), name='MA_' + str(n))
    return MA
dataset['MA21'] = MA(dataset, 10)
dataset['MA63'] = MA(dataset, 30)
dataset['MA252'] = MA(dataset, 200)
dataset.tail()
Open High Low Close Volume_(BTC) Volume_(Currency) Weighted_Price short_mavg long_mavg signal ... RSI200 %K10 %D10 %K30 %D30 %K200 %D200 MA21 MA63 MA252
2841372 2190.49 2190.49 2181.37 2181.37 1.700 3723.785 2190.247 2179.259 2189.616 0.0 ... 46.613 56.447 73.774 47.883 59.889 16.012 18.930 2179.259 2182.291 2220.727
2841373 2190.50 2197.52 2186.17 2195.63 6.561 14402.812 2195.206 2181.622 2189.877 0.0 ... 47.638 93.687 71.712 93.805 65.119 26.697 20.096 2181.622 2182.292 2220.295
2841374 2195.62 2197.52 2191.52 2191.83 15.663 34361.024 2193.792 2183.605 2189.943 0.0 ... 47.395 80.995 77.043 81.350 74.346 23.850 22.186 2183.605 2182.120 2219.802
2841375 2195.82 2216.00 2195.82 2203.51 27.090 59913.493 2211.621 2187.018 2190.204 0.0 ... 48.213 74.205 82.963 74.505 83.220 32.602 27.716 2187.018 2182.337 2219.396
2841376 2201.70 2209.81 2196.98 2208.33 9.962 21972.309 2205.649 2190.712 2190.510 1.0 ... 48.545 82.810 79.337 84.344 80.066 36.440 30.964 2190.712 2182.715 2218.980

5 rows × 29 columns

dataset.tail()
Open High Low Close Volume_(BTC) Volume_(Currency) Weighted_Price short_mavg long_mavg signal ... RSI200 %K10 %D10 %K30 %D30 %K200 %D200 MA21 MA63 MA252
2841372 2190.49 2190.49 2181.37 2181.37 1.700 3723.785 2190.247 2179.259 2189.616 0.0 ... 46.613 56.447 73.774 47.883 59.889 16.012 18.930 2179.259 2182.291 2220.727
2841373 2190.50 2197.52 2186.17 2195.63 6.561 14402.812 2195.206 2181.622 2189.877 0.0 ... 47.638 93.687 71.712 93.805 65.119 26.697 20.096 2181.622 2182.292 2220.295
2841374 2195.62 2197.52 2191.52 2191.83 15.663 34361.024 2193.792 2183.605 2189.943 0.0 ... 47.395 80.995 77.043 81.350 74.346 23.850 22.186 2183.605 2182.120 2219.802
2841375 2195.82 2216.00 2195.82 2203.51 27.090 59913.493 2211.621 2187.018 2190.204 0.0 ... 48.213 74.205 82.963 74.505 83.220 32.602 27.716 2187.018 2182.337 2219.396
2841376 2201.70 2209.81 2196.98 2208.33 9.962 21972.309 2205.649 2190.712 2190.510 1.0 ... 48.545 82.810 79.337 84.344 80.066 36.440 30.964 2190.712 2182.715 2218.980

5 rows × 29 columns

#excluding columns that are not needed for our prediction.

dataset=dataset.drop(['High','Low','Open', 'Volume_(Currency)','short_mavg','long_mavg'], axis=1)
dataset = dataset.dropna(axis=0)
dataset.tail()
Open High Low Close Volume_(BTC) Volume_(Currency) Weighted_Price short_mavg long_mavg signal ... RSI200 %K10 %D10 %K30 %D30 %K200 %D200 MA21 MA63 MA252
2841372 2190.49 2190.49 2181.37 2181.37 1.700 3723.785 2190.247 2179.259 2189.616 0.0 ... 46.613 56.447 73.774 47.883 59.889 16.012 18.930 2179.259 2182.291 2220.727
2841373 2190.50 2197.52 2186.17 2195.63 6.561 14402.812 2195.206 2181.622 2189.877 0.0 ... 47.638 93.687 71.712 93.805 65.119 26.697 20.096 2181.622 2182.292 2220.295
2841374 2195.62 2197.52 2191.52 2191.83 15.663 34361.024 2193.792 2183.605 2189.943 0.0 ... 47.395 80.995 77.043 81.350 74.346 23.850 22.186 2183.605 2182.120 2219.802
2841375 2195.82 2216.00 2195.82 2203.51 27.090 59913.493 2211.621 2187.018 2190.204 0.0 ... 48.213 74.205 82.963 74.505 83.220 32.602 27.716 2187.018 2182.337 2219.396
2841376 2201.70 2209.81 2196.98 2208.33 9.962 21972.309 2205.649 2190.712 2190.510 1.0 ... 48.545 82.810 79.337 84.344 80.066 36.440 30.964 2190.712 2182.715 2218.980

5 rows × 29 columns

## 4.4. Data Visualization
dataset[['Weighted_Price']].plot(grid=True)
plt.show()
CaseStudies/output_36_0.png
# histograms
dataset.hist(sharex=False, sharey=False, xlabelsize=1, ylabelsize=1, figsize=(12,12))
plt.show()
CaseStudies/output_37_0.png
fig = plt.figure()
plot = dataset.groupby(['signal']).size().plot(kind='barh', color='red')
plt.show()
upward  52.87167614854236% out of 2342012 total datasize
CaseStudies/output_38_1.png

The predicted variable is upward 52.87% out of total data-size, meaning that number of the buy signals were more than the number of sell signals.

# correlation
correlation = dataset.corr()
plt.figure(figsize=(15,15))
plt.title('Correlation Matrix')
sns.heatmap(correlation, vmax=1, square=True,annot=True,cmap='cubehelix')
<matplotlib.axes._subplots.AxesSubplot at 0x2a900887c50>

We split the dataset into 80% training set and 20% test set.

   # split out validation dataset for the end
   subset_dataset= dataset.iloc[-100000:]
   Y= subset_dataset["signal"]
   X = subset_dataset.loc[:, dataset.columns != 'signal']
   validation_size = 0.2
   seed = 1
   X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=1)

## 5.2. Test Options and Evaluation Metrics
   # test options for classification
   num_folds = 10
   seed = 7
   scoring = 'accuracy'
   #scoring = 'precision'
   #scoring = 'recall'
   #scoring ='neg_log_loss'
   #scoring = 'roc_auc'

## 5.3. Compare Models and Algorithms

In order to know which algorithm technic is the best for our strategy, we evaluate following non linear different methods:

### 5.3.1. Models

# spot check the algorithms
models = []
models.append(('LR', LogisticRegression(n_jobs=-1)))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
#Neural Network
models.append(('NN', MLPClassifier()))
#Ensable Models
# Boosting methods
models.append(('AB', AdaBoostClassifier()))
models.append(('GBM', GradientBoostingClassifier()))
# Bagging methods
models.append(('RF', RandomForestClassifier(n_jobs=-1)))

K-folds cross validation

results = []
names = []
for name, model in models:
    kfold = KFold(n_splits=num_folds, random_state=seed)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
LR: 0.955325 (0.003463)
LDA: 0.889112 (0.005189)
KNN: 0.869538 (0.003701)
CART: 0.895537 (0.002440)
NB: 0.789650 (0.004450)
NN: 0.842437 (0.042223)
AB: 0.896513 (0.003328)
GBM: 0.908988 (0.003471)
RF: 0.916100 (0.003465)
# compare algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
fig.set_size_inches(15,8)
plt.show()

Random forest is selected for the grid search as it is one of the best models out of all.

# Grid Search: Random Forest Classifier
'''
n_estimators : int (default=100)
    The number of boosting stages to perform.
    Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance.
max_depth : integer, optional (default=3)
    maximum depth of the individual regression estimators.
    The maximum depth limits the number of nodes in the tree.
    Tune this parameter for best performance; the best value depends on the interaction of the input variables
criterion : string, optional (default=”gini”)
    The function to measure the quality of a split.
    Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.

'''
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
n_estimators = [20,80]
max_depth= [5,10]
criterion = ["gini","entropy"]
param_grid = dict(n_estimators=n_estimators, max_depth=max_depth, criterion = criterion )
model = RandomForestClassifier(n_jobs=-1)
kfold = KFold(n_splits=num_folds, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(rescaledX, Y_train)

#Print Results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
ranks = grid_result.cv_results_['rank_test_score']
for mean, stdev, param, rank in zip(means, stds, params, ranks):
    print("#%d %f (%f) with: %r" % (rank, mean, stdev, param))
   Best: 0.903438 using {'criterion': 'gini', 'max_depth': 10, 'n_estimators': 80}
   #7 0.868487 (0.002811) with: {'criterion': 'gini', 'max_depth': 5, 'n_estimators': 20}
   #5 0.870788 (0.002482) with: {'criterion': 'gini', 'max_depth': 5, 'n_estimators': 80}
   #2 0.902900 (0.002600) with: {'criterion': 'gini', 'max_depth': 10, 'n_estimators': 20}
   #1 0.903438 (0.003581) with: {'criterion': 'gini', 'max_depth': 10, 'n_estimators': 80}
   #8 0.865300 (0.004989) with: {'criterion': 'entropy', 'max_depth': 5, 'n_estimators': 20}
   #6 0.869950 (0.003233) with: {'criterion': 'entropy', 'max_depth': 5, 'n_estimators': 80}
   #4 0.900088 (0.003467) with: {'criterion': 'entropy', 'max_depth': 10, 'n_estimators': 20}
   #3 0.902438 (0.003209) with: {'criterion': 'entropy', 'max_depth': 10, 'n_estimators': 80}


# 7. Finalise the Model

Finalizing the model with best parameters found during tuning step.

## 7.1. Results on the Test Dataset

# prepare model
model = RandomForestClassifier(criterion='gini', n_estimators=80,max_depth=10,n_jobs=-1) # rbf is default kernel
#model = LogisticRegression()
model.fit(X_train, Y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=10, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=80, n_jobs=-1,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)
# estimate accuracy on validation set
predictions = model.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
0.9075
[[ 7852   966]
 [  884 10298]]
              precision    recall  f1-score   support

         0.0       0.90      0.89      0.89      8818
         1.0       0.91      0.92      0.92     11182

    accuracy                           0.91     20000
   macro avg       0.91      0.91      0.91     20000
weighted avg       0.91      0.91      0.91     20000
df_cm = pd.DataFrame(confusion_matrix(Y_validation, predictions), columns=np.unique(Y_validation), index = np.unique(Y_validation))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
sns.heatmap(df_cm, cmap="Blues", annot=True,annot_kws={"size": 16})# font sizes
<matplotlib.axes._subplots.AxesSubplot at 0x2a92ca6b7f0>

Feature Importance of the model

Importance = pd.DataFrame({'Importance':model.feature_importances_*100}, index=X.columns)
Importance.sort_values('Importance', axis=0, ascending=True).plot(kind='barh', color='r' )
plt.xlabel('Variable Importance')
Text(0.5, 0, 'Variable Importance')
#Create column for Strategy Returns by multiplying the daily returns by the position that was held at close
#of business the previous day
backtestdata = pd.DataFrame(index=X_validation.index)
#backtestdata = pd.DataFrame()
backtestdata['signal_pred'] = predictions
backtestdata['signal_actual'] = Y_validation
backtestdata['Market Returns'] = X_validation['Close'].pct_change()
backtestdata['Actual Returns'] = backtestdata['Market Returns'] * backtestdata['signal_actual'].shift(1)
backtestdata['Strategy Returns'] = backtestdata['Market Returns'] * backtestdata['signal_pred'].shift(1)
backtestdata=backtestdata.reset_index()
backtestdata.head()
backtestdata[['Strategy Returns','Actual Returns']].cumsum().hist()
backtestdata[['Strategy Returns','Actual Returns']].cumsum().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x2a942819780>
CaseStudies/output_66_1.png CaseStudies/output_66_2.png

Conclusion:

We showed that framing the problem is the first priority and we address it by engi‐ neering the features and transforming the labels according to the investment objective.

We demonstrated the efficiency of using feature engineering that leads to creation of intuitive features related to the trend and momentum of the price movement and increases the predictive power of the model.

In terms of the evaluation metrics for a classification-based trading strategy, accuracy or auc are appropriate, but in case the strategy is focusing to be more accurate while going long, the metric recall which focuses on less false positive can be preferred as compared to accuracy.

Finally, we demonstrated the backtesting framework which allows us to simulate a trading strategy using historical data to generate results and analyze risk and profita‐ bility before risking any actual capital.