Crypto & Stock: Daily Price Prediction using ML algorithms

Lakshmi Ajay
May 28, 2023

Stock price prediction has been a critical area in research and has proven to be a tough nut to crack. It is probably one of the most sought-after applications in the finance domain.

With visibility into the stock price data for the past several years, the natural question that arises is,

Are there any patterns in the daily price variation that a machine-learning model can learn from?

If yes, machine learning models can be trained to learn these patterns from past stock price data. Based on what it has learned, a model can then predict how the prices will fluctuate in the near future.

In this article, we will be using machine learning models to learn patterns based on technical indicators. Apart from the technical factors, other factors like the company’s fundamentals and the human psychological factors play an important role in deciding a stock’s price. Hence it becomes very difficult to predict the prices with high accuracy. The numbers from the presented models can only be used as guiding parameters.

DISCLAIMER

The article is intended only for educational purposes. Please use it at your own discretion for any real-world use case.

Overview

The goal of this article is to show how technical indicators can be added as features to the stock data and to develop a simple framework to train & test ML-based prediction models.

As shown in the illustration below, all the typical steps involved in solving a regression problem are applied. The same approach will work for any kind of input data with a daily price associated with it, be it a stock, cryptocurrency, or commodity.

In this example, we will use Bitcoin (BTC-USD) to walk through the steps.

Cryptocurrency Price Prediction — High-level Flow

At a high level, the problem statement can be split into the following 4 steps.

Step 1 — Fetch the stock price data
Step 2 — Prepare the data with the technical indicators
Step 3 — Train & test the regression models
Step 4 — Compare the results

Refer to the GitHub repo linked at the end of the article to get full access to the source code. To avoid cluttering, the complete code-base is not included here.

Step 1 — Fetch the Data

Any package that can fetch the daily OHLC (Open-High-Low-Close) data can be used. In this example, the yfinance package (Yahoo Finance) is used to fetch the daily data of BTC-USD.

Although the models are trained on the last 5 years of data, 6 years of data are pulled initially. The extra year allows derived features that depend on past dates, such as the 200-day moving average, to be computed for every row of the 5-year window.

In the final setup, the models are trained on the 5-year data and tested on the most recent 5 days.

import yfinance as yf
df = yf.Ticker("BTC-USD").history(period="6y")
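
The need for the extra year of history can be seen in a small sketch. It uses synthetic prices rather than real BTC-USD quotes, and the names `close`, `sma_200`, and `warmup_rows` are made up for this illustration. A 200-day rolling mean is undefined until 200 observations are available, so the earliest rows come out as NaN.

```python
import numpy as np
import pandas as pd

# Synthetic daily closing prices (hypothetical data, not real BTC-USD quotes)
dates = pd.date_range("2020-01-01", periods=400, freq="D")
close = pd.Series(np.linspace(100.0, 500.0, num=400), index=dates, name="Close")

# A 200-day simple moving average needs 200 observations before it is defined
sma_200 = close.rolling(window=200).mean()

# Rows lost to the warm-up period of the indicator
warmup_rows = int(sma_200.isna().sum())

# Rows for which the indicator is available
usable = sma_200.dropna()

print(f"Warm-up rows without a 200-day average: {warmup_rows}")
print(f"First date with a valid value: {usable.index[0].date()}")
```

Pulling more history than the training window means these warm-up rows fall outside the 5 years that are actually used.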

A basic EDA is performed on the downloaded data.

It can be noticed that the distribution of the bitcoin prices is slightly left skewed. For linear models, it is fine if the target variable does not follow a normal distribution; the normality assumption applies to the residuals. Hence it is not necessary to use the log of the target here.

Distribution of the target variable (BTC-USD closing price)

Also, there are no missing values or infinite values in the dataset. So we are good to proceed.

# Get count of infinite values
import numpy as np
# isin() already returns a boolean mask, so it can be summed directly
df.isin([np.inf, -np.inf]).values.sum()
# 0

# Get count of NaN values
df.isna().sum().sum()
# 0

Step 2 — Prepare the Data

Feature Engineering

Before running the machine learning models, the input dataset needs to be prepared. Here, new features are derived from the downloaded price data. Technical indicators such as moving averages, RSI, MACD, and Bollinger Bands can be derived from past data.

Refer to this article, Beginner’s Guide to Technical Analysis in Python for Algorithmic Trading, to get a deeper understanding of each of these technical indicators.

The Python package TA-Lib is used to extract the features. The code to create this derived dataset is available in the GitHub repo. A snapshot of the final data after feature engineering is shown below.

Sample data after adding technical indicators (Image by Author)
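
To give a feel for what such derived features look like, here is a minimal sketch that computes a few indicators with plain pandas. This is not the TA-Lib code from the repo. The function name `add_indicators`, the column names, and the window lengths (20, 12/26/9, 14) are assumptions chosen for illustration, and the values may differ slightly from what TA-Lib returns.

```python
import numpy as np
import pandas as pd

def add_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Add a few technical indicators derived from the 'Close' column."""
    out = df.copy()
    close = out["Close"]

    # 20-day simple moving average and Bollinger Bands (2 standard deviations)
    out["SMA_20"] = close.rolling(window=20).mean()
    std_20 = close.rolling(window=20).std()
    out["BB_UPPER"] = out["SMA_20"] + 2 * std_20
    out["BB_LOWER"] = out["SMA_20"] - 2 * std_20

    # MACD: difference of the 12- and 26-day EMAs, plus a 9-day signal line
    ema_12 = close.ewm(span=12, adjust=False).mean()
    ema_26 = close.ewm(span=26, adjust=False).mean()
    out["MACD"] = ema_12 - ema_26
    out["MACD_SIGNAL"] = out["MACD"].ewm(span=9, adjust=False).mean()

    # 14-day RSI using Wilder-style exponential smoothing
    delta = close.diff()
    gain = delta.clip(lower=0)
    loss = -delta.clip(upper=0)
    avg_gain = gain.ewm(alpha=1 / 14, adjust=False, min_periods=14).mean()
    avg_loss = loss.ewm(alpha=1 / 14, adjust=False, min_periods=14).mean()
    out["RSI_14"] = 100 - 100 / (1 + avg_gain / avg_loss)
    return out

# Synthetic random-walk prices (hypothetical data, not real quotes)
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    {"Close": 100 + rng.normal(0, 1, 120).cumsum()},
    index=pd.date_range("2023-01-01", periods=120, freq="D"),
)

# Drop the warm-up rows where the rolling windows are not yet filled
features = add_indicators(prices).dropna()
print(features.tail())
```

Every indicator here is computed from past and current prices only, which is what allows it to be used as a model feature.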

After adding the derived features, filter the data for the last 5 years for further processing.

# Consider last 5y data
from datetime import date
from dateutil.relativedelta import relativedelta
import pandas as pd

start_date_5yrs = date.today() - relativedelta(years=5)
stock_derived_df = stock_derived_complete_df[stock_derived_complete_df.index > pd.to_datetime(start_date_5yrs)]
stock_derived_df.shape

# (1826, 23)

Train-test split

Split the dataframe in time order: the most recent 5 days are held out to backtest the models, and the models are trained on the earlier data from the 5-year window.

Train-Test split
Train Size: (1622, 19), train target shape:(1622,)
Test Size: (5, 19), test target shape:(5,)
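
A time-ordered split of this kind could be sketched as follows. The helper `split_last_n_days`, the toy dataframe, and the column name `target` are hypothetical and only illustrate the idea; the actual split code lives in the repo.

```python
import numpy as np
import pandas as pd

def split_last_n_days(df, target_col, n_test=5):
    """Hold out the final n_test rows (by date) as the test set."""
    df = df.sort_index()  # make sure rows are in chronological order
    train_df, test_df = df.iloc[:-n_test], df.iloc[-n_test:]
    X_train = train_df.drop(columns=[target_col])
    y_train = train_df[target_col]
    X_test = test_df.drop(columns=[target_col])
    y_test = test_df[target_col]
    return X_train, X_test, y_train, y_test

# Hypothetical toy data: 30 daily rows, two features and a target column
idx = pd.date_range("2023-01-01", periods=30, freq="D")
toy_df = pd.DataFrame(
    {
        "feat_a": np.arange(30.0),
        "feat_b": np.arange(30.0) * 2,
        "target": np.arange(30.0) + 100,
    },
    index=idx,
)

X_train, X_test, y_train, y_test = split_last_n_days(toy_df, target_col="target")
print(f"Train Size: {X_train.shape}, train target shape: {y_train.shape}")
print(f"Test Size: {X_test.shape}, test target shape: {y_test.shape}")
```

Unlike a random split, no shuffling is involved, so every test row is later in time than every training row.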

Handle missing data

Check for any missing values in the data after adding the technical indicators.

stock_derived_df.isna().sum().sum()

# 0

There are no missing entries in the data.

Feature Scaling

Since the features span different ranges, they need to be scaled so that features with larger magnitudes do not dominate the analysis. Two popular scalers are the min-max scaler and the standard scaler. The min-max scaler rescales the data to a fixed range, normally between 0 and 1. The standard scaler rescales the data to a mean of 0 and a standard deviation of 1. Both are linear transformations, so the shape of each feature distribution is preserved.

The sklearn package in Python supports both scaling techniques through the MinMaxScaler and the StandardScaler. The code snippet below uses the MinMaxScaler to fit and transform the training data, and then reuses the same fitted scaler to transform the test data.

# Scale the features
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Single scaler instance shared by both helper functions
minmax_scaler = MinMaxScaler()

def fit_transform_scaler(df_X):
    # Fit the scaler on the training features and return them scaled
    global minmax_scaler
    df_X_transformed = minmax_scaler.fit_transform(df_X)
    df_X_scaled = pd.DataFrame(columns=df_X.columns, data=df_X_transformed, index=df_X.index)
    return df_X_scaled

def transform_scaler(df_X):
    # Scale new data (e.g. the test set) with the already fitted scaler
    global minmax_scaler
    df_X_transformed = minmax_scaler.transform(df_X)
    df_X_scaled = pd.DataFrame(columns=df_X.columns, data=df_X_transformed, index=df_X.index)
    return df_X_scaled
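
The effect of fitting on the training data only can be seen in a small worked example. The numbers and the names `train_X` and `test_X` are made up for illustration. Min-max scaling maps each value x to (x - min) / (max - min), where min and max come from the training data.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: one feature, three training rows and two test rows
train_X = pd.DataFrame({"feature": [10.0, 20.0, 30.0]})
test_X = pd.DataFrame({"feature": [25.0, 40.0]})

scaler = MinMaxScaler()

# Learn min (10) and max (30) from the training data only
train_scaled = pd.DataFrame(
    scaler.fit_transform(train_X), columns=train_X.columns, index=train_X.index
)

# Reuse the training min and max for the test data
test_scaled = pd.DataFrame(
    scaler.transform(test_X), columns=test_X.columns, index=test_X.index
)

print(train_scaled["feature"].tolist())
print(test_scaled["feature"].tolist())
```

A test value above the training maximum is scaled to a number greater than 1. This is expected, since the scaler never sees the test data while it is being fitted.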

Step 3 — Train & Test the Models

As part of this experiment, multiple regression models are trained and tested. Apart from machine-learning regression models like random forest, gradient boosting, and XGB, time-series regression models like Prophet and MiniRocket are also evaluated.

Basic hyperparameter tuning is done on the tree-based algorithms. The other algorithms are run with their default settings.

The models are trained on the 5-year training data and then made to predict the closing price for the next five days. Shown below is a comparison of the closing prices predicted by the different models.

Model predictions for the next 5 days (Image by author)

Step 4 — Compare the Model Results

Multiple ML models are trained on the past 5 years of data. The data from the last 5 days is used to test the models. The predictions are compared against the actual price movement.

The models should be trained separately for each scrip (cryptocurrency or stock) that needs to be evaluated. If a scrip shows any seasonality, the time-series models will account for it.

The RMSE (root mean squared error) score is used to evaluate the models and compare them against each other. Only basic hyperparameter tuning is done in this example, as the goal is to provide a framework for further practice.
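
As a reminder of how the metric is computed, here is a minimal sketch of RMSE with NumPy. The helper name `rmse` and the five actual and predicted prices are made up for illustration and are not results from the article.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical 5-day actual and predicted closing prices
actual = [100.0, 102.0, 101.0, 105.0, 107.0]
predicted = [101.0, 101.0, 103.0, 104.0, 110.0]

# Squared errors are 1, 1, 4, 1, 9, so the mean is 3.2 and RMSE is sqrt(3.2)
score = rmse(actual, predicted)
print(f"RMSE: {score:.4f}")
```

RMSE is expressed in the same unit as the price, so a lower score means the predictions were closer to the actual closing prices.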

5-days prediction across ML models — comparison (Image by Author)
RMSE for 5 days prediction (Image by Author)

The time-series regression algorithm, Prophet, performs the best on this dataset. The Prophet and MiniRocket results come from their base configurations, i.e. these models have not been tuned and the default settings have been used.

For the tree-based models like random forest, gradient boosting, and XGB, hyperparameter tuning has been performed using randomized search with 100 iterations and 5-fold cross-validation for each set of hyperparameters.
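
A randomized search of this kind could look roughly like the sketch below, which uses scikit-learn's RandomizedSearchCV with a random forest. The synthetic data, the parameter grid, and the reduced `n_iter=5` are assumptions made to keep the example small and fast. They are not the settings from the repo, where 100 iterations are used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (hypothetical, stands in for the scaled features)
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=120)

# Illustrative search space, not the one used in the article
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,  # the article uses 100; reduced here for speed
    cv=5,      # 5-fold cross-validation for each sampled combination
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # scores are negated RMSE values
print(best_params, round(best_rmse, 4))
```

Randomized search samples a fixed number of combinations instead of trying all of them, which keeps the tuning cost predictable as the search space grows.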

Doing it the Python way!!

Refer to this GitHub repo to get access to the complete code: Crypto-Stock Predictor — using ML

What Next?

Presented here is the basic machine learning framework with technical indicators as features. Each of the models can be tuned further to identify the one that returns the best results. Running deep learning models like LSTM or Bi-LSTM could further enhance the results.

The fluctuation in the daily prices is way beyond the scope of the technical indicators used in this work and it depends on various other external factors. As mentioned initially, these numbers can only be used to give directional guidance and not to provide any buy/sell recommendations.

Happy Learning & Happy Trading!!
