Time Series Forecasting Using AUTO ARIMA + PROPHET + LightGBM

Predicting the price of Nifty 50 stocks using Machine Learning

Hargurjeet
7 min read · Jan 16, 2022
Photo by Nick Chong on Unsplash

In this post, I showcase an approach to get started with building a machine learning model that can forecast values, and how it differs from other supervised machine learning models. As a bonus, I also show you how to build interactive visuals using plotly. Let’s get started.

What is Time series forecasting?

As the name suggests, a time series is an ordered set of observations made over a period of time. Since a time series contains sequential data points recorded at successive intervals, it can be a very important tool for making predictions. Some of its major application areas include stocks and financial trading, analysing online and offline retail sales, and medical records such as heart rate, ECG (EKG) and MRI.

Time series datasets attract a lot of enthusiasm among data scientists. There are many different ways to approach a time series problem. The following models are explored in the notebook below

  • AUTO ARIMA
  • Prophet
  • LightGBM

Table Of Contents

  1. About the Dataset
  2. Loading the Dataset and Preprocessing
  3. Exploratory Data Analysis
  4. Feature Engineering
  5. Stationarity Conversion
  6. ARIMA and AUTO ARIMA
  7. Facebook Prophet
  8. Boosted Trees
  9. Summary
  10. Future Work
  11. Reference

№1:About the Dataset

The data is the price history and trading volumes of the fifty stocks in the NIFTY 50 index from the NSE (National Stock Exchange) of India. All datasets are at day level, with pricing and trading values split across .csv files for each stock, along with a metadata file containing some macro-information about the stocks themselves. The data spans from 1st January 2000 to 30th April 2021.

The dataset contains the following features

  • Date — Trading day
  • Symbol — Stock Name
  • Prev Close — The closing price of the stock on the previous day
  • Open — Opening Price on the given day
  • High — Highest Price on the given day
  • Low — Lowest Price on the given day
  • Last — Last Price on the given day
  • Close — Closing Price on the given day
  • VWAP — Volume Weighted Average Price, the average price at which the stock traded during the day, weighted by volume

Since we have data for 50 stocks, it makes sense to pick one stock at a time for the analysis. I am picking HDFC Bank for mine.

№2:Loading the Dataset and Preprocessing

We load the dataset from Kaggle using the opendatasets library.

I also import the standard Python libraries needed for data analysis and model building.

To download the dataset from Kaggle, you need to generate an API key on the Kaggle website. You can go through this article to understand the process.

Before starting any analysis, we need to check whether the dataset contains junk or missing data. If it does, it needs to be cleaned up.

As observed above, Trades, Deliverable Volume and %Deliverble contain null values. Let us get more details.

Trades, Deliverable Volume and %Deliverble have missing values. I chose to remove these columns, as they are unlikely to add significant value to the analysis.
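The null check and column removal described above can be sketched with pandas. This is a minimal illustration on a tiny made-up frame, not the notebook's actual code; the sample values are invented, and only the column names come from the dataset description.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the HDFC Bank CSV (all values are made up).
df = pd.DataFrame({
    "Date": pd.date_range("2021-01-01", periods=4, freq="D"),
    "VWAP": [1425.0, 1431.5, 1418.2, 1440.9],
    "Trades": [np.nan, np.nan, 120000.0, 125000.0],
    "Deliverable Volume": [np.nan, 3.1e6, 2.9e6, 3.3e6],
    "%Deliverble": [np.nan, 0.52, 0.49, 0.55],
})

# Count missing values per column to see where the gaps are.
null_counts = df.isnull().sum()

# Drop the three columns with gaps instead of imputing them.
clean = df.drop(columns=["Trades", "Deliverable Volume", "%Deliverble"])
```

Dropping is the simplest option here; imputing the gaps would be an alternative if those columns were needed as features.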

№3:Exploratory Data Analysis

3.1 Plotting VWAP over time

Building interactive visuals using plotly

Observation

  • A steady increase in the stock price until July 2011.
  • The stock then fell considerably and rose again until March 2020.
  • The stock sees another sharp fall in 2020. Could this be due to Covid? Let us check.

It is evident that the stock price came under pressure during the Covid period and only started recovering in November.

3.2 Plotting the moving average

When analysing stocks, it is useful to look at the moving averages to understand the dips and highs. Traders commonly use the following moving averages when deciding whether to buy or sell a stock

  • 50 day MA
  • 100 day MA
  • 200 day MA

Although the moving-average series is concatenated to the data frame, its position within the data frame is not correct. We need to shift the series by the number of days over which the moving average is calculated.
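As a hedged alternative sketch, the three moving averages can also be computed with pandas' rolling(), which labels each average with the last day of its window, so in this variant the values are already aligned with the dates. The price series below is synthetic and the MA_50, MA_100 and MA_200 column names are my own.

```python
import numpy as np
import pandas as pd

# Synthetic prices on business days: 1000, 1001, ..., 1299 (made-up data).
dates = pd.date_range("2020-01-01", periods=300, freq="B")
prices = pd.DataFrame({"VWAP": np.linspace(1000.0, 1299.0, 300)}, index=dates)

# Trailing moving averages; the first (window - 1) rows stay NaN
# because a full window is not yet available.
for window in (50, 100, 200):
    prices[f"MA_{window}"] = prices["VWAP"].rolling(window=window).mean()
```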

Pro Technique — Although plotting via matplotlib is easier, if you want interactive visuals you can build it using plotly

This analysis helps us with the following observations -

  • During the Covid period the stock hovered around its 50-day moving average.
  • The stock price is considerably above its 100-day and 200-day moving averages, indicating that this is a quality stock and has not been in the oversold zone.

3.3 Understanding the data distribution

The data is not normally distributed; however, this is what we usually expect from a time series.

3.4 Univariate analysis for High, Low, Open and Close

Insights

  • There is not much deviation between the four price series
  • There are two dips: one in 2012 and the other in 2020.

Univariate analysis of Volume of share over the years

Insights

  • There has been a lean period in traded volume from the year 2000 to 2018
  • HDFC has shown strong growth over the years, hence the volume has grown after the year 2012.

№4:Feature Engineering

Calculating mean and standard deviation

This technique is commonly applied to time series to smooth out noise and expose the underlying signal. Below I compute the moving mean and standard deviation over 3, 7 and 30 days
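A minimal sketch of these rolling features follows. The helper name add_rolling_features and the output column names are hypothetical, and shifting by one day before rolling is my assumption (it keeps each feature limited to information available before the day being predicted); the notebook may do this differently.

```python
import numpy as np
import pandas as pd

def add_rolling_features(df, column="VWAP", windows=(3, 7, 30)):
    """Add rolling mean and standard deviation columns for each window size."""
    out = df.copy()
    # Shift by one day so a row's features only use earlier observations.
    past = out[column].shift(1)
    for w in windows:
        out[f"{column}_mean_{w}"] = past.rolling(window=w).mean()
        out[f"{column}_std_{w}"] = past.rolling(window=w).std()
    return out

# Made-up series 1.0, 2.0, ..., 40.0 so the results are easy to check by hand.
dates = pd.date_range("2021-01-01", periods=40, freq="D")
frame = pd.DataFrame({"VWAP": np.arange(1.0, 41.0)}, index=dates)
features = add_rolling_features(frame)
```

The first rows of each feature are NaN until a full window of past values exists, so they need to be dropped or filled before training.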

№5:Stationarity Conversion

Before we build ML models on time series data, it is important to check whether the series is stationary.

What is a Stationarity Check?

A stationary series is one whose mean, variance and autocorrelation structure do not change over time. The illustration below gives a better idea

There are two ways to check whether a time series is stationary

  • By visual inspection — This can be done as per the image shown above
  • Dickey-Fuller Test
    It tests a null hypothesis against an alternative hypothesis. In the current case, the null hypothesis is that the time series is not stationary, whereas the alternative hypothesis is that the time series is stationary.

The decision is based on the p-value calculated by the test

  • p-value less than 0.05: the null hypothesis is rejected and the time series is considered stationary
  • p-value greater than 0.05: the null hypothesis cannot be rejected and the time series is treated as not stationary

Let us perform this test. For our convenience, Python already has a library that performs all the underlying calculations
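One library that provides this test is statsmodels, through its adfuller function (the augmented Dickey-Fuller test). The sketch below assumes statsmodels is installed and that the input is a pandas Series; the helper names interpret_adf and adf_report are hypothetical.

```python
def interpret_adf(p_value, alpha=0.05):
    """Return True when the null hypothesis (non-stationary) is rejected."""
    return p_value < alpha

def adf_report(series, alpha=0.05):
    """Run the augmented Dickey-Fuller test on a pandas Series."""
    # Imported lazily so interpret_adf can be used without statsmodels.
    from statsmodels.tsa.stattools import adfuller

    # adfuller returns a tuple; the first two items are the test
    # statistic and the p-value.
    test_stat, p_value = adfuller(series.dropna())[:2]
    return {
        "test_statistic": test_stat,
        "p_value": p_value,
        "stationary": interpret_adf(p_value, alpha),
    }
```

Calling adf_report on the VWAP column and again on its differenced version is a quick way to see whether differencing was enough.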

5.1 Decomposing time series

We decompose the time series to understand whether it is additive or multiplicative.

Additive time series: generally, in an additive time series the trend and seasonal variation are relatively constant over time. An additive series shows linear behaviour.

Multiplicative time series: generally, the trend and seasonal variations increase or decrease in magnitude over time. A multiplicative time series shows exponential behaviour.
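The two forms are usually written as follows, where the symbols are my own notation rather than the notebook's: y_t is the observed value, T_t the trend, S_t the seasonal component and R_t the residual.

```latex
% Additive decomposition: components are summed
y_t = T_t + S_t + R_t

% Multiplicative decomposition: components are multiplied
y_t = T_t \times S_t \times R_t

% Taking logs turns a multiplicative series into an additive one
\log y_t = \log T_t + \log S_t + \log R_t
```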

It is an additive time series as we don't see exponential growth in magnitude.

5.2 Implementing Shift()

We use the shift technique (first-order differencing) to make the time series stationary. We shift the data by one day, subtract the shifted values from the original values, and record the difference in a separate column. Because each value is replaced by its change from the previous day, the trend is removed and the differenced series fluctuates around a roughly constant level, which typically makes it stationary.
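A minimal sketch of the shift-and-subtract step on made-up numbers follows. The variable names are mine, and pandas' built-in diff() gives the same result in one call.

```python
import pandas as pd

# Made-up series with a steady upward trend of 2 per day.
vwap = pd.Series([100.0, 102.0, 104.0, 106.0, 108.0], name="VWAP")

# Shift by one day, then subtract yesterday's value from today's.
shifted = vwap.shift(1)
diff = (vwap - shifted).dropna()  # the first row has no previous day

# Equivalent shortcut provided by pandas.
diff_builtin = vwap.diff().dropna()
```

Here the trending series becomes a constant series of daily changes, which is the effect the differencing step relies on.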

№6:ARIMA and AUTO ARIMA

At a high level, developing an ARIMA model involves the following steps

  • Making the time series stationary
  • Computing the p, q and d values, which refer to the following
  • p — order of the auto-regressive part (number of lagged observations used)
  • q — order of the moving-average part (number of lagged forecast errors used)
  • d — degree of differencing needed to make the series stationary

AUTO ARIMA: Auto ARIMA simplifies the process described above. It determines the differencing needed to make the time series stationary and automatically selects the p, q and d values.
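One common implementation is auto_arima from the pmdarima package. The sketch below assumes pmdarima is installed and that the series is split chronologically; the helper names and the non-seasonal, stepwise settings are my assumptions, not necessarily what the notebook uses.

```python
import pandas as pd

def chronological_split(series, test_size):
    """Keep time order: the last `test_size` points become the test set."""
    return series.iloc[:-test_size], series.iloc[-test_size:]

def fit_auto_arima(train, horizon):
    """Fit Auto ARIMA on `train` and forecast `horizon` steps ahead."""
    # Imported lazily so the split helper works without pmdarima.
    import pmdarima as pm

    model = pm.auto_arima(
        train,
        seasonal=False,          # assumption: no seasonal terms
        stepwise=True,           # faster stepwise search over p and q
        suppress_warnings=True,
        error_action="ignore",   # skip candidate orders that fail to fit
    )
    forecast = model.predict(n_periods=horizon)
    # model.order holds the selected (p, d, q) tuple.
    return model.order, forecast
```

A random shuffle split would leak future prices into training, which is why the split keeps the last points as the test set.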

Displaying the forecasted values in orange

Model evaluation can be done using the RMSE and MAE error metrics
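Both metrics are simple to compute by hand, which helps when comparing models on the same test window. This is a small NumPy sketch with made-up numbers; the function names are my own.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error: penalises large misses more heavily."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mae(actual, predicted):
    """Mean absolute error: the average size of a miss, in price units."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

# Made-up example: the errors are -1, 2 and 0.
actual = [100.0, 102.0, 104.0]
predicted = [101.0, 100.0, 104.0]
example_rmse = rmse(actual, predicted)
example_mae = mae(actual, predicted)
```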

№7:Facebook Prophet

Prophet is an open-source time series model developed by Facebook. It was released in early 2017.
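Prophet requires its input frame to have a date column named ds and a value column named y. The sketch below shows that conversion and a basic fit; it assumes the prophet package is installed, and the helper names and default model settings are my own choices.

```python
import pandas as pd

def to_prophet_frame(df, date_col="Date", value_col="VWAP"):
    """Rename the date and value columns to the 'ds' and 'y' names Prophet requires."""
    out = df[[date_col, value_col]].rename(columns={date_col: "ds", value_col: "y"})
    out["ds"] = pd.to_datetime(out["ds"])
    return out

def fit_prophet(train_df, horizon):
    """Fit Prophet with default settings and return the last `horizon` forecasts."""
    # Imported lazily so the conversion helper works without prophet.
    from prophet import Prophet

    model = Prophet()
    model.fit(train_df)
    # The future frame uses calendar days by default, so for trading data
    # it will include weekends and holidays.
    future = model.make_future_dataframe(periods=horizon)
    forecast = model.predict(future)
    return forecast[["ds", "yhat"]].tail(horizon)
```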

Both errors are lower for AUTO ARIMA than for Prophet. Hence AUTO ARIMA is more accurate here.

№8:Boosted Trees

Traditionally, boosted trees have performed very well in Kaggle competitions, as they tend to drive the error down effectively. Let us check them out.

AUTO ARIMA seems to have outperformed the other two models.
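Unlike ARIMA and Prophet, a boosted-tree model needs the series turned into a table of input features and a target. Below is a hedged sketch using lagged values as features and LightGBM's scikit-learn style LGBMRegressor; it assumes lightgbm is installed, and the helper names, lag count and hyperparameters are illustrative only.

```python
import pandas as pd

def make_lag_features(series, n_lags=3):
    """Build a supervised table: lag_1..lag_n as inputs, 'target' as output."""
    data = {f"lag_{k}": series.shift(k) for k in range(1, n_lags + 1)}
    data["target"] = series
    # Rows without a full set of lags are dropped.
    return pd.DataFrame(data).dropna()

def fit_lightgbm(train_table):
    """Fit a LightGBM regressor on the lag-feature table."""
    # Imported lazily so the feature helper works without lightgbm.
    from lightgbm import LGBMRegressor

    X = train_table.drop(columns="target")
    y = train_table["target"]
    model = LGBMRegressor(n_estimators=200, learning_rate=0.05)
    model.fit(X, y)
    return model
```

The rolling mean and standard deviation features from the feature engineering step can be added to the same table as extra columns.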

№9:Summary

Following is the summary of the steps we performed while doing the analysis.

  • Downloaded the dataset from Kaggle.
  • Performed pre-processing such as importing the required libraries, checking for null values and removing redundant columns.
  • Performed exploratory data analysis
    Key insights
    1. Steady growth of VWAP over time
    2. The stock has stayed around its 50-day moving average since last year
    3. The stock has major dips in the years 2012 and 2020.
  • Performed feature engineering, calculating the mean and standard deviation over periods of 3, 7 and 30 days.
  • Checked the time series for stationarity, ran the Dickey-Fuller test and applied the shift() technique to make the time series stationary.
  • Trained models using AUTO ARIMA, Prophet and boosted trees.
  • When comparing RMSE scores, AUTO ARIMA seems to have performed best.

№10:Future Work

  • Although AUTO ARIMA is very powerful, a manually tuned ARIMA model can also be fitted to compare the outputs.
  • You can try training an XGBoost model and see if it performs better than AUTO ARIMA.

№11:Reference

I really hope you guys learned something from this post. Feel free to clap if you like what you learnt. Let me know if there is anything you need my help with. Feel free to reach out over LinkedIn

Hargurjeet

Data Science Practitioner | Machine Learning | Neural Networks | PyTorch | TensorFlow