Time Series Forecasting Using AUTO ARIMA + PROPHET + LightGBM
Predicting the price of Nifty 50 stocks using Machine Learning
In this post, I showcase an approach to get started with building a machine learning model that can forecast values, and how this differs from other supervised machine learning models. As a bonus lesson, I also show you how to build interactive visuals using plotly. Let’s get started.
What is Time series forecasting?
As the name suggests, a time series is an ordered set of observations made over a period of time. Since a time series contains sequential data points recorded at successive time intervals, it can be a very important tool for making predictions. Some of its major application areas include stocks and financial trading, analysing online and offline retail sales, and medical records such as heart rate, ECG (EKG), and MRI.
Time series datasets generate a lot of enthusiasm among data scientists. There are many different ways to approach a time series problem. The following models are explored in the notebook:
- AUTO ARIMA
- Prophet
- LightGBM
Table Of Contents
- About the Dataset
- Loading the dataset and Preprocessing
- Exploratory Data Analysis
- Feature Engineering
- Stationary Conversion
- ARIMA and AUTO ARIMA
- Facebook Prophet
- Boosted Trees
- Summary
- Future Work
- Reference
№1:About the Dataset
The data is the price history and trading volumes of the fifty stocks in the NIFTY 50 index from NSE (National Stock Exchange) India. All datasets are at day level, with pricing and trading values split across .csv files for each stock, along with a metadata file with some macro-information about the stocks themselves. The data spans from 1st January 2000 to 30th April 2021.
The dataset contains the following features
- Date — Trading day
- Symbol — Stock Name
- Prev Close — The closing price of the stock on the previous day
- Open — Opening Price on the given day
- High — Highest Price on the given day
- Low — Lowest Price on the given day
- Last — Last Price on the given day
- Close — Closing Price on the given day
- VWAP — Volume-weighted average price of the stock traded throughout the day
Since we have data for 50 stocks available, it makes sense to pick one stock at a time and perform the analysis. Hence I am picking HDFC Bank for my analysis.
№2:Loading the dataset and Preprocessing
We load the dataset from Kaggle using the opendatasets library.
I also import the other standard libraries needed for data analysis, model building, etc.
Below is the list of python libraries imported to work on the dataset
To download the dataset from Kaggle you need to generate the API key from the Kaggle website. You can go through this article to understand the process
Before we proceed with any analysis, we need to check whether the dataset contains any junk data. If it does, it needs to be cleaned up.
As observed above, Trades, Deliverable Volume and %Deliverble contain null values. Let us try to get more details.
We have missing values in Trades, Deliverable Volume and %Deliverble. I choose to remove these columns as they might not add significant value to our analysis.
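The null check and column removal described above can be sketched as follows. This is a minimal illustration on a tiny made-up frame, not the original notebook code; only the column names come from the post, and the values are invented.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the HDFC Bank frame (all values are made up).
df = pd.DataFrame({
    "Date": pd.date_range("2021-01-01", periods=5, freq="D"),
    "VWAP": [1400.0, 1410.5, 1405.2, 1420.1, 1415.8],
    "Trades": [np.nan, np.nan, 1200.0, 1350.0, 1280.0],
    "Deliverable Volume": [np.nan, 5000.0, 5200.0, 5100.0, 4900.0],
    "%Deliverble": [np.nan, 0.52, 0.55, 0.51, 0.49],
})

# Count missing values per column and list the columns that have any.
null_counts = df.isnull().sum()
cols_with_nulls = null_counts[null_counts > 0].index.tolist()
print(null_counts)

# Drop the incomplete columns, which is the choice made in this post.
df_clean = df.drop(columns=cols_with_nulls)
print(df_clean.columns.tolist())
```

Dropping is the simplest option; imputing the gaps would be an alternative if those columns were needed later.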
№3:Exploratory Data Analysis
3.1 Plotting VWAP over time
Building interactive visuals using plotly
Observation
- A steady increase in stock price till July 2011.
- The stock fell considerably and was on the rise again till March 2020.
- The stock again sees a sharp fall in 2020. Could this be due to Covid? Let us check.
It is evident that stock price felt the pressure during the Covid times and only started recovering in November
3.2 Plotting the moving average
While analysing stocks, it is a good idea to look at the moving averages to understand the dips and highs. Traders prefer the following moving averages when making a call to buy or sell a stock:
- 50 day MA
- 100 day MA
- 200 day MA
Although the series is concatenated with the data frame the position of the series within the data frame is not correct. We are required to shift the series based on the number of days the moving averages are calculated.
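For comparison, here is a minimal sketch of the same moving averages computed with pandas `rolling`. It is an alternative to the manual approach, not the notebook's code: `rolling(...).mean()` labels each average with the last day of its window, so no extra shifting is needed. The price series is synthetic and the `MA_*` column names are my own.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk prices standing in for the VWAP column (made-up data).
rng = np.random.default_rng(0)
dates = pd.date_range("2019-01-01", periods=400, freq="B")
df = pd.DataFrame({"VWAP": 1000 + rng.normal(0, 5, size=400).cumsum()}, index=dates)

# Each moving average is aligned to the last day of its window.
# The first (window - 1) rows are NaN because the window is not yet full.
for window in (50, 100, 200):
    df[f"MA_{window}"] = df["VWAP"].rolling(window=window).mean()

print(df.tail())
```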
Pro Technique — Although plotting via matplotlib is easier, if you want interactive visuals you can build it using plotly
This analysis helps us with the following observations -
- During Covid times the stock has revolved around its 50 days moving average.
- The stock price is considerably up from its 200 and 100 days moving average indicating that this is a quality stock and has not been in the oversold zone.
3.3 Understanding the data distribution
The data is not normally distributed. However, this is what we usually expect from a time series.
3.4 Univariate analysis for High, Low, Open and Close
Insights
- There is not much deviation between all the parameters
- There are 2 dips one is in the year 2012 and the other one is in 2020.
Univariate analysis of Volume of share over the years
Insights
- There has been a lean period in share volume from the year 2000 to 2018
- HDFC has shown strong growth over the years hence after the year 2012 the volume has grown.
№4:Feature Engineering
Calculating mean and standard deviation
Rolling statistics are commonly applied to time series to smooth out the noise and expose the underlying signal. Below I consider the moving mean and standard deviation over 3, 7 and 30 days.
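A minimal sketch of these rolling features with pandas. The data is synthetic and the column names are mine. The `shift(1)` is my own assumption rather than something stated in the post: it makes each row use only earlier days, so a feature never contains the value it will later help predict.

```python
import numpy as np
import pandas as pd

# Synthetic prices standing in for the VWAP column (made-up data).
rng = np.random.default_rng(1)
dates = pd.date_range("2021-01-01", periods=60, freq="B")
df = pd.DataFrame({"VWAP": 1000 + rng.normal(0, 5, size=60).cumsum()}, index=dates)

# Use only past values: row t sees VWAP up to day t-1 (assumption, see above).
past = df["VWAP"].shift(1)

for window in (3, 7, 30):
    df[f"VWAP_mean_{window}"] = past.rolling(window).mean()
    df[f"VWAP_std_{window}"] = past.rolling(window).std()  # sample std (ddof=1)

print(df.tail())
```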
№5:Stationary Conversion
Before we work on building ML models on the time series data, it is important to check whether the series is stationary.
What is a Stationarity Check
A stationary series has the property that its mean, variance and autocorrelation structure do not change over time. The illustration below gives you a better idea.
There are two ways a time series can be checked to be stationary
- By visual inspection — This can be done as per the image shown above
- Dickey-Fuller Test
It performs validation of null hypothesis and alternate hypothesis. In the current case, the null hypothesis would be the time series is not stationary whereas the alternate hypothesis would be time series is stationary.
The hypothesis is determined based on the p-value which gets calculated after applying the test
- p-value less than 0.05: the null hypothesis is rejected and the time series is treated as stationary
- p-value greater than 0.05: the null hypothesis cannot be rejected and the time series is treated as not stationary
Let us perform this test. For our convenience python already has a library that performs all the underlying calculations
5.1 Decomposing time series
We decompose the time series to understand whether it is additive or multiplicative in nature.
Additive time series
Generally, in an additive time series the trend and seasonal variation are relatively constant over time. The additive series shows linear behaviour.
Multiplicative time series
Generally, the trend and seasonal variations increase or decrease in magnitude over time. The multiplicative time series shows exponential behaviour.
It is an additive time series as we don't see exponential growth in magnitude.
5.2 Implementing Shift()
We implement the shift technique to make the time series stationary. We shift the data by one day, subtract the shifted values from the original values, and record the difference in a separate column. Because each value is replaced by its change from the previous day, the trend is largely removed and the resulting series fluctuates around a roughly constant level, which brings the plot much closer to stationary.
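The shift-and-subtract step can be sketched on a toy frame like this. The values and the `VWAP_prev` and `VWAP_diff` column names are made up for illustration.

```python
import pandas as pd

# Toy prices (made-up values) indexed by business day.
df = pd.DataFrame(
    {"VWAP": [100.0, 102.0, 101.0, 105.0, 107.0]},
    index=pd.date_range("2021-01-04", periods=5, freq="B"),
)

# Yesterday's value placed on today's row.
df["VWAP_prev"] = df["VWAP"].shift(1)

# Day-over-day change; the first row is NaN because it has no previous day.
df["VWAP_diff"] = df["VWAP"] - df["VWAP_prev"]

print(df)
```

`df["VWAP"].diff()` produces the same column in a single call.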
№6:ARIMA and AUTO ARIMA
At a high level, the following steps are performed while developing ARIMA models:
- Making the time series stationary
- Computing the p, q and d values, where p, q and d refer to the following
- p — Auto-Regressive
- q — Moving average
- d — differencing
AUTO ARIMA
Auto ARIMA simplifies the process described above. It converts the time series to stationary and automatically selects the p, q and d values.
Displaying the forecasted values in orange
Model evaluation can be done based on RMSE and MAE errors
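Both metrics are easy to compute by hand. The sketch below uses made-up actual and forecast values; the function names are my own.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error: average size of the forecast errors."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error: like MAE, but penalises large errors more."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up actuals and forecasts; the errors are -1, 1, -2 and 0.
actual = [100, 102, 101, 105]
forecast = [101, 101, 103, 105]

print("MAE :", mae(actual, forecast))   # (1 + 1 + 2 + 0) / 4 = 1.0
print("RMSE:", rmse(actual, forecast))  # sqrt((1 + 1 + 4 + 0) / 4) = sqrt(1.5)
```

Lower is better for both, and they are only comparable across models when computed on the same validation period.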
№7:Facebook Prophet
Prophet is an open-source time series model developed by Facebook. It was released in early 2017.
Both errors are lower for AUTO ARIMA than for Prophet. Hence AUTO ARIMA is the more accurate model here.
№8:Boosted Trees
Boosted trees have traditionally performed very well in Kaggle competitions, as they tend to achieve very low error. Let us check it out.
AUTO ARIMA seems to have outperformed the remaining two models.
№9:Summary
Following is the summary of the steps we performed while doing the analysis.
- Downloaded the dataset from Kaggle.
- Performed pre-processing like checking for null values, removing redundant columns, importing all the required libraries.
- Performed exploratory data analysis. Key insights:
1. Steady growth of VWAP over time
2. The stock has been around its 50 days moving average from last year
3. The stock has major dips in the years 2012 and 2020.
- Performed feature engineering, calculated mean and standard deviation over a period of 3, 7 and 30 days.
- Checked the time series for stationarity, ran the Dickey-Fuller test and implemented the shift() technique to convert the time series to stationary.
- Trained the model using AUTO ARIMA, Prophet and Boosted trees.
- Based on the RMSE scores, AUTO ARIMA seems to have performed best.
№10:Future Work
- Although AUTO ARIMA is very powerful, ARIMA can also be implemented to see the variance in the output.
- You can try training an XGBoost model and see if it performs better than AUTO ARIMA.
№11:Reference
- https://www.kaggle.com/rohanrao/a-modern-time-series-tutorial/notebook
- https://www.kaggle.com/yashvi/time-series-analysis-and-forecasting-reliance
- https://www.kaggle.com/vikassingh1996/bajaj-stock-price-pred-xgb-fb-prophet-altair#XGBoost-Modeling-and-Forecasting
- https://www.kaggle.com/benroshan/reliance-nifty50-time-series-analysis#Model-building-Phase--Forecasting-&-Prediction
- Notebook link — https://jovian.ai/hargurjeet/nifty-50-time-series-forecasting-3
I really hope you guys learned something from this post. Feel free to clap if you like what you learnt. Let me know if there is anything you need my help with. Feel free to reach out over LinkedIn