The ultimate time series workflow

An easier way of finding SARIMA(p,d,q)(P,D,Q,m) parameters

Sebastien Sime
6 min read · Dec 24, 2021
(ML-for-beginner Microsoft program)

Time series analysis is used on time-stamped data to forecast future evolution based on past data. It is a special kind of machine learning analysis that requires special tools and methods, as shown in the picture above (taken from a Microsoft program).

To study time series, models such as ARIMA or SARIMA are used, since they can be quite effective at forecasting the future.

The challenge is that using such models requires setting several hyperparameters, and finding the right set can be a struggle even for experienced data scientists.

The aim of this post is to show an effective way of finding this set of parameters more easily.

After reading this post, you too will be equipped to perform your time series analyses without the hassle of spending a significant amount of time searching for the right parameters.

However, to be at ease when reading this post, you should have a little experience with statistics and time series, as the focus here is twofold: to show an easier way of finding model parameters and to present an effective workflow.

You will find at the end of the post the link to my GitHub with the full code used.

Introduction:

To show all the steps, we will use the famous Mauna Loa CO2 emission dataset (which you can find here). The data consist of CO2 measurements at Mauna Loa, and we would like to use past data to predict the CO2 emission one year into the future.

First five rows of the dataset

As we can see in the picture above, these are monthly data covering 1958 to 2010. We will mainly focus on the columns year, month and interpolated.

For this project I used Python with Google Colab.

The ultimate workflow:

First of all let’s define the classic workflow when dealing with time series:

  1. Perform some feature engineering to prepare the data for the analysis
  2. Visualize the data to understand the underlying characteristics (trend, seasonality, abrupt changes etc.)
  3. Find the parameter values of the ARIMA or SARIMA (with seasonality) model
  4. Split your data
  5. Fit the model
  6. Perform the prediction and evaluate the results
  7. Retrain the model on the full data

Feature engineering:

Looking at the previous table, we need an index in datetime format for time series processing to work. The other columns seem to be numerical, so no further transformations are needed. In addition, the interpolated column has no missing values.
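This transformation can be sketched as follows, on a tiny synthetic frame with the same columns as the Mauna Loa dataset (the values are illustrative, and the real loading step is omitted here):

```python
import pandas as pd

# Tiny synthetic frame with the same columns as the Mauna Loa dataset
# (the numbers are illustrative, not the real measurements).
df = pd.DataFrame({
    "year": [1958, 1958, 1958],
    "month": [3, 4, 5],
    "interpolated": [315.71, 317.45, 317.50],
})

# Combine year and month into a proper monthly DatetimeIndex so that
# time series tooling (decomposition, SARIMA fitting) works.
df.index = pd.to_datetime(df[["year", "month"]].assign(day=1))
df = df[["interpolated"]]
```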

After the transformation, we end up with the following Dataframe:

Dataset after transformation

Data visualisation:

The aim of the visualisation step is to find the main characteristics of the data.

Plot of the data

Just looking at the above graph we can see that:

  • There is an increasing trend (so the data is not stationary)
  • There seems to be a seasonal effect (suggested by the repeating waves around the trend). Zooming in on three years (see the picture below), we can see the pattern more clearly:
Pattern on one cycle (=12 months)

So from the outset, we know that we will need to account for the seasonal effect in our model (i.e. use a SARIMA model instead of ARIMA).

Next, in order to decompose the data into seasonal and trend components, we need to know whether we are dealing with an additive model or a multiplicative one. There are actually a few methods to find out (see the full code on my GitHub below).

A simple method consists in performing the decomposition using both additive and multiplicative models, as shown below:

Error-Trend-Seasonal decomposition with multiplicative (left) and additive (right) models

Looking at the residual and seasonal plots, we can see that with the additive model the residuals are distributed around zero (indicating a mean of zero), which suggests the additive model is the better fit.

So to continue we know that:

  • The seasonal component should be included
  • The cycle length is 12 (one year)
  • The series is not stationary as is, so differencing is needed before using a SARIMA model

Defining the SARIMA parameters:

For the SARIMA model we have to find seven parameters (p, d, q)(P, D, Q, m) before beginning the training process.

Finding the right set of parameters can take quite some time with the classic approach, which requires you to:

  1. Find the number of times the data should be differenced (the parameter d) using the Dickey-Fuller test
  2. Plot the ACF (autocorrelation function) and PACF (partial autocorrelation function) of the stationary data
Plots found in our case (after one differencing)

3. Use an extensive amount of experience to deduce the parameters p and q from the above plots

As you can see, the classic approach is quite iterative, with a human factor that could jeopardize the whole study.

Nowadays, with modern computing power, it is possible to use an automated approach that tries different sets of parameters and returns a satisfactory final set.

So I used the pmdarima (Pyramid ARIMA) Python library, which is both simple to use and quite effective, since it leverages the Akaike information criterion (AIC) to find the best and least complicated model.

The library can be installed simply using pip install pmdarima.

Simply calling the auto_arima function of pmdarima on the interpolated column provides the following result:

auto_arima function results

With just one line of code and about 5 minutes of running time, we get ready-to-use parameters!😎

SARIMA parameters

Train / Test split:

This step is pretty straightforward, since it only consists in dividing our data. As for the size of the test set, a rule of thumb is to make it as long as the period we would like to predict.

As said at the beginning, since we want to forecast one year into the future, our test set should be one year (12 months) long.
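A minimal sketch of the split, on a synthetic frame standing in for the prepared CO2 data:

```python
import numpy as np
import pandas as pd

# Synthetic monthly frame standing in for the prepared CO2 data.
idx = pd.date_range("1958-01-01", periods=120, freq="MS")
df = pd.DataFrame({"interpolated": 300 + 0.1 * np.arange(120)}, index=idx)

# No shuffling for time series: the last 12 months (the forecast
# horizon) form the test set, everything before them the train set.
train, test = df.iloc[:-12], df.iloc[-12:]
```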

Model training and prediction:

A SARIMA model can then be trained on the train set with the above parameters using the fit method. Predicting over the test period, we get the following plot:

Test and prediction plots

We can see that our model seems to perform pretty well, but we need to confirm this using the mean squared error and root mean squared error (MSE and RMSE).

Evaluate the results:

Using the MSE and RMSE metrics, we can see below that the errors are small, so we are confident in the model.
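The two metrics can be computed as follows, on hypothetical held-out values and predictions (illustrative numbers, not the real results):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical held-out values and predictions (illustrative numbers).
y_true = np.array([315.0, 316.2, 317.1, 316.8])
y_pred = np.array([315.3, 316.0, 317.4, 316.5])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is in the same units (ppm) as the data
print(mse, rmse)
```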

Final training on all the dataset:

Since we are confident in our model, we can retrain it with the found parameters on the whole dataset. Only after this step can we confidently forecast into the future.

Performing a forecast with our model gives us the following result:

forecasting with our SARIMA model

Conclusion:

Time series analyses are quite powerful when it comes to forecasting with time-stamped data, but finding the right set of parameters for an ARIMA or SARIMA model can take quite some time using the classic approach. Fortunately, there are automated tools like the pmdarima Python library that can be used out of the box to directly find the parameter set.

This post and the full code below give you a workflow that may well become an industry standard in the future.

Full python code:

ssime-git/Mauna-Loa-co2-Time-series-analysis: Time serie analysis of CO2 emission at Mauna Loa (github.com)
