Time series forecasting is a subfield of Data Science, with many applications ranging from demand forecasting to production and inventory planning, supply chain management, signal processing, weather forecasting, pattern recognition, and epidemiology.
So, what is a time series? A time series is a series of observations ordered in time. And the goal of time series forecasting is to predict the future values based on the past observed values of the series.
A retail example: amount spent on different t-shirt models by quarter of the year
Despite the abundance of successful applications, extensive literature, robust libraries and algorithms, and readily available datasets, time series forecasting is still a very challenging area for many data scientists. Not nearly as popular or “trendy” as image recognition and natural language processing, it gets little or no attention in most introductory data science courses and company academies.
This article provides a short introduction to time series analysis, especially for forecasting. We start by introducing main concepts, like stationarity and ARIMA modeling. Then, we move on to how a time series forecast could be re-framed into a supervised machine learning problem. Throughout the article, some Python libraries and a short demo will be presented.
Time Series Properties and ARIMA Modelling
The majority of data science models have an implicit assumption that the relationship between outputs and inputs is static. If this relationship changes with time, the model could be making obsolete predictions.
In time series analysis, the way to check whether the relationship between inputs and outputs is stable over time is to see if the time series is stationary. A time series is considered stationary if its statistical properties, such as its mean and variance, do not change over time. A quick way to check whether a time series is stationary is to perform the Augmented Dickey-Fuller test.
The majority of real-world time series are non-stationary. And most of the classical time series forecasting techniques are focused on converting non-stationary time series into stationary ones.
One of the most famous time series forecasting techniques is ARIMA (autoregressive integrated moving average). ARIMA is a combination of methods that enables forecasting based on past values. It combines a linear regression on lags of the target (the autoregressive part) and on past forecast errors (the moving average part) with differencing (the integrated part) to predict future values. Differencing is used to make the time series stationary by subtracting the previous value from the current value in the series.
ARIMA is a well-established and well-documented approach. However, its parameter configuration is quite complex and demands a deep understanding of the data. This may be manageable when the forecasting problem is simple (just one time series to predict), but real-world data is rarely that simple.
In real-world datasets there is normally not just one time series but many, sometimes even millions, which results in a huge number of ARIMA models to tune and an even bigger number of parameters to study. Each time series could also behave differently.
A good example is retail. A big retailer carries thousands of products in different locations, which results in millions of individual time series, one for each product/location combination. It is almost impossible for a data scientist to analyse all these time series, so an automatic model selection tool becomes crucial for a successful analysis. Examples of popular Python packages for automatic model selection with documented good results are pyramid-arima (since renamed pmdarima) and the prophet package developed by Facebook.
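To make the subtraction step concrete, here is a minimal sketch of first-order differencing with Pandas on a made-up trending series. The values are invented for illustration. Differencing removes the linear trend, and a cumulative sum recovers the original series.

```python
import pandas as pd

# Toy series with a linear trend (made-up values).
y = pd.Series([10, 12, 14, 16, 18, 20], name="y")

# First-order differencing: y[t] - y[t-1].
# The first element has no predecessor, so it becomes NaN.
dy = y.diff()
print(dy.tolist())  # [nan, 2.0, 2.0, 2.0, 2.0, 2.0]

# Differencing is invertible: put the first value back and accumulate.
restored = dy.fillna(y.iloc[0]).cumsum()
print(restored.tolist())  # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
```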
Modeling with Supervised Machine Learning
Another approach, which is the focus of the rest of the article, is to transform a time series forecasting problem into a supervised machine learning problem and then use typical machine learning algorithms, such as elastic net, random forests, or gradient boosting machines, to solve it.
With this procedure, it’s possible to overcome the aforementioned problem of the multiple independent time series, which may be very advantageous when some of the individual series have little or no data.
Using the retail example once again, some products may have no records for certain periods for a variety of reasons, such as no demand at that time/location pair, stock shortages, or the product being new. By concatenating all these individual time series into a single dataset and training a single model instead of multiple models, the algorithm can learn more subtle patterns that repeat across the time series and extrapolate patterns that appear in one series to a series with less data. Then, instead of having a huge number of models (which can result in difficulties during deployment and long training times), we have a single model for all these independent time series.
The simplest way to transform a time series forecast into a supervised learning problem is by creating lag features. The first approach is to predict the value at time t given the value at the previous time t-1.
Another useful feature is the difference between consecutive observations. Because the value at time t is the target, this difference should be calculated from the lag features, for example as the value at t-1 minus the value at t-2, to avoid leakage.
It is possible to calculate numerous lags for a dataset. Depending on the business, it can be useful to have lags from the last week, last month, or last year. With Ridge regression, it may be possible to identify the lags that have a bigger impact on the model or to detect non-stationarity in the series.
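A minimal sketch of a lag feature with Pandas, assuming a long-format table with hypothetical column names `product`, `week`, and `sales`. Shifting inside each product group keeps the lag from leaking across series when several series share one table.

```python
import pandas as pd

# Hypothetical long-format data: one row per (product, week).
df = pd.DataFrame({
    "product": ["A", "A", "A", "B", "B", "B"],
    "week":    [1, 2, 3, 1, 2, 3],
    "sales":   [5, 7, 6, 1, 0, 2],
})

# Rows must be in time order within each series before shifting.
df = df.sort_values(["product", "week"])

# Value at t-1 for each row; the first row of every product has no
# previous observation, so its lag is NaN.
df["sales_lag_1"] = df.groupby("product")["sales"].shift(1)

print(df)
```

The supervised problem is then to predict `sales` from `sales_lag_1`, after dropping or imputing the rows where the lag is missing.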
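The leakage concern can be illustrated with a short sketch on made-up values. The column names are hypothetical. The leak-free version only uses values known before time t, while a plain `diff()` on the target column includes the target itself.

```python
import pandas as pd

sales = pd.Series([5, 7, 6, 9, 8], name="sales")  # made-up values
features = pd.DataFrame({"sales": sales})

features["lag_1"] = sales.shift(1)  # value at t-1
features["lag_2"] = sales.shift(2)  # value at t-2

# Leak-free difference: (t-1) minus (t-2), both known before time t.
features["lag_1_diff"] = features["lag_1"] - features["lag_2"]

# Leaky difference: (t) minus (t-1) contains the target at time t.
# Shown only for contrast; it should not be used as a feature.
leaky_diff = sales.diff()

print(features)
```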
More advanced features can be computed by adding summary statistics of the values at previous time steps. This can be achieved with a sliding window of a fixed width and summary statistics such as the mean, maximum, minimum, or percentiles. When the mean is applied, it is called the rolling mean. An example of a rolling mean with a width of 3 would be mean(t-1, t-2, t-3). To calculate these rolling features, Pandas provides a useful function, rolling().
Another useful window is the expanding window, which includes all previous data in the series. It can help with keeping track of the bounds of the dataset. As with rolling windows, summary statistics can be applied. In a dataset with just four rows, the expanding mean for the last row would be mean(t-1, t-2, t-3), while the first row has no previous values and is therefore NaN. The Pandas function for expanding windows is expanding().
It is also possible to extract meaningful features from the timestamp, such as the day of the year, quarter, month, week, or holidays. This is quite straightforward with Pandas.
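A minimal sketch of the rolling mean described above, on made-up values. Shifting by one step before calling `rolling()` is a choice made here so that the window for each row covers only t-1, t-2, and t-3, not the current value.

```python
import pandas as pd

sales = pd.Series([4, 6, 8, 10, 12], name="sales")  # made-up values

# shift(1) excludes the current value, so the window of width 3
# covers t-1, t-2 and t-3 for every row.
rolling_mean_3 = sales.shift(1).rolling(window=3).mean()

print(rolling_mean_3.tolist())  # [nan, nan, nan, 6.0, 8.0]
```

The first three rows are NaN because fewer than three previous values exist for them. Other statistics follow the same pattern, for example `.max()`, `.min()`, or `.quantile(0.9)` in place of `.mean()`.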
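The four-row case can be sketched as follows, with made-up values. As with the rolling window, the series is shifted by one step first (an assumption of this sketch) so that each row only sees earlier observations.

```python
import pandas as pd

sales = pd.Series([4, 6, 8, 10], name="sales")  # made-up values

previous = sales.shift(1)  # only values strictly before each row

# Expanding statistics over everything seen so far.
expanding_mean = previous.expanding().mean()
expanding_max = previous.expanding().max()

print(expanding_mean.tolist())  # [nan, 4.0, 5.0, 6.0]
print(expanding_max.tolist())   # [nan, 4.0, 6.0, 8.0]
```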
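A short sketch of timestamp features using the Pandas `.dt` accessor. The dates and the one-entry holiday list are hypothetical; in practice the holiday calendar would come from the business or from a dedicated package.

```python
import pandas as pd

dates = pd.DataFrame(
    {"date": pd.to_datetime(["2021-01-04", "2021-05-17", "2021-12-25"])}
)

dates["day_of_year"] = dates["date"].dt.dayofyear
dates["quarter"] = dates["date"].dt.quarter
dates["month"] = dates["date"].dt.month
dates["week"] = dates["date"].dt.isocalendar().week  # ISO week number
dates["day_of_week"] = dates["date"].dt.dayofweek    # Monday = 0

# Hypothetical holiday calendar used only for this example.
holidays = pd.to_datetime(["2021-12-25"])
dates["is_holiday"] = dates["date"].isin(holidays)

print(dates)
```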
Training and Validation
In most machine learning contexts, model development follows a standard procedure. The model is trained on the available data and its performance is evaluated. If the performance is satisfactory, the model is deployed into production and its performance is monitored. If model performance degrades, a re-training is scheduled using up-to-date training data.
This model building pipeline is not well suited to time series forecasting. It is important to remember that the majority of real-world time series are not stationary, and therefore their statistical properties change with time. To overcome this, the model needs to be retrained as new data comes in.
Also, common machine learning validation techniques such as cross-validation could induce leakage during the validation process. For example, it makes no sense to try to forecast values for May using training data from April and June.
A good validation procedure for time series is “walk forward validation”, which tries to recreate the traditional training/test methodology for a time series scenario. The idea here is to split the time series data at fixed time intervals, expanding the training data set in each fold. Scikit-learn has a class that implements this validation, TimeSeriesSplit.
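A minimal sketch of how the scikit-learn splitter produces expanding training sets. The 12 observations and the choice of 3 splits are arbitrary values picked for the example.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered observations (the values themselves do not matter here).
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for fold, (train_idx, test_idx) in enumerate(splits):
    # Training indices always come before the test indices,
    # and the training set grows with each fold.
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each fold trains on everything before its test block, so the model is never evaluated on data older than what it was trained on.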
Walk forward validation visualization
Time series forecasting models can be evaluated with common regression metrics such as the Mean Absolute Error (MAE) or the R2 (R squared).
MAE is basically the average of the absolute differences between the predictions and the real values.
R2, also known as the coefficient of determination, is the percentage of the target variance that is explained by the model. In summary, it indicates the goodness of fit of a regression model.
Another interesting evaluation metric is the Mean Absolute Percentage Error (MAPE). This metric takes the form of a percentage, which is very useful for the business and can give a quick insight into how well the model performs. MAPE is simple to explain and scale-independent, which makes it easy to apply to both high and low target quantities. MAPE is the average of the absolute differences between the predictions and the real values, divided by the real values.
Other evaluation metrics, such as RMSE and MSE, could also be used depending on the analysis.
A very important difference between typical machine learning problems (such as classification) and time series forecasting is that it is harder to judge performance from the value of an evaluation metric alone, due to stationarity problems or time series with little data. A model with a MAPE of 5% may not be performing well and, on the contrary, a model with a MAPE of 30% is not necessarily a bad model. A way to overcome this handicap is to set a baseline score that the model has to beat. An example of a baseline is using last month’s sales as the prediction for this month’s sales.
In certain applications, the uncertainty of the forecast is as crucial as the forecast itself. For example, in inventory applications and demand forecasting, the uncertainty of the forecast, represented by forecast quantiles and intervals, can be used to calculate the safety stock, i.e. the additional quantity of inventory necessary to avoid losses. In financial time series forecasting, there are classes of models built explicitly for modeling the uncertainty of the series instead of the series itself, such as GARCH and ARCH models.
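The metrics above can be sketched on a handful of made-up values. MAE, MSE, and R2 come from scikit-learn; MAPE is written out with NumPy here to keep the formula visible, and it assumes that no real value is zero.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up real values and predictions.
y_true = np.array([100.0, 200.0, 50.0, 400.0])
y_pred = np.array([110.0, 180.0, 60.0, 400.0])

# MAE: average absolute difference -> (10 + 20 + 10 + 0) / 4 = 10
mae = mean_absolute_error(y_true, y_pred)

# MAPE: average absolute difference relative to the real value,
# as a percentage -> (0.10 + 0.10 + 0.20 + 0) / 4 = 10%
# Undefined when a real value is zero.
mape = np.mean(np.abs(y_true - y_pred) / np.abs(y_true)) * 100

# RMSE: square root of the mean squared error -> sqrt(600 / 4)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# R2: share of the target variance explained by the predictions.
r2 = r2_score(y_true, y_pred)

print(f"MAE={mae:.2f}  MAPE={mape:.1f}%  RMSE={rmse:.2f}  R2={r2:.3f}")
```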
In this demonstration, some of the above concepts will be shown. The dataset is from the UCI Repository and consists of 811 products (independent time series) over 52 weeks. In the next image you can see a sample of the dataset:
As mentioned previously, one approach would be to fit one model to each of the 811 products. However, this could be problematic because each time series has just 52 data points, which is not much information for a single model.
Therefore, in this example, all the time series will be concatenated and, instead of training multiple models with little data, we will train a single model with much more data.
The first step is to transform the data into a supervised machine learning problem by creating lag features. For this purpose, the Pandas shift() function is used.
To improve the predictive power of the forecasting model, the difference between the sales in the previous week and the sales of the week before it (t-1 minus t-2) is calculated with the diff() function. Therefore, two extra features are added (Last Week Sales and Last Week Diff), and the target is the Sales variable.
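One way the concatenation could look is sketched below, assuming the raw table has one row per product and one column per week. The column names `Product_Code`, `W0`, `W1`, ... and the random values are assumptions made for this sketch, not the actual file.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in for the wide table: 3 products, 4 weeks.
wide = pd.DataFrame(
    rng.poisson(5, size=(3, 4)),
    columns=[f"W{i}" for i in range(4)],
)
wide.insert(0, "Product_Code", ["P1", "P2", "P3"])

# Stack all series into one long table: one row per (product, week).
long_df = wide.melt(id_vars="Product_Code", var_name="Week", value_name="Sales")

# Turn "W3" into the integer 3 and put rows in time order per product.
long_df["Week"] = long_df["Week"].str[1:].astype(int)
long_df = long_df.sort_values(["Product_Code", "Week"]).reset_index(drop=True)

print(long_df.head(8))
```

With every series in a single table, one model can be trained on all rows at once.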
Setting a Baseline and a Validation Split
It is important to set an evaluation baseline that the model needs to beat to know if it is a useful model. In this example, a reasonably strong baseline is to use last week’s sales as a forecast for the sales this week.
The idea here is to train the model each week to predict the sales for the next week. For this purpose, a sliding window in a walk-forward validation fashion will be used. To avoid getting a good performance on a small sample of weeks just by luck, we will use weeks 40 to 51, repeating the process for one week at a time, and compute the average score.
The evaluation metric that we are going to use is the MAE because its values are more intuitive to understand. The MAPE, although very straightforward in terms of business evaluation, is tricky to calculate in this dataset because sales are zero in some weeks for certain products, so we skip it in this demonstration for simplicity.
Due to the scarcity of available data, and since this example is just a short demo, there won’t be a train/test split. Nevertheless, for more complex projects, always hold out some periods from the validation set to obtain a more robust evaluation of the developed forecasting model.
Analysing the validation results for this baseline model, we can see that the average error is 2.7. This is the MAE that the machine learning model needs to beat.
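A minimal sketch of this baseline evaluation. It runs on synthetic Poisson data with hypothetical column names rather than on the UCI dataset, so the resulting number will not match the 2.7 reported above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n_products, n_weeks = 20, 52

# Synthetic stand-in for the demo data (NOT the UCI dataset):
# each product has its own average weekly demand.
mean_demand = rng.uniform(1, 20, n_products)
df = pd.DataFrame({
    "Product_Code": np.repeat(np.arange(n_products), n_weeks),
    "Week": np.tile(np.arange(n_weeks), n_products),
    "Sales": rng.poisson(lam=np.repeat(mean_demand, n_weeks)),
})

# Lag feature: sales of the previous week for the same product.
df["Last_Week_Sales"] = df.groupby("Product_Code")["Sales"].shift(1)
df = df.dropna()

baseline_errors = []
for week in range(40, 52):  # weeks 40 to 51, one at a time
    val = df[df["Week"] == week]
    # Baseline forecast: this week's sales equal last week's sales.
    baseline_errors.append(
        mean_absolute_error(val["Sales"], val["Last_Week_Sales"])
    )

baseline_mae = float(np.mean(baseline_errors))
print(f"Baseline MAE over weeks 40-51: {baseline_mae:.2f}")
```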
A nice first model to try is a Random Forest. It is very robust to overfitting due to the high number of decision trees that could be used (we will use 1000), and because these trees could be trained in parallel it does not need a long time to train.
Applying the model to the data, we can see an error reduction of 10%, which is a good start.
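The walk-forward loop with a Random Forest could look roughly like the sketch below. It uses synthetic data and hypothetical column names, and only 50 trees instead of the 1000 mentioned above to keep the runtime short, so the error it prints says nothing about the 10% reduction reported for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n_products, n_weeks = 20, 52

# Synthetic stand-in for the demo data (NOT the UCI dataset).
mean_demand = rng.uniform(1, 20, n_products)
df = pd.DataFrame({
    "Product_Code": np.repeat(np.arange(n_products), n_weeks),
    "Week": np.tile(np.arange(n_weeks), n_products),
    "Sales": rng.poisson(lam=np.repeat(mean_demand, n_weeks)),
})

# Features built only from past values.
df["Last_Week_Sales"] = df.groupby("Product_Code")["Sales"].shift(1)
df["Last_Week_Diff"] = df.groupby("Product_Code")["Last_Week_Sales"].diff()
df = df.dropna()

features = ["Last_Week_Sales", "Last_Week_Diff"]
errors = []
for week in range(40, 52):
    train = df[df["Week"] < week]   # everything before the target week
    val = df[df["Week"] == week]    # the week being forecast

    # Retrain from scratch for every week (walk-forward validation).
    model = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=0)
    model.fit(train[features], train["Sales"])

    predictions = model.predict(val[features])
    errors.append(mean_absolute_error(val["Sales"], predictions))

model_mae = float(np.mean(errors))
print(f"Random Forest MAE over weeks 40-51: {model_mae:.2f}")
```

Comparing `model_mae` with the baseline MAE computed over the same weeks shows whether the model adds anything over the naive forecast.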
Further Steps to Improve the Model
To improve the predictive performance of the model, new features could be calculated, and transforming the target distribution might also be important.
Histogram of the target distribution
Looking at the histogram of the target (Sales) distribution, we can see that the data does not follow a normal distribution. In fact, it is quite skewed. A way to address this issue is to compute the logarithm of the target, which changes the data distribution, hopefully to one more closely resembling a normal distribution.
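A small sketch of a log transform of the target on made-up values. Because sales can be zero in this kind of data, the sketch uses `log1p`, i.e. log(1 + x), which is an assumption of this example and not necessarily the exact transform used in the demo. Predictions made on the log scale must be mapped back with the inverse, `expm1`.

```python
import numpy as np

# Made-up, right-skewed sales values, including a zero.
sales = np.array([0.0, 1.0, 3.0, 10.0, 120.0])

# log(1 + x) is defined at zero, unlike a plain logarithm.
log_sales = np.log1p(sales)

# Inverse transform, to bring predictions back to the original scale.
restored = np.expm1(log_sales)

print(np.round(log_sales, 3))
print(restored)
```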
We can also increase the number of lags and differences. Trying different algorithms could also lead to better results.
After creating lagged features up to six weeks behind the current week, taking the target log, and using another algorithm, it was possible to reduce the error compared to the baseline by almost 20%.
To Sum Up
Time series forecasting is a complex subject in machine learning. It has a wide range of applications, from meteorology to economics and finance. This article covered the main characteristics of time series, how to model them, and how to transform a time series forecasting problem into a supervised machine learning problem. This approach is especially useful when multiple time series are present in the data. At the end, a demo of this supervised machine learning approach was presented.