Author: Dataiku (Alex Combessie)
“Forecasting is required in many situations: deciding whether to build another power generation plant in the next five years requires forecasts of future demand; scheduling staff in a call centre next week requires forecasts of call volumes; stocking an inventory requires forecasts of stock requirements. Forecasts can be required several years in advance (for the case of capital investments), or only a few minutes beforehand (for telecommunication routing). Whatever the circumstances or time horizons involved, forecasting is an important aid to effective and efficient planning.”
— Hyndman, Rob J. and George Athanasopoulos
This plugin offers a set of 3 visual recipes to forecast yearly to hourly time series. It covers the full cycle of data cleaning, model training, evaluation and prediction.
- Cleaning, aggregation, and resampling of time series data, i.e. data of one or several values measured over time (Recipe)
- Training of forecasting models of time series data, and evaluation of these models (Recipe)
- Predicting future values based on trained models (Recipe)
It follows classic forecasting methods, described in Hyndman, Rob J., and George Athanasopoulos (Forecasting: principles and practice. OTexts, 2018) and in Taylor, Sean J., and Benjamin Letham (Forecasting at Scale, The American Statistician, 2018).
This plugin does NOT work on time series sampled more finely than hourly (data must be at the hourly level or coarser) and does not provide signal processing techniques (Fourier Transform, …).
This plugin works well when:
- The training data consists of one or multiple time series at the hour, day, week, month or year level and fits in the server’s RAM.
- The object to predict is the future of one of these time series.
Forecasting is slightly different from “classic” Machine Learning (ML) as currently available visually in Dataiku DSS, since:
- Forecasting models output multiple values, whereas one ML model usually has a single output.
- Forecasting models account for time natively, whereas ML models usually need some feature engineering (lagging, windowing) to model time.
- Measuring the accuracy of forecasting models requires specific methods (time series cross-validation, back-testing).
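To make the "lagging" feature engineering mentioned above concrete, here is a minimal sketch in plain Python of how a series can be turned into rows of lagged inputs and a target for a classic ML model. The function name `make_lag_features` and the sample values are illustrative assumptions, not part of the plugin.

```python
def make_lag_features(series, n_lags):
    """Turn a series into (lagged inputs, target) rows for a classic ML model."""
    rows = []
    for t in range(n_lags, len(series)):
        # The n_lags values before time t are the features; the value at t is the target.
        rows.append((series[t - n_lags:t], series[t]))
    return rows


# Hypothetical series, for illustration only.
lagged_rows = make_lag_features([1, 2, 3, 4, 5], n_lags=2)
for features, target in lagged_rows:
    print(features, "->", target)
```

A forecasting model takes the series itself as input, so this reshaping step is not needed with the recipes described here.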
How to set up
As part of the installation process, the plugin will create a new R code environment. Hence, R must be installed and integrated with Dataiku on your machine prior to the installation.
You may need to follow this documentation if that is not the case. Note that the plugin requires at least R 3.5 and that Anaconda R is not supported.
How to use
1. Clean time series (optional)
Use this recipe to resample and aggregate time series, and to handle their missing values and outliers.
Input: Dataset containing time series
Output: Dataset with cleaned time series
- Time column: The column with time information in Dataiku date format (may need parsing beforehand in a Prepare recipe)
- Series columns: The columns with time series numeric values.
Resampling and Aggregation
- Time granularity: This determines the amount of time between data points in the cleaned dataset.
- Aggregation method: When multiple rows fall within the same time period, they are aggregated into the cleaned dataset either by summing (default) or averaging their values.
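The aggregation step can be sketched as follows, assuming a daily time granularity. This is an illustration in plain Python, not the plugin's R implementation; the function name `aggregate_by_day` and the sample rows are hypothetical. The sketch only covers aggregation and does not create rows for days with no data.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def aggregate_by_day(rows, method="sum"):
    """Group (timestamp, value) rows by calendar day and aggregate each group."""
    buckets = defaultdict(list)
    for timestamp, value in rows:
        buckets[timestamp.date()].append(value)
    combine = sum if method == "sum" else mean
    return {day: combine(values) for day, values in sorted(buckets.items())}


# Hypothetical rows: two measurements on January 1st, one on January 2nd.
rows = [
    (datetime(2020, 1, 1, 9), 10.0),
    (datetime(2020, 1, 1, 17), 30.0),
    (datetime(2020, 1, 2, 12), 5.0),
]
daily_sum = aggregate_by_day(rows, method="sum")
daily_mean = aggregate_by_day(rows, method="mean")
```

Summing suits quantities that accumulate over a period (for example sales), while averaging suits levels (for example temperature).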
Missing Values: Choose one of the following imputation strategies:
- Interpolate (default) uses linear interpolation for non-seasonal series. For seasonal series, a robust STL decomposition is used. Linear interpolation is applied to the seasonally adjusted data, and then the seasonal component is added back.
- Replace with average/median/fixed/previous value.
- Do nothing.
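For the non-seasonal case, linear interpolation can be sketched as below. This is a simplified illustration in plain Python, not the plugin's code: the function name `interpolate_linear` is hypothetical, missing values are represented as `None`, gaps at the very start or end of the series are left untouched, and the seasonal STL-based variant is not covered.

```python
def interpolate_linear(values):
    """Fill None gaps by linear interpolation between the nearest known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        # Constant increment between two consecutive known points.
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


# Hypothetical series with three missing values.
series = [1.0, None, None, 4.0, 5.0, None, 7.0]
filled = interpolate_linear(series)
```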
Outliers: Choose one of the following imputation strategies:
- Interpolate (default) uses the same technique as for missing values.
- Replace with average/median/fixed/previous value.
- Do nothing.
Outliers are detected by fitting a simple seasonal trend decomposition model, using the tsclean function from the R forecast package.
2. Train models and evaluate errors on historical data
Use this recipe to train forecasting models and evaluate them on cleaned historical data
Input: A dataset with cleaned time series data, sampled at a regular granularity
Output: Folder with the trained R forecasting models
Output: Dataset with model evaluation results
- Time column: The column with time information in Dataiku date format
- Target column: The time series you want to predict
- External features (optional): Columns with external numeric regressors, for instance, indicators of holidays. Note that future values of these regressors are required when forecasting (next recipe). See below to know which models can use these features.
- Automated mode: Select which model types to train.
By default, we only activate two model types: Baseline and Neural Network, as they converged for all the datasets used in our benchmarks. You may select more models, but be aware that some model types take more time to compute, or may fail to converge on small datasets. In the latter case, you will get an error when running the recipe, telling you which model type to deactivate.
- Expert mode: Gives access to advanced parameters that are specific to each model type. For details on each parameter, please refer to the R package documentation.
- Train/Test Split (default): The test set consists of the last H values of the time series, and the train set consists of all values before them.
- You can change H with the Horizon parameter.
- The models will be trained on the train split and evaluated on their errors on the test split, for the entire forecast horizon.
- Time Series Cross-Validation: Split your dataset into multiple rolling train/test splits.
- The models will be retrained and evaluated on their errors for each split. Error metrics are then averaged across all splits.
- Each split is defined by a cutoff date: the train split is all data before or at the cutoff date, and the test split is the H values after cutoff. H is the Horizon parameter, same as for the Train/Test Split strategy.
- Cutoffs are made at regular intervals according to the Cutoff period parameter, but no cutoff is placed before the end of the initial training period, whose length is set by the Initial training parameter.
- Both parameters are expressed as a number of time steps.
- Having a large enough initial training set guarantees that the models trained on the first splits have enough data to converge. You may want to increase that parameter if you encounter model convergence errors.
- The exact method used for cross-validation is described in the Prophet documentation and explained in a simpler implementation by Rob Hyndman.
- Note that Cross-Validation takes more time to compute since the models are retrained and evaluated once per split. In contrast, the Train/Test Split strategy only requires one training and evaluation.
- To limit that cost, models are refit to each training split but their hyperparameters are not re-estimated.
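The two evaluation strategies above can be sketched as follows. This is a simplified illustration in plain Python, with hypothetical function names. It assumes cutoffs are placed forward from the end of the initial training period; the exact cutoff placement used by the plugin follows the Prophet documentation and may differ. All parameters are expressed in number of time steps, and the horizon must be at least 1.

```python
def train_test_split(series, horizon):
    """Hold out the last `horizon` values as the test set."""
    return series[:-horizon], series[-horizon:]


def rolling_splits(series, horizon, initial_training, cutoff_period):
    """Build rolling (train, test) splits, one per cutoff.

    Assumption: the first cutoff sits right after the initial training period,
    and later cutoffs follow every `cutoff_period` steps.
    """
    splits = []
    cutoff = initial_training
    while cutoff + horizon <= len(series):
        train = series[:cutoff]                  # all data up to the cutoff
        test = series[cutoff:cutoff + horizon]   # the H values after the cutoff
        splits.append((train, test))
        cutoff += cutoff_period
    return splits


# Hypothetical series of 10 time steps.
series = list(range(10))
single_train, single_test = train_test_split(series, horizon=2)
splits = rolling_splits(series, horizon=2, initial_training=4, cutoff_period=2)
for train, test in splits:
    print(len(train), "training points ->", test)
```

With these settings, cross-validation produces three splits and therefore three rounds of refitting and evaluation, against one for the Train/Test Split strategy.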
3. Forecast future values and get historical residuals
Use this recipe to predict future values and/or get historical residuals using a previously trained model.
Input: Folder with the trained R forecasting models (from the previous recipe)
Input: Dataset with model evaluation results (from the previous recipe)
Output: Dataset with forecasts and/or historical residuals
Model Selection: Choose how to select the model used for prediction:
- Automatic: if you want to select the best model according to an error metric computed in the evaluation dataset input.
- Manual: if you want to select the model yourself.
Note that if you have run the Train recipe multiple times, the most recent training is always used by the model selection step.
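The automatic selection logic can be sketched as below, in plain Python. The column names (`training_date`, `model`, `mape`) and the values are hypothetical and do not reflect the actual schema of the evaluation dataset; the sketch only illustrates keeping the most recent training and then picking the model with the lowest error.

```python
def select_best_model(evaluation_rows, metric):
    """Keep rows from the most recent training, then pick the lowest error."""
    latest = max(row["training_date"] for row in evaluation_rows)
    candidates = [row for row in evaluation_rows if row["training_date"] == latest]
    return min(candidates, key=lambda row: row[metric])


# Hypothetical evaluation results from two runs of the Train recipe.
evaluation_rows = [
    {"training_date": "2020-01-01", "model": "baseline", "mape": 0.05},
    {"training_date": "2020-02-01", "model": "baseline", "mape": 0.12},
    {"training_date": "2020-02-01", "model": "neural_network", "mape": 0.09},
]
best = select_best_model(evaluation_rows, metric="mape")
print(best["model"])
```

Note that the older run is ignored even though it contains the lowest error overall, which matches the behaviour described above.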
Prediction: Choose whether you want to include history, the forecast, or both. If you are including the forecast, specify the Horizon and the probability percentage for the Confidence interval. If you are including the history, please note that residuals are equal to the historical value minus the one-step forecast.
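The residual definition above can be illustrated with a short sketch in plain Python. The one-step forecasts here come from a naive stand-in (the previous observed value), which is an assumption made for illustration; in the recipe they come from the selected trained model.

```python
def one_step_residuals(actuals, one_step_forecasts):
    """Residual = historical value minus the one-step forecast for that step."""
    return [actual - forecast for actual, forecast in zip(actuals, one_step_forecasts)]


# Hypothetical history.
history = [10.0, 12.0, 11.0, 15.0]
# Naive stand-in: the forecast for step t is the value observed at step t-1,
# so the first value has no forecast and is skipped.
naive_forecasts = history[:-1]
residuals = one_step_residuals(history[1:], naive_forecasts)
```

A positive residual means the model under-forecast that step, and a negative one means it over-forecast.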
The output dataset is well suited to building charts to visually inspect the forecast results.
Forecasts by entity a.k.a. partitioned forecasts
If you want to run the recipes to get one forecast for each entity (e.g. for each product or store), you will need partitioning. This requires all datasets to be partitioned by one dimension for the category, using the discrete dimension feature in Dataiku. If the input data is not partitioned, you can use a Sync recipe to repartition it, as explained in this article.