Forecast Plugin

The Forecast plugin provides visual recipes in Dataiku DSS for working with time series data and solving forecasting problems.

Forecasting is required in many situations: deciding whether to build another power generation plant in the next five years requires forecasts of future demand; scheduling staff in a call centre next week requires forecasts of call volumes; stocking an inventory requires forecasts of stock requirements. Forecasts can be required several years in advance (for the case of capital investments), or only a few minutes beforehand (for telecommunication routing). Whatever the circumstances or time horizons involved, forecasting is an important aid to effective and efficient planning.
-- Hyndman, Rob J. and George Athanasopoulos

Example of time series forecasting at the week level in Dataiku DSS.

Plugin information

Version: 0.1.0
Author: Dataiku (Alexandre Combessie)
Released: 2019-01-22
Last updated: 2019-01-22
License: MIT License
Source code: Github
Reporting issues: Github

Description

This plugin offers a set of 3 visual recipes to forecast time series at any granularity from yearly to hourly. It covers the full cycle of data cleaning, model training, evaluation and prediction.

It follows classic forecasting methods, as described in Hyndman, Rob J., and George Athanasopoulos (Forecasting: principles and practice. OTexts, 2018) and in Taylor, Sean J., and Benjamin Letham (Forecasting at Scale, The American Statistician, 2018).

This plugin does NOT work on sub-hourly time series (data must be at the hourly level or above) and does not provide signal processing techniques (Fourier transform, …).

This plugin works well when:

  • The training data consists of one or multiple time series at the hour, day, week, month or year level and fits in the server's RAM.
  • The objective is to predict future values of one of these time series.

Forecasting differs slightly from the "classic" Machine Learning (ML) currently available visually in Dataiku, mainly because:

  • Forecast models output a series of values whereas Visual ML outputs a single value.
  • The open-source implementations of forecasting models differ from the Python and Scala ones available in the Visual ML, and cannot be integrated into it as custom models.
  • Evaluation of forecast accuracy uses specific methods (errors across a forecast horizon, cross-validation) which are not available in the Visual ML.

Installation Notes

The plugin can be installed from the Plugin Store or via the zip download (see above).

Note that the plugin uses an R code environment, so R (version 3.5.0 or above) must be installed and integrated with Dataiku on your machine.

You may encounter issues with the installation of the RStan package in the code environment on some operating systems. RStan has some system-level dependencies that may require additional setup. In this case, please see the RStan Getting Started wiki.

How to use

Clean recipe

Use this recipe to resample and aggregate the time series, and to clean missing values and outliers.
Input:
  • Dataset containing the time series.
Output:
  • Dataset containing the cleaned time series.

Clean recipe screenshot

Settings

Input Data

  • Time column. The column with time information in Dataiku date format (it may need to be parsed beforehand in a Prepare recipe).
  • Series columns. The columns with the numeric time series values.

Resampling and Aggregation
  • Time granularity. This determines the amount of time between data points in the cleaned dataset.
  • Aggregation method. When multiple rows fall within the same time period, they are aggregated into the cleaned dataset by either summing or averaging their values, as sketched below.
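
For illustration only, here is a minimal R sketch of such a resampling step, assuming a raw data frame raw_df with a parsed timestamp column and a numeric value column (these names are hypothetical, not the plugin's internals):

    library(dplyr)

    # Resample hourly (or finer) rows to daily granularity: rows falling within
    # the same day are aggregated by summing or averaging their values.
    daily <- raw_df %>%
      mutate(day = as.Date(timestamp)) %>%
      group_by(day) %>%
      summarise(value = sum(value))   # use mean(value) for the averaging method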

Missing Values. Choose one of the following methods:
  • Interpolate uses linear interpolation for non-seasonal series. For seasonal series, a robust STL decomposition is used. A linear interpolation is applied to the seasonally adjusted data, and then the seasonal component is added back.
  • Replace with average/median/fixed value.
  • Do nothing.

Outliers are detected by fitting a simple seasonal trend decomposition model using the tsclean method from the forecast package.
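
As a rough sketch of what these cleaning steps look like with the forecast R package, assuming the series has already been converted to a ts object with the right frequency (sales_vector is a hypothetical input; tsclean is named by the plugin, while na.interp is an assumption matching the interpolation behaviour described above, not the plugin's exact code):

    library(forecast)

    # Hypothetical monthly series with gaps and outliers
    y <- ts(sales_vector, frequency = 12)

    # Missing values: linear interpolation; for seasonal series, interpolation is
    # applied to the seasonally adjusted data from a robust STL decomposition
    y_filled <- na.interp(y)

    # Outliers: identified and replaced using a seasonal trend decomposition
    y_clean <- tsclean(y_filled)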

Train and Evaluate recipe

Use this recipe to train forecasting models and evaluate them on cleaned historical data.
Input:
  • Dataset with time series data (ideally the output of the Clean recipe).
Outputs:
  • Folder containing the forecast R objects.
  • Dataset containing the evaluation results.

Train and evaluate recipe screenshot

Settings

Input Data

  • Time column. The column with time information in Dataiku date format
  • Target series column. The time series you want to predict.
  • Time granularity. The amount of time between data points.

Modeling
  • Automated mode. Select which models to train. By default we only try two model types: Baseline and Prophet, as they converged for all the datasets used in our benchmarks. You may select more models, but be aware that some model types take more time to compute, or may fail to converge on some datasets. In the latter case, you will get an error when running the recipe telling you which model type to deactivate.

    The following models are available in the recipe (a sketch of possible open-source implementations follows this list):

    • Prophet
    • Neural Network
    • Seasonal Trend
    • Exponential Smoothing
    • ARIMA
    • State Space

  • Expert mode. Gives access to advanced parameters specific to each model type. For details, see the documentation of the forecast R package.
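
Assuming the plugin wraps the standard open-source implementations from the forecast and prophet R packages, fitting these model types in plain R looks roughly like the sketch below (the function choices and variable names are illustrative assumptions, not a description of the plugin's internals):

    library(forecast)
    library(prophet)

    y <- ts(train_values, frequency = 12)   # hypothetical monthly training series

    fit_ets   <- ets(y)          # Exponential Smoothing
    fit_arima <- auto.arima(y)   # ARIMA with automatic order selection
    fit_nnet  <- nnetar(y)       # Neural Network autoregression
    fit_stl   <- stlm(y)         # Seasonal Trend decomposition

    # Prophet expects a data frame with columns ds (dates) and y (values)
    fit_prophet <- prophet(data.frame(ds = train_dates, y = train_values))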

Error Evaluation
  • Split (default). Train/test split where the test set consists of the last H values of the time series. You can change H with the horizon parameter. The models will be retrained on the train split and evaluated on their errors on the test split, for the entire forecast horizon.
  • Cross-validation. Time series method to split your dataset into multiple rolling train/test splits. The models will be retrained and evaluated on their errors for each split, and the error metrics are then averaged across all splits. Each split is defined by a cutoff date: the train split is all data up to and including the cutoff date, and the test split is the H values after the cutoff, where H is the same horizon parameter as in the Split strategy. Cutoffs are made at regular intervals according to the "Cutoff period" parameter, but cannot fall before the "Initial training" parameter. Having a large enough initial training set guarantees that the models trained on the first splits have enough data to converge; you may want to increase that parameter if you encounter model convergence errors.
    The exact method used for cross-validation is described in the Prophet documentation and explained in a slightly longer version by Hyndman.
    Note that cross-validation takes more time to compute, since it involves multiple retrainings and evaluations of the models, whereas the Split strategy only requires one. To alleviate this, models are refit on each training split but their hyperparameters are not re-estimated; this is done on purpose to accelerate the computation.
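
For illustration, the same rolling-split evaluation can be expressed with the prophet R package as follows (the parameter values are hypothetical; the recipe exposes the equivalent horizon, cutoff period and initial training settings in its interface):

    library(prophet)

    # fit_prophet is a trained Prophet model (see the modeling sketch above).
    # Horizon of 30 days, cutoffs every 15 days, at least 90 days of initial training.
    df_cv <- cross_validation(fit_prophet, horizon = 30, units = "days",
                              period = 15, initial = 90)

    # Error metrics (RMSE, MAPE, ...) computed across the splits
    metrics <- performance_metrics(df_cv)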

Predict recipe

Use this recipe to predict future values or produce historical residuals using a previously trained model.
Inputs:
  • Folder containing the forecast R objects (from the Train and Evaluate recipe).
  • Dataset containing the evaluation results (from the Train and Evaluate recipe).
Output:
  • Dataset containing the forecasts.

Predict recipe screenshot

Settings

Model Selection. Choose how to select the model used for prediction: Automatic to pick the best model according to an error metric computed on the evaluation dataset input, or Manual to select a model yourself.
Prediction. Choose whether you want to include the history, the forecast, or both. If you are including the forecast, specify the horizon and the probability percentage for the confidence interval.
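
As a minimal sketch with the forecast R package, predicting future values with a given horizon and confidence interval looks like this (the model and parameter values are illustrative; the plugin writes an equivalent result to the output dataset):

    library(forecast)

    fit <- auto.arima(y)                       # any previously trained model on a ts object y
    fc  <- forecast(fit, h = 12, level = 80)   # 12 future points with an 80% confidence interval

    # fc$mean holds the point forecasts; fc$lower and fc$upper hold the interval bounds.
    # residuals(fit) gives the historical residuals included with the "history" option.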

The output dataset is well suited for building charts to visually inspect the forecast results. Examples of such charts are shown below:

History and forecast screenshot
History and residuals screenshot