As an alternative to traditional time series models like ARIMA, you can use deep learning for forecasting.
This how-to walks through building a long short-term memory (LSTM) network, using Keras code in Dataiku’s Visual Machine Learning.
You should have some experience with Deep Learning in Dataiku.
You should have some familiarity with Keras.
You will need access to a code environment with the necessary libraries. When creating a code environment, you can add sets of packages on the Packages to Install tab. Choose the Visual Deep Learning package set that corresponds to the hardware you’re running on.
We’ll work with a dataset containing the daily minimum temperatures recorded in Australia over the course of a decade (1981-1990). Download the data in CSV format, then create a new project and upload the CSV to a new dataset.
The first step in preparing the data is simply to parse the dates from string format into date format, using a Prepare recipe. In the prepared dataset, a basic line chart of temperature by date reveals that the data is quite noisy, so our model will probably learn only the general trend.
The next step is to create windows of input values. We are going to feed the LSTM with windows of 30 temperature values, and expect it to predict the 31st. We do this with a Python code recipe that serializes the window values in string format. The resulting dataset has 3 columns: the date of the target measurement, a vector of 30 values of “input” measured temperatures, and the target temperature.
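The recipe code itself isn’t reproduced here, but the core windowing logic might look like the following sketch (the column names date and temperature, and the comma-separated serialization, are assumptions):

```python
import pandas as pd

WINDOW_SIZE = 30

def make_windows(df, value_col="temperature", date_col="date"):
    # For each row i >= WINDOW_SIZE, serialize the previous 30 values as the
    # "inputs" string and keep row i's own value as the target
    values = df[value_col].tolist()
    rows = []
    for i in range(WINDOW_SIZE, len(df)):
        rows.append({
            date_col: df[date_col].iloc[i],
            "inputs": ",".join(str(v) for v in values[i - WINDOW_SIZE:i]),
            "target": values[i],
        })
    return pd.DataFrame(rows)
```

Each output row pairs 30 consecutive measurements with the measurement that immediately follows them, giving the 3-column dataset described above.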
Finally we are ready to divide the dataset into train and test sets. The model is trained on the first 8 years of data, then tested on the final 2 years of data.
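In Dataiku this split can be done visually, for example with a Split recipe filtering on the parsed date. As a sketch, the equivalent pandas logic, assuming a cutoff at the start of 1989:

```python
import pandas as pd

def split_by_date(df, cutoff="1989-01-01", date_col="date"):
    # First 8 years (1981-1988) for training, last 2 years (1989-1990) for testing
    dates = pd.to_datetime(df[date_col])
    train = df[dates < pd.Timestamp(cutoff)]
    test = df[dates >= pd.Timestamp(cutoff)]
    return train, test
```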
In a Visual Analysis for the training dataset, create a new model with target as the target variable.
This creates a new machine learning task and opens the Design tab for the task. On the Target panel, verify that Dataiku DSS has correctly identified this as a Regression type of ML task.
On the Features Handling panel, verify that date has been rejected as an input.
inputs should, by default, also be rejected as an input because its values appear to be unique. We need a custom processor that unserializes the input string to a vector, and then normalizes the temperature values to be between 0 and 1.
To create the custom processor, go to the Code menu > Libraries in the top navigation bar and create a new file called windowprocessor.py. This file implements the following methods:
fit(), which computes the maximum and minimum values of the dataset
transform(), which normalizes the values to be between 0 and 1
_convert(), which transforms the data from an array of strings to a 2-dimensional array of floats
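The original file contents aren’t reproduced here; a minimal sketch, assuming the windows were serialized as comma-separated strings and that the processor exposes fit() and transform() over the column’s values:

```python
import numpy as np

class WindowProcessor:
    """Unserializes window strings and min-max normalizes the values."""

    def __init__(self, window_size):
        self.window_size = window_size

    def _convert(self, series):
        # Transform an array of serialized strings into a 2-D array of floats
        return np.array([[float(v) for v in row.split(",")] for row in series])

    def fit(self, series):
        # Compute the maximum and minimum values of the dataset
        values = self._convert(series)
        self.min = values.min()
        self.max = values.max()

    def transform(self, series):
        # Normalize the values to be between 0 and 1
        values = self._convert(series)
        return (values - self.min) / (self.max - self.min)
```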
Back in the Features Handling panel of the deep learning model, turn on inputs as a feature. Select Text as the variable type and Custom preprocessing as the type of text handling. Erase the default code and replace it with a call to the custom processor that tells it our window has 30 values.
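A minimal sketch of that preprocessing code, assuming the class in windowprocessor.py is named WindowProcessor and that Dataiku picks up a variable named processor:

```python
from windowprocessor import WindowProcessor

# Instantiate the custom processor with our window length of 30 values
processor = WindowProcessor(30)
```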
This custom features handling creates a new input to the deep learning model called inputs_preprocessed. We’ll use that in the specification of the deep learning architecture.
We need to import the LSTM and Reshape layers in order to specify our architecture. Replace the first line of code with the following.
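The exact default line isn’t shown here; a plausible replacement that brings in the extra layers is:

```python
from keras.layers import Input, Dense, LSTM, Reshape
from keras.models import Model
```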
We now have to create our network architecture in the build_model() function. Delete the default contents of build_model() and insert the following.
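The original code isn’t reproduced here; this is a sketch matching the description below, in which the 64-unit layer sizes and the exact Dataiku signature are assumptions:

```python
from keras.layers import Input, Dense, LSTM, Reshape
from keras.models import Model

window_size = 30  # must match the window length used in the preprocessing

def build_model(input_shapes, n_classes=None):
    # Dataiku passes input_shapes per model input; for clarity we hardcode
    # the (window_size,) shape of the "inputs_preprocessed" feature
    input_main = Input(shape=(window_size,), name="inputs_preprocessed")
    x = Reshape((window_size, 1))(input_main)  # (batch, 30) -> (batch, 30, 1)
    x = LSTM(64, return_sequences=True)(x)     # first LSTM passes the full sequence on
    x = LSTM(64)(x)                            # second LSTM keeps only its final state
    predictions = Dense(1)(x)                  # linear activation: one regression output
    return Model(inputs=input_main, outputs=predictions)
```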
There are three hidden layers. First is a Reshape layer, to convert from a shape of (batch_size, window_size) to (batch_size, window_size, dimension). Since we only have one input variable at each time step, the dimension is 1. After the reshaping, we can stack 2 layers of LSTM. The output layer is a fully connected layer, Dense, with one output neuron. By default its activation function is linear, which is appropriate for a regression problem.
We’ll make no changes to the compile_model() function or the Training panel. If you have multiple code environments for deep learning, on the Runtime Environment panel, you should confirm that Dataiku has selected the code environment you want to use with this ML task.
Click Train and, when complete, deploy this model in the flow and score it on the test set. In the scored dataset, create a basic line chart of the temperatures by date, and you can see the model managed to pick up the general trend. It does not perfectly fit the curve, because it is generalizing. The minimum temperature in a country as vast as Australia can fluctuate a lot in a pseudo-random fashion.