Forecasting Time Series With R

October 06, 2017

Forecasting time series data with R and Dataiku DSS

Do you day-trade stocks? Monitor humidity in the Amazon rainforest? Predict weekly orange production in the Florida Keys? If so, you’re using time series!

A time series is a sequence of measurements of the same variable taken at regular intervals. Time series occur everywhere in data science, and R has several great packages built specifically to handle them.

This How-To walks through a time series analysis in R using Dataiku DSS. We’ll show how to explore time series data, choose an appropriate modeling method and deploy the model in DSS. Let’s get started!

Preparing the data

We’ll use a dataset with the monthly totals for international airline passengers, provided by DataMarket. When we upload the data into DSS, it automatically recognizes the Month column as a date that needs parsing. Pretty cool.

Raw data

A simple preparation step can convert this date to the standard format. See our documentation on dates for more information on this step.

We’ll also rename the column with the number of monthly passengers to International_passengers.

Cleaned data

Great! After running this recipe, our data is cleaned and ready for analysis.

Plotting

First, let’s create a chart to get a feel for the data. To do this, open the international_airline_passengers_prepared dataset and then click the Charts tab.

We select a Lines chart type, and drag Month_parsed into the field for the x-axis and International_passengers for the y-axis. After changing the date range, we have the line chart shown below.

Line chart of monthly average number of international airline passengers

We see two really interesting patterns. First, there’s a general upward trend in the number of passengers. Second, there is a yearly cycle, with the lowest number of passengers occurring around the new year and the highest number during late summer. Let’s see if we can use these patterns to forecast the number of passengers after 1960.
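
Both patterns can also be checked numerically once the data is in R. The sketch below is an illustration rather than part of the original flow: it assumes that R’s built-in AirPassengers dataset (monthly airline passenger totals for 1949 to 1960) is a reasonable stand-in for the uploaded data, and it uses decompose() from base R to separate the trend from the seasonal component.

```r
# Assumption: the built-in AirPassengers series stands in for the DSS dataset.
passengers <- AirPassengers

# A multiplicative decomposition suits a series whose seasonal swings
# grow along with its level.
parts <- decompose(passengers, type = "multiplicative")

# Trend: compare the average monthly total in the first and last years.
mean_1949 <- mean(window(passengers, start = c(1949, 1), end = c(1949, 12)))
mean_1960 <- mean(window(passengers, start = c(1960, 1), end = c(1960, 12)))

# Seasonality: one multiplicative index per calendar month (1 = average month).
seasonal_index <- setNames(parts$figure, month.abb)
peak_month <- names(which.max(seasonal_index))
low_month <- names(which.min(seasonal_index))

print(round(seasonal_index, 2))
cat("Mean 1949:", round(mean_1949), " Mean 1960:", round(mean_1960), "\n")
cat("Peak month:", peak_month, " Lowest month:", low_month, "\n")
```

A seasonal index above 1 marks a month that is busier than average, and an index below 1 marks a quieter one.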

Interactive analysis with R

To start a notebook, we go back to the Flow, click on the international_airline_passengers_prepared dataset, then click Lab, New Code Notebook, R, and finally Create.

Creating a new R notebook

Dataiku DSS will then open an R notebook with some basic starter code already filled in.

R notebook with starter code

Sweet. Now that we have an R notebook, we’ll focus on the code. You can type the following code into the notebook for interactive analysis.

First, let’s load the R libraries that we need for this analysis. The dataiku library lets us read and write datasets to Dataiku DSS. The forecast library has the functions we need for training models to predict time series. The dplyr package has functions for manipulating data frames.

library(dataiku)
library(forecast)
library(dplyr)

Next, we’ll load the data into R from Dataiku DSS:

df <- dkuReadDataset("international_airline_passengers_prepared",
                    samplingMethod="head", nbRows=100000)
head(df)

The top few rows of the data frame, displayed in R

Great! Now that we’ve loaded our data, let’s create a time series object using the ts() function.

This function takes a numeric vector, the start time and the frequency of measurement. For us, these values are the number of international passengers, 1949 (the year for which the measurements begin) and a frequency of 12 (months in a year).

ts_passengers = ts(df$International_passengers,
                    start=1949,
                    frequency=12)
plot(ts_passengers)

Time series plot of the average number of passengers by month
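
To see what ts() stores, it can help to inspect the object directly. This is a minimal sketch that assumes R’s built-in AirPassengers dataset is a stand-in for df$International_passengers; the names values, ts_demo and last_year are made up for the illustration.

```r
# Assumption: AirPassengers (built into R) stands in for df$International_passengers.
values <- as.numeric(AirPassengers)

# start = c(1949, 1) means "first period of 1949"; with frequency = 12 that is January.
ts_demo <- ts(values, start = c(1949, 1), frequency = 12)

print(start(ts_demo))      # year and period of the first observation
print(end(ts_demo))        # year and period of the last observation
print(frequency(ts_demo))  # observations per year

# time() returns each observation's position as a decimal year,
# and cycle() returns its position within the year (1 = January here).
print(head(time(ts_demo), 3))
print(head(cycle(ts_demo), 3))

# window() extracts a sub-period, here the final calendar year.
last_year <- window(ts_demo, start = c(1960, 1), end = c(1960, 12))
print(last_year)
```

Writing start = 1949, as in the code above, is equivalent to start = c(1949, 1), because a single number is read as the first period of that year.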

Excellent. We have our time series. It’s time to start modeling!

Choosing a forecasting model

We’re going to try three different forecasting methods and deploy the best to DSS. In general, it’s good practice to test several different modeling methods and choose the method that provides the best performance.

Model 1: Exponential Smoothing (ETS)

The ets() function in the forecast package fits exponential smoothing state space (ETS) models. This function automatically optimizes the choice of model and the necessary parameters. All you have to do is provide it with a time series.

Let’s use it and then make a forecast for the next 24 months.

m_ets = ets(ts_passengers)
f_ets = forecast(m_ets, h=24) # forecast 24 months into the future
plot(f_ets)

Plot of ETS forecasts

Looking good! The forecast is shown in blue, with the shaded areas representing the 80% and 95% prediction intervals. Just by looking, we see that the forecast roughly matches the historical pattern of the data.
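
The numbers behind the plot live on the object that forecast() returns. The sketch below assumes the built-in AirPassengers dataset as a stand-in for ts_passengers; fc_table and n_levels are names invented for this example.

```r
library(forecast)

# Assumption: AirPassengers (built into R) stands in for ts_passengers.
fit <- ets(AirPassengers)
fc <- forecast(fit, h = 24)

print(fit$method)   # the error/trend/seasonal combination that ets() selected
print(fc$level)     # interval levels, in percent

# One row per forecast month: point forecast plus the widest interval.
n_levels <- length(fc$level)
fc_table <- data.frame(
  point = as.numeric(fc$mean),
  lower = as.numeric(fc$lower[, n_levels]),
  upper = as.numeric(fc$upper[, n_levels])
)
print(head(fc_table))
```

The lower and upper components are matrices with one column per interval level, which is why the last column is picked out for the widest interval.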

Model 2: ARIMA

The auto.arima() function provides another modeling method: ARIMA (autoregressive integrated moving average). It automatically searches for the best model order and optimizes the parameters. Unless you already know which order you want, auto.arima() is usually more convenient than calling the Arima() function directly.

Let’s give it a shot.

m_aa = auto.arima(ts_passengers)
f_aa = forecast(m_aa, h=24)
plot(f_aa)

Plot of ARIMA forecasts

Great! These prediction intervals seem a bit narrower than those for the ETS model. Maybe this is because of a better fit to the data, but let’s train a third model before doing a model comparison.

Model 3: TBATS

The last model we’re going to train is a TBATS model. This model is designed for use when there are multiple cyclic patterns (e.g. daily, weekly and yearly patterns) in a single time series. Maybe it will be able to detect complicated patterns in our time series.

m_tbats = tbats(ts_passengers)
f_tbats = forecast(m_tbats, h=24)
plot(f_tbats)

Plot of TBATS forecasts

Now we have three models that all seem to give reasonable predictions. Let’s compare them to see which is performing the best.

Model comparison

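We’ll use AIC (the Akaike information criterion) to compare the different models. AIC is a common measure of how well a model fits the data that also penalizes more complex models. The model with the smallest AIC is preferred. Keep in mind that AIC values are strictly comparable only between models of the same class fitted to the same data, so treat a comparison across ETS, ARIMA and TBATS as a rough guide.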
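
To make the penalty concrete: AIC is defined as 2k minus twice the maximized log-likelihood, where k is the number of estimated parameters. The sketch below recomputes it by hand for one model. It uses arima() from base R instead of the forecast package, assumes the built-in AirPassengers dataset as a stand-in for ts_passengers, and picks the model order by hand purely for illustration.

```r
# Assumption: AirPassengers (built into R) stands in for ts_passengers, and the
# model order below is picked by hand purely for illustration.
fit <- arima(AirPassengers,
             order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

ll <- logLik(fit)
k <- attr(ll, "df")               # number of estimated parameters
aic_manual <- 2 * k - 2 * as.numeric(ll)

print(c(manual = aic_manual, builtin = AIC(fit)))
```

Each of the three models fitted earlier stores this quantity on its model object, which is what the bar chart code reads: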

barplot(c(ETS=m_ets$aic, ARIMA=m_aa$aic, TBATS=m_tbats$AIC),
    col="light blue",
    ylab="AIC")

Comparing AIC across models
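
A complementary check is to hold out the last part of the series and compare forecast errors on it. The following is a sketch under stated assumptions, not the method used in this How-To: it takes the built-in AirPassengers dataset as a stand-in for ts_passengers, and the two-year holdout is an arbitrary choice.

```r
library(forecast)

# Assumption: AirPassengers (built into R) stands in for ts_passengers.
train <- window(AirPassengers, end = c(1958, 12))    # fit on 1949-1958
test  <- window(AirPassengers, start = c(1959, 1))   # hold out 1959-1960
h <- length(test)

fits <- list(
  ETS   = ets(train),
  ARIMA = auto.arima(train),
  TBATS = tbats(train)
)

# accuracy() compares a forecast with the held-out observations.
rmse <- sapply(fits, function(fit) {
  accuracy(forecast(fit, h = h), test)["Test set", "RMSE"]
})
print(round(rmse, 1))
```

A lower RMSE on the held-out months means a closer forecast. The AIC bar chart above remains the basis for the choice made in the rest of this walkthrough.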

We see that the ARIMA model has the lowest AIC, so we’ll treat it as the best of the three. Let’s go ahead and turn our interactive notebook into an R recipe that can be integrated into our Dataiku DSS workflow.

But before we can do this, we have to turn the output of forecast() into a data frame, so that we can pass it to Dataiku DSS.

The following code:

  1. Finds the last date for which we have a measurement.
  2. Creates a data frame with the prediction for each month. We’ll also include the lower and upper bounds of the predictions, and the date. Since we’re representing dates by the year, each month is 1/12 of a year.
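  3. Splits the date column into separate columns for year and month.

# time() is part of base R and gives each observation's decimal year
last_date = time(ts_passengers)[length(ts_passengers)]
data.frame(passengers_predicted=f_aa$mean,
           passengers_lower=f_aa$lower[,2],   # column 2 holds the 95% bounds
           passengers_upper=f_aa$upper[,2],
           date=last_date + seq(1/12, 2, by=1/12)) %>%
    # the small offset guards against floating-point error at year boundaries
    mutate(year=floor(date + 1e-6)) %>%
    mutate(month=round((date - year) * 12) + 1) -> forecast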


Awesome! Now we have everything we need to deploy the model onto DSS: the code to create the forecast for the next 24 months and the code to convert the result into a data.frame.
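
The year and month arithmetic relies on decimal years, where sums of 1/12 are not always exact in floating point. The base-R sketch below checks the conversion over the 24 forecast months. The helper decimal_to_year_month is hypothetical (it is not part of any package), and the Date column at the end is an optional extra for downstream tools that prefer real dates.

```r
# Hypothetical helper (not part of any package): split decimal years from a
# monthly ts into calendar year and month.
decimal_to_year_month <- function(date) {
  # The small offset keeps a value such as 1960.9999999 from flooring to 1960.
  year <- as.integer(floor(date + 1e-6))
  month <- as.integer(round((date - year) * 12)) + 1L
  data.frame(year = year, month = month)
}

# A monthly series covering the 24 forecast months, January 1961 onwards.
future <- ts(rep(NA_real_, 24), start = c(1961, 1), frequency = 12)
ym <- decimal_to_year_month(as.numeric(time(future)))

# Cross-check against cycle(), which reports the month position directly,
# then build a proper Date column (first day of each month).
stopifnot(all(ym$month == as.numeric(cycle(future))))
ym$date <- as.Date(sprintf("%d-%02d-01", ym$year, ym$month))
print(head(ym, 3))
```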

Deploying the model in DSS

To deploy our model, we need to create a new R recipe. To do this, click on +Create Recipe in the notebook. Ensure that the international_airline_passengers_prepared dataset is the input dataset, and create a new managed dataset, forecast, for the output of the recipe.

Setting the output of the R code recipe

Create the recipe, and DSS opens the recipe editor with the code from the notebook in the recipe.

R code recipe

We could trim the code in the recipe so that it runs only the portions needed to produce the forecast dataset, but for now run the recipe as is and then return to the Flow, where we see our newly created dataset.

Final flow with original dataset, Prepare recipe, and R code recipe

Open the forecast dataset and let’s look at our new predictions.

Explore view of the forecast dataset

If you thought this was helpful, check out our other tutorials.