
Do you day-trade stocks? Monitor humidity in the Amazon rainforest? Predict weekly orange production in the Florida Keys? If so, you’re using time series!

A time series is a sequence of measurements of the same variable taken at regular intervals. Time series occur everywhere in data science, and R has several great packages built specifically to handle them.

This tutorial walks through a time series analysis in R using Dataiku DSS. I’m going to show you how to explore time series data, choose an appropriate modeling method and deploy the model in DSS. Let’s get started!

I’m using a dataset with the monthly totals for international airline passengers, provided by DataMarket. When I upload the data into DSS, it automatically recognizes the Month column as a date that needs parsing. Pretty cool.

A simple preparation step can convert this date to the standard format. See our documentation on dates for more information on this step.

Great! Now our data is cleaned and ready for analysis.

First, I’m going to create a chart to get a feel for the data. To do this, I click on `internation_airline_passengers_cleaned` and then the *Analyse* icon.

Then I’m going to click on *charts* at the top, and drag `Month_parsed` into the field for the x-axis and `Internation airline passegers` into the y-axis. After a bit of tweaking, we have the line chart shown below.

We see two really interesting patterns. First, there’s a general upward trend in the number of passengers. Second, there is a yearly cycle, with the lowest number of passengers occurring around the new year and the highest number in late summer. Let’s see if we can use these patterns to forecast the number of passengers after 1960.

To start a notebook, I go back to the flow, click on the `internation_airline_passengers_cleaned` dataset, click on the R icon and then click “Notebook Interactive visualisation and analysis of your data”.

DSS will then open an R notebook with some basic starter code already filled in.

Sweet. Now that we have an R notebook, I’m going to stop with the screenshots and just show the code. You can type the following code into the notebook for interactive analysis.

First, I’m going to load the R libraries that we need for this analysis. The `dataiku` library lets us read and write datasets to DSS. The `forecast` library has the functions we need for training models to predict time series. The `dplyr` package has functions for manipulating data.frames.

```
library(dataiku)
library(forecast)
library(dplyr)
```

Then, I’m going to load the data into R from DSS:

```
ds <- "AIRLINE_PASSENGERS.internation_airline_passengers_cleaned"
intl_passengers <- read.dataset(ds)
head(intl_passengers)
```

Great! Now that we’ve loaded our data, let’s create a time series object using the `ts()` function.

This function takes a numeric vector, the start time and the frequency of measurement. For us, these values are the number of international passengers, 1949 (the year for which the measurements begin) and a frequency of 12 (months in a year).

```
# Monthly series (12 observations per year) starting in January 1949
ts_passengers = ts(intl_passengers$International_Passengers,
                   start=1949,
                   frequency=12)
plot(ts_passengers)
```
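To double-check that `ts()` attached the calendar we intended, it can help to inspect the object before modeling. Here’s a minimal sketch under one big assumption: it uses R’s built-in `AirPassengers` series (monthly airline passenger totals, 1949 to 1960) as a stand-in for the DSS dataset, and the names `values`, `ts_check` and `last_year` are only illustrative.

```r
# Stand-in for intl_passengers$International_Passengers: the raw values of
# R's built-in AirPassengers series (assumed to match the DSS dataset).
values <- as.numeric(AirPassengers)

# start = c(1949, 1) spells out "January 1949"; start = 1949 is equivalent.
ts_check <- ts(values, start = c(1949, 1), frequency = 12)

print(start(ts_check))      # first (year, month) of the series
print(end(ts_check))        # last (year, month) of the series
print(frequency(ts_check))  # observations per year

# window() extracts a sub-period, here the twelve months of 1960.
last_year <- window(ts_check, start = c(1960, 1), end = c(1960, 12))
print(last_year)
```

If the start, end or frequency look wrong at this point, every forecast date further down will be shifted too, so it’s a cheap check to run.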

Excellent. We have our time series. It’s time to start modeling!
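Before fitting anything, the two patterns spotted in the chart (an upward trend and a yearly cycle) can be separated out numerically. A minimal sketch using base R’s `decompose()`, with the built-in `AirPassengers` series as an assumed stand-in for `ts_passengers`. The multiplicative form is an assumption on my part, chosen because the seasonal swings appear to grow with the level of the series.

```r
# AirPassengers is used as an assumed stand-in for ts_passengers.
parts <- decompose(AirPassengers, type = "multiplicative")

# parts$trend    : centred moving average (NA at both ends of the series)
# parts$seasonal : the monthly factors repeated over the whole series
# parts$figure   : the 12 monthly factors, January to December
plot(parts)

monthly_factor <- setNames(parts$figure, month.abb)
print(round(monthly_factor, 2))          # above 1 means busier than the trend
print(names(which.max(monthly_factor)))  # busiest month relative to trend
```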

I’m going to try three different forecasting methods and deploy the best to DSS. In general, it’s good practice to try several different modeling methods and go with whichever provides the best performance.

The `ets()` function in the `forecast` package fits exponential smoothing state space (ETS) models. This function automatically optimizes the choice of model and the necessary parameters. All you have to do is provide it with a time series.

Let’s use it and then make a forecast for the next 24 months.

```
m_ets = ets(ts_passengers)
f_ets = forecast(m_ets, h=24) # forecast 24 months into the future
plot(f_ets)
```
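To see what `ets()` actually picked and what `forecast()` returns, the objects can be inspected directly. A small sketch, again assuming the built-in `AirPassengers` series as a stand-in for `ts_passengers`; the names `m`, `f` and `preview` are illustrative.

```r
library(forecast)

# AirPassengers is an assumed stand-in for ts_passengers.
m <- ets(AirPassengers)
f <- forecast(m, h = 24)

print(m$method)  # label of the selected model (error, trend and season types)
print(f$level)   # interval levels that forecast() computes by default

# f$mean holds the point forecasts; f$lower and f$upper are matrices with
# one column per level in f$level.
preview <- data.frame(point   = as.numeric(f$mean),
                      lower95 = as.numeric(f$lower[, 2]),
                      upper95 = as.numeric(f$upper[, 2]))
print(head(preview))
```

The second column of `f$lower` and `f$upper` is the wider (95%) interval, which is the one used later when the forecast is turned into a data.frame.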

Looking good! The forecast is shown in blue, with the shaded bands representing the 80% and 95% prediction intervals. Just by looking, we see that the forecast roughly matches the historical pattern of the data.

The `auto.arima()` function provides another modeling method. More info on the ARIMA model can be found here. The `auto.arima()` function automatically searches for the best model and optimizes the parameters. Using `auto.arima()` is almost always better than calling the `Arima()` function directly.

Let’s give it a shot.

```
m_aa = auto.arima(ts_passengers)
f_aa = forecast(m_aa, h=24)
plot(f_aa)
```

Great! These prediction intervals seem a bit narrower than those for the ETS model. Maybe this is because of a better fit to the data, but let’s train a third model before doing a model comparison.
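Rather than eyeballing the plots, the interval widths can be compared numerically. A rough sketch, assuming the built-in `AirPassengers` series as a stand-in for `ts_passengers`; `mean_width_95()` is a hypothetical helper written for this example, not part of the `forecast` package.

```r
library(forecast)

# AirPassengers is an assumed stand-in for ts_passengers.
f_ets_demo <- forecast(ets(AirPassengers), h = 24)
f_aa_demo  <- forecast(auto.arima(AirPassengers), h = 24)

# Hypothetical helper: average width of the 95% interval over the horizon.
mean_width_95 <- function(f) {
  col95 <- which(f$level == 95)
  mean(as.numeric(f$upper[, col95] - f$lower[, col95]))
}

widths <- c(ETS = mean_width_95(f_ets_demo), ARIMA = mean_width_95(f_aa_demo))
print(round(widths, 1))
```

Narrower intervals are not proof of a better model on their own, since each method estimates its uncertainty under its own assumptions.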

The last model I’m going to train is a TBATS model. This model is designed for use when there are multiple cyclic patterns (e.g. daily, weekly and yearly patterns) in a single time series. Maybe it will be able to detect complicated patterns in our time series.

```
m_tbats = tbats(ts_passengers)
f_tbats = forecast(m_tbats, h=24)
plot(f_tbats)
```
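Our monthly series has only one seasonal period, so it doesn’t really exercise what TBATS is designed for. For illustration only, here is a sketch of how a series with several cycles could be declared with `msts()` from the `forecast` package before being handed to `tbats()`. The daily data below is simulated and entirely hypothetical.

```r
library(forecast)

set.seed(1)
n_days <- 3 * 365
day <- seq_len(n_days)

# Simulated daily values with a weekly and a yearly cycle plus noise.
daily <- 100 +
  10 * sin(2 * pi * day / 7) +
  20 * sin(2 * pi * day / 365.25) +
  rnorm(n_days)

# msts() records every seasonal period, which a plain ts() object cannot do.
daily_msts <- msts(daily, seasonal.periods = c(7, 365.25))
print(attr(daily_msts, "msts"))  # the declared seasonal periods

# Fitting can take a while on long series, so it is left commented out:
# m_daily <- tbats(daily_msts)
# plot(forecast(m_daily, h = 60))
```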

Now we have three models that all seem to give reasonable predictions. Let’s compare them to see which is performing the best.

I’m going to use AIC (the Akaike information criterion) to compare the different models. AIC is a common measure of how well a model fits the data that also penalizes more complex models. The model with the *smallest* AIC is preferred.

```
# Note the element names: ets and ARIMA objects store `aic`, tbats stores `AIC`
barplot(c(ETS=m_ets$aic, ARIMA=m_aa$aic, TBATS=m_tbats$AIC),
        col="light blue",
        ylab="AIC")
```
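One caveat: AIC values come from each model’s own likelihood, so comparing them across different model families is only a rough guide. A common complement is to hold out the last part of the series and measure forecast error on it. Here’s a minimal sketch of that idea, assuming the built-in `AirPassengers` series as a stand-in for `ts_passengers`; the split point (end of 1958) and the choice of RMSE are my assumptions, not part of the original analysis.

```r
library(forecast)

# AirPassengers is an assumed stand-in for ts_passengers.
train <- window(AirPassengers, end = c(1958, 12))   # 1949 to 1958
test  <- window(AirPassengers, start = c(1959, 1))  # 1959 to 1960, held out

fitters <- list(ETS = ets, ARIMA = auto.arima, TBATS = tbats)

# Fit each method on the training years, forecast the held-out months,
# and keep the out-of-sample RMSE.
test_rmse <- sapply(fitters, function(fit) {
  f <- forecast(fit(train), h = length(test))
  accuracy(f, test)["Test set", "RMSE"]
})

print(round(test_rmse, 1))
barplot(test_rmse, col = "light blue", ylab = "Test RMSE")
```

As with AIC, smaller is better. If the two rankings disagree, the hold-out error is usually the more direct indication of how the forecasts will behave on new data.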

The ARIMA model has the lowest AIC, so by this measure it performs the best. So, let’s go ahead and turn our interactive R code into an R recipe that can be built into our DSS workflow.

But before we can do this, we have to turn the output of `forecast()` into a data.frame, so that we can store it in DSS.

First, I’m going to find the last date for which we have a measurement.

```
# time() is in base R's stats package, so no extra package is needed
last_date = time(ts_passengers)[length(ts_passengers)]
```

Then, I’m going to create a data.frame with the prediction for each month. I’m also going to include the lower and upper bounds of the predictions, and the date. Since we’re representing dates by the year, each month is 1/12 of a year.

```
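# Column 2 of lower/upper holds the 95% bounds (column 1 is 80%)
forecast_df = data.frame(passengers_predicted=as.numeric(f_aa$mean),
                         passengers_lower=as.numeric(f_aa$lower[,2]),
                         passengers_upper=as.numeric(f_aa$upper[,2]),
                         date=last_date + seq(1/12, 2, by=1/12))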
```

Finally, we split the date column into separate columns for year and month.

```
# Round to whole months first so floating-point error cannot shift a month
forecast_df = forecast_df %>%
  mutate(year=round(date * 12) %/% 12) %>%
  mutate(month=round(date * 12) %% 12 + 1)
```
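The year and month split relies on floating-point arithmetic, where a date that should be exactly 1961 can end up stored as a hair below it. A small base R sketch of one defensive way to do the conversion, by rounding to a whole number of months before splitting. `decimal_to_year_month()` is a hypothetical helper written for this example.

```r
# Hypothetical helper: convert decimal years (as used by ts objects)
# into a calendar year and a month number from 1 to 12.
decimal_to_year_month <- function(date) {
  months_total <- round(date * 12)  # whole months counted from year 0
  data.frame(year  = months_total %/% 12,
             month = months_total %% 12 + 1)
}

# The four months that follow December 1960.
dec_1960 <- 1960 + 11/12
next_months <- decimal_to_year_month(dec_1960 + (1:4) / 12)
print(next_months)
```

The result should be January to April of 1961, regardless of how the intermediate sums are rounded internally.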

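All together, the code is:

```
# Decimal-year date of the last observation
last_date = time(ts_passengers)[length(ts_passengers)]
# Column 2 of lower/upper holds the 95% bounds (column 1 is 80%)
forecast_df = data.frame(passengers_predicted=as.numeric(f_aa$mean),
                         passengers_lower=as.numeric(f_aa$lower[,2]),
                         passengers_upper=as.numeric(f_aa$upper[,2]),
                         date=last_date + seq(1/12, 2, by=1/12))
# Round to whole months first so floating-point error cannot shift a month
forecast_df = forecast_df %>%
  mutate(year=round(date * 12) %/% 12) %>%
  mutate(month=round(date * 12) %% 12 + 1)
```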

Awesome! Now we have everything we need to deploy the model onto DSS: the code to create the forecast for the next 24 months and the code to convert the result into a data.frame.

To deploy our model, we need to create a new R recipe. To do this, click on `internation_airline_passengers_cleaned`, then click on the R icon on the right, then click on “Recipe Create new datasets using R code”.

I’m going to create a new managed dataset, `forecast`, for the output of my recipe.

Once I click on create at the top center of the screen, DSS opens a text editor with the basics of an R recipe. I’m going to copy and paste the code for the ARIMA model in between the `read.dataset()` and `write.dataset_with_schema()` calls.

The final recipe looks like this.
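As a rough sketch of how the pieces could fit together in one recipe, the modeling steps can be wrapped in a function that sits between the read and write calls. Everything here is an assumption pieced together from the steps above: `build_forecast_df()` is a hypothetical helper, the exact arguments of `read.dataset()` and `write.dataset_with_schema()` are assumed, and the DSS calls are left as comments because they only work inside DSS.

```r
library(forecast)
library(dplyr)

# Hypothetical helper: the modeling step, kept separate from DSS input and
# output. `passengers` is a numeric vector of monthly totals from Jan 1949.
build_forecast_df <- function(passengers, horizon = 24) {
  ts_passengers <- ts(passengers, start = 1949, frequency = 12)
  f_aa <- forecast(auto.arima(ts_passengers), h = horizon)
  last_date <- time(ts_passengers)[length(ts_passengers)]

  data.frame(passengers_predicted = as.numeric(f_aa$mean),
             passengers_lower = as.numeric(f_aa$lower[, 2]),  # 95% lower bound
             passengers_upper = as.numeric(f_aa$upper[, 2]),  # 95% upper bound
             date = last_date + seq_len(horizon) / 12) %>%
    mutate(year = round(date * 12) %/% 12,
           month = round(date * 12) %% 12 + 1)
}

# Inside a DSS R recipe (assumed call signatures, not runnable elsewhere):
# library(dataiku)
# intl_passengers <- read.dataset("AIRLINE_PASSENGERS.internation_airline_passengers_cleaned")
# forecast_df <- build_forecast_df(intl_passengers$International_Passengers)
# write.dataset_with_schema(forecast_df, "forecast")

# Outside DSS, the built-in AirPassengers values serve as a stand-in.
forecast_df <- build_forecast_df(as.numeric(AirPassengers))
print(head(forecast_df))
```

Keeping the modeling code in its own function makes it possible to try the recipe logic in a notebook or plain R session before running it in the flow.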

That’s it! Now we can click on run at the bottom of the page and return to the DSS flow where we see our newly created forecast dataset.

Clicking on the `forecast` dataset lets us look at our new predictions stored as a DSS dataset.

If you thought this was helpful, check out our other tutorials.