Tutorial: Automation

Welcome

The lifecycle of a data project doesn’t end when a flow is complete. You can never develop and test a model to the point where it’s “perfect”, because data is always changing. At the same time, keeping your data and models up to date isn’t something you want to do by hand. Dataiku DSS has automation features to help with this.

In this tutorial, you will learn the basics of:

  • scheduling jobs using scenarios
  • monitoring jobs
  • monitoring the status and quality of your datasets and models

We will work with the fictional retailer Haiku T-Shirt’s data.

Prerequisites

This tutorial assumes that you have some familiarity with the Dataiku basics.

Create your project

Click on the Tutorials button in the left pane, and select Tutorial: Automation in Dataiku DSS. Click on Go to Flow.

There are a few things to note in the flow:

  • The Customers and Orders datasets are created from files in managed folders. For the purposes of this tutorial, assume that these folders are updated regularly by some process external to Dataiku DSS.
  • The datasets are joined so that the order data is enriched with the customer data.
  • The joined data is prepared, at which point the Flow forks.
  • One branch builds a random forest model to predict whether a customer will become high revenue. This is similar to the predictive model built in Tutorial: Machine learning.
  • The other branch filters rows to identify customers who have placed their first order, so that we can monitor whether the company continues to attract new customers.

For the purposes of this tutorial, the Flow is already complete, and our goal is to automate and monitor retraining of the model and identification of first-time customers.

Automating tasks with a scenario

We will create a simple Dataiku DSS scenario to automate the tasks of retraining the prediction model and identifying first-time customers. Navigate to the Scenarios tab of the Jobs section of the UI. Click + New Scenario. Name the scenario Rebuild data and retrain model and click Create.

Key concept: scenarios

In DSS, a scenario is the way to automate tasks (most commonly, building or rebuilding some datasets and models).

There are two required components to a scenario:

  • the triggers that activate a scenario and cause it to run, and
  • the steps, or actions, that a scenario takes when it runs

There are many predefined triggers and steps, making the process of automating flow updates flexible and easy to do. For greater customization, you can create Python triggers and steps.
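
For instance, a custom Python trigger is just a short script that decides whether to fire. The sketch below is only an illustration, assuming the Trigger helper that DSS exposes to custom trigger code (from the dataiku.scenario package); the weekday condition is a made-up example:

    # Minimal sketch of a custom Python trigger.
    # Assumes the dataiku.scenario.Trigger helper available to custom trigger
    # code; the weekday condition is a made-up example.
    from datetime import datetime

    from dataiku.scenario import Trigger

    t = Trigger()

    # Fire the scenario only on weekdays (Monday=0 ... Friday=4).
    if datetime.now().weekday() < 5:
        t.fire()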

While optional, it's often useful to set up a reporting mechanism to notify you of scenario results (for example, send an email if the scenario fails).

You define triggers on the Settings tab of a scenario. Say we want to track our customer acquisition and retrain the model whenever new data arrives. We want a trigger that “watches” the input datasets and runs the scenario as soon as the data in those inputs has changed.

  • Click Add Trigger and select Trigger on dataset change.
  • Name the trigger Customers or Orders.
  • Choose to check the trigger condition every 30 seconds.
  • Add both Customers and Orders as the datasets to monitor for changes.

Now navigate to the Steps tab of the scenario to define the actions it takes.

  • Click Add Step and select Build / Train.
  • Name the step Filtered orders.
  • Select Build required datasets as the build mode.
  • Click + Dataset and select Orders_filtered as the dataset to rebuild.

This scenario will now rebuild the Orders_filtered dataset whenever the Orders or Customers dataset changes. Since the build mode is set to Build required datasets, Dataiku will check whether the upstream datasets have changed and update each dataset in the Flow as needed.

Now let’s add the step to retrain the model.

  • Click Add Step and select Build / Train.
  • Name the step High revenue prediction.
  • Click + Model and select High revenue prediction.

The scenario is now ready to use. To test it, click Run Now. Click Last runs to follow the progress of the scenario. You can examine the details of each run, the jobs triggered by each step in the scenario, and the outputs those jobs produce.

This scenario will now run whenever the Orders or Customers dataset changes, rebuilding the Orders_filtered dataset and retraining the High revenue prediction model.
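
The same two actions could also be written as one custom Python step instead of the two visual steps. The snippet below is only a sketch, assuming the Scenario helper from the dataiku.scenario package; the model ID is a placeholder you would copy from the Flow:

    # Sketch of a custom Python scenario step equivalent to the two visual steps.
    # Assumes the dataiku.scenario.Scenario helper; "HIGH_REVENUE_MODEL_ID" is a
    # placeholder for the saved model's real ID.
    from dataiku.scenario import Scenario

    scenario = Scenario()

    # Rebuild the filtered orders dataset.
    scenario.build_dataset("Orders_filtered")

    # Retrain the High revenue prediction saved model.
    scenario.train_model("HIGH_REVENUE_MODEL_ID")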

Over time, we can track scenarios on the Monitoring tab to view patterns of successes and failures.
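
Runs can also be started and inspected programmatically through the public Python API, for example from an external monitoring script. The sketch below is a rough illustration; the instance URL, API key, project key, scenario ID, and the exact contents of the returned run objects are all assumptions to adapt to your instance:

    # Sketch: start a scenario run and list recent runs via the public API.
    # URL, API key, project key and scenario ID are placeholders.
    import dataikuapi

    client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_API_KEY")
    project = client.get_project("DKU_TUTORIAL_AUTOMATION")
    scenario = project.get_scenario("REBUILDDATAANDRETRAINMODEL")

    scenario.run_and_wait()          # start a run and wait for it to finish
    for run in scenario.get_last_runs():
        print(run)                   # one entry per past run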

Before we move on, make sure to enable the scenario’s “Auto-triggers” setting; when you create a scenario, its triggers are inactive by default.

  • Go to the Settings tab of the scenario
  • Set Auto-triggers to On
  • Save the scenario

Monitoring with metrics and checks

In addition to the Monitoring dashboard, we can use metrics and checks to monitor:

  • whether the company continues to attract new customers at the expected rate, and
  • whether the model degrades over time

Key concept: Metrics and checks

The Dataiku DSS metrics system provides a way to compute various measurements on objects in the Flow, such as the number of records in a dataset or the time to train a model.

The Dataiku DSS checks system allows you to set up conditions for monitoring metrics. For example, you can define a check that verifies that the number of records in a dataset never falls to zero. If the check condition is no longer met, the check fails, and the scenario that runs it can fail as well, triggering alerts.
You can also define more advanced checks, such as "verify that the average basket size does not deviate by more than 10% compared to last week".

By combining scenarios, metrics, and checks, you can automate the updating, monitoring and quality control of your Flow.
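
These computations don’t have to be launched by hand: they can also be driven from code, for example at the end of a scenario or from an external script. The sketch below uses the public Python API; the connection details and project key are placeholders, and the compute_metrics and run_checks helpers on the dataset handle should be checked against your DSS version:

    # Sketch: recompute metrics and re-evaluate checks on Orders_filtered.
    # Connection details and project key are placeholders; helper names are
    # assumptions to verify against your DSS version.
    import dataikuapi

    client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_API_KEY")
    dataset = client.get_project("DKU_TUTORIAL_AUTOMATION").get_dataset("Orders_filtered")

    print(dataset.compute_metrics())   # refresh the metric values
    print(dataset.run_checks())        # evaluate the checks against them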

First, let’s set up metrics and checks for the dataset. Navigate to the Status tab of the Orders_filtered dataset. Ensure that the Records count metric is displayed (note that it may not have a value yet).

  • Navigate to the Edit tab.
  • On the Metrics panel, ensure that the Records count metric is automatically computed after each rebuild (click “Yes” in “Auto compute after build”).
  • Navigate to the Checks panel.
  • Here we’ll create a new check to ensure that the number of new customers is above a minimum value.
  • Click Metric value is in a numeric range, then name the new check Minimum new customers.
  • Select Records count as the metric to check, and set 170 as the soft minimum for the number of new customers to acquire over a 30-day period.
  • Click Save.

Setting a soft minimum means the check raises a warning rather than an error, so the dataset is still built while we are alerted that something may be wrong. Click Check to test the check.

Now we need to display the check. Navigate to the Checks tab and choose to display the Minimum new customers check.
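
If you also want to pull the computed value into a report or an external script, the last metric values can be read back through the public Python API. A minimal sketch, where the connection details, project key and metric ID are assumptions:

    # Sketch: read the last computed record count for Orders_filtered.
    # Connection details, project key and the metric ID are assumptions.
    import dataikuapi

    client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_API_KEY")
    dataset = client.get_project("DKU_TUTORIAL_AUTOMATION").get_dataset("Orders_filtered")

    last_values = dataset.get_last_metric_values()
    print(last_values.get_metric_by_id("records:COUNT_RECORDS"))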

Now let’s set up metrics and checks for the model.

  • Go back to the Flow and open (double-click) the High revenue prediction model.
  • Navigate to the Metrics & Status tab of the High revenue prediction model.
  • On the View tab, ensure that AUC (the area under the ROC curve) is displayed.
  • Navigate to the Settings tab, and then the Status checks tab.
  • Here we’ll create a new check to monitor performance of the model over time.
  • Click Metric value is in a numeric range, then name the new check AUC.
  • Select AUC as the metric to check, and set 0.7 as the minimum and 0.9 as the soft maximum for the area under the ROC curve.

Setting a hard minimum ensures that if the model performance degrades too much, the model retraining fails and the new model is not put into production. The soft maximum allows the model to be put into production, while warning us that the model performance may be suspiciously high and should be investigated. Click Check to test the check, then click Save.

Now we need to activate the check. Navigate to the Checks tab of the Metrics & Status tab and ensure that AUC is displayed.

Test it all

Let’s see how well we’ve done in setting up the scenario and monitoring. To do this, we’re going to make some slight modifications to the flow in order to simulate a change to the underlying data.

First, open the Filter recipe. This recipe uses the Dataiku formula language to identify people who newly became customers in the 30 days prior to March 1, 2017. Specifically:

  • The diff() function call computes the difference in days between each order date and March 1, 2017, and we then check whether this difference is less than 30
  • We also check whether this is the first order the customer has made (val('order_number') == 1)

… and if both of these are true, then we know this is a new customer.

Change the formula to look back 30 days from April 1, 2017 by altering the asDate() function call to asDate("2017-04-01T00:00:00.000Z"). In a real-life flow, we would look back from today(), but the CSV files we have available only have data through March 2017.

Now, navigate to the Orders dataset. Click the Settings tab and then click Connection. The dataset specification uses a regular expression to select which files to use as source data; the expression ^/?orders_201.*[^(03)].csv$ excludes the data from March 2017. Change the file specification to /orders.*. This will cause the Orders dataset to use all of the CSV files in the Orders folder as its source data.
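
If the exclusion pattern looks opaque, you can reproduce its behaviour with ordinary Python regular expressions. The file names below are hypothetical stand-ins for the folder’s contents; only the pattern itself comes from the dataset settings:

    # Sketch: see which hypothetical file names the original pattern keeps.
    # The class [^(03)] requires the character just before ".csv" to not be
    # "0", "3", "(" or ")", which is what drops the March 2017 file.
    import re

    pattern = re.compile(r"^/?orders_201.*[^(03)].csv$")
    for name in ["/orders_2017_01.csv", "/orders_2017_02.csv", "/orders_2017_03.csv"]:
        print(name, "->", "kept" if pattern.match(name) else "excluded")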

After a short period (around 2 minutes), Dataiku DSS recognizes that the underlying data has changed, and the scenario automatically runs. After the run, the Checks tab for the Orders_filtered dataset shows that fewer than 170 new customers were added in the last 30 days.

Next steps

Congratulations! You have created your first automation scenario and monitoring checks. See the related information links for more on automation.

See the next tutorial on deploying to production to learn how to put your automation scenarios and monitoring into a production environment.