
Events Aggregator

Applies to DSS 5.0 and above | October 05, 2018

Feature Factory is our global initiative for automated feature engineering. It aims to reduce the countless hours data scientists spend building insightful features.

The first outcome of this initiative is the EventsAggregator plugin, which works with data where each row records a specific action, or event, and the timestamp of when it occurred. Web logs, version histories, medical and financial records, and machine maintenance logs are all examples of events data.

The plugin can generate more than 20 different types of aggregated features per aggregation group, such as the frequency of events, their recency, the distribution of feature values, and so on, over several time windows and populations that you define. The generated features can outperform raw features when training machine learning models.
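To make the feature families concrete, here is a toy pandas sketch of two of them, frequency and recency, computed per group. The column names and reference date are illustrative, not the plugin's actual schema or output.

```python
import pandas as pd

# Toy events table: one row per event, with a group key and a timestamp.
events = pd.DataFrame({
    "user_id": ["a", "a", "b"],
    "event_timestamp": pd.to_datetime(
        ["2018-09-01", "2018-09-25", "2018-08-10"]),
})
ref = pd.Timestamp("2018-10-01")  # assumed reference date

# One row per group, with basic timestamp aggregates.
feats = events.groupby("user_id")["event_timestamp"].agg(
    n_events="size", last_event="max", first_event="min")

# Recency: days since the group's most recent event.
feats["recency_days"] = (ref - feats["last_event"]).dt.days

# Frequency: events per day of activity span (clipped to avoid dividing by 0).
span_days = (feats["last_event"] - feats["first_event"]).dt.days
feats["events_per_day"] = feats["n_events"] / span_days.clip(lower=1)
```

The plugin computes many more such statistics per window, but they all follow this group-then-aggregate pattern.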

Events Aggregator Settings

To use the plugin, you need to provide three critical pieces of information:

  1. A column that records the timestamp of when each event in the dataset occurred
  2. One or more columns (aggregation keys) that define the group the event belongs to. For example, in an order log, the customer id provides a natural grouping.
  3. The level of aggregation. The examples below go into greater detail on how to choose the level of aggregation.

    • In the By group case, the features are aggregated by group and the output will have one row per group.
    • In the By event case, for each event, all prior events in its group are aggregated together. The output will contain one row per event.
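The structural difference between the two levels can be sketched in pandas on a toy events table (the column names are illustrative, and the plugin does this work in-database rather than in pandas):

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": ["a", "a", "b", "a", "b"],
    "event_timestamp": pd.to_datetime(
        ["2018-01-01", "2018-02-01", "2018-02-15", "2018-03-01", "2018-04-01"]),
    "price": [10.0, 20.0, 5.0, 30.0, 15.0],
}).sort_values("event_timestamp")

# "By group": one output row per aggregation key.
by_group = events.groupby("user_id").agg(
    event_count=("price", "size"),
    price_sum=("price", "sum"),
    last_event=("event_timestamp", "max"),
)

# "By event": each event row gets aggregates over the events that
# preceded it in the same group, so the output keeps one row per event.
g = events.groupby("user_id")["price"]
by_event = events.copy()
by_event["prior_count"] = g.cumcount()
by_event["prior_price_sum"] = g.cumsum() - events["price"]
```

Note that `by_group` collapses each user to a single row, while `by_event` preserves every row and only looks backward in time, which is what makes the by-event output safe to use as training data without leaking future information.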

Install the Plugin

First we need to install the EventsAggregator plugin. This requires Administrator privileges on the Dataiku DSS instance.

Log in as the Dataiku DSS Administrator, and from the Application menu in the top navigation bar, choose Administration.

Dataiku administration tools, Plugin tab

Navigate to the Plugins tab and click Store. Search for “events aggregator”.

The plugin requires a dedicated code environment. Choose to create it.

How to Generate Features by Group

In this mode, features are aggregated by group and the output will have one row per group.

For example, we may have a dataset recording customer activity on an e-commerce website. Each row corresponds to an event at a specific timestamp. For a given fixed date, we want to predict who’s more likely to churn. So we need to group the input dataset by user. The output dataset will then have one row per user.

For reference, the final flow will look like the following.

Final flow showing use of Events Aggregator plugin

Preparing for the Events Aggregator

The flow begins with two CSV source files. Download the archive containing these files, extract them from the archive, and use them to create two new Uploaded Files datasets. In the user_activity_csv dataset we’ve created, we need to set the storage type of the event_timestamp column to date and the price column to double.

Next, we need to Sync these datasets to SQL datasets, since the plugin requires SQL datasets as input. Now the user_activity SQL dataset is ready for the Events Aggregator.

Applying the Events Aggregator

In the Flow, select +Recipe > Feature factory: events-aggregator to open the plugin recipe. Set user_activity as the input dataset, create user_features as the output dataset, and create the recipe.

We want to aggregate features for each customer, ending with a dataset that has one row per customer, so we define the groups by user_id and select By group as the level of aggregation. We also select event_timestamp as the column that defines when the events occurred.

Events Aggregator recipe

We chose to manually specify which columns should be considered raw input features. These columns are used to generate the output aggregated features.

We chose to add a window of the last 6 months, in order to capture the short-to-mid-term trend, and checked the box to add a window for all history. This means the recipe will generate features both on the entire event history and on the events in the 6 months prior to the reference date.

If we wanted to generate features for each 6-month window going back a year, we would change the Windows of each type setting from 1 to 2.
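A small pandas sketch can show what these window settings mean, assuming a fixed reference date (the names and data are hypothetical; the plugin performs the equivalent computation in SQL):

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": ["a", "a", "a", "b"],
    "event_timestamp": pd.to_datetime(
        ["2018-01-10", "2018-06-15", "2018-09-20", "2018-08-01"]),
    "price": [10.0, 20.0, 30.0, 5.0],
})
ref = pd.Timestamp("2018-10-01")  # assumed reference date

def window_features(df, months_back_start, months_back_end, suffix):
    """Aggregate events whose timestamp falls in (ref - start, ref - end]."""
    lo = ref - pd.DateOffset(months=months_back_start)
    hi = ref - pd.DateOffset(months=months_back_end)
    in_window = df[(df["event_timestamp"] > lo) & (df["event_timestamp"] <= hi)]
    return in_window.groupby("user_id")["price"].agg(
        **{f"count{suffix}": "size", f"sum{suffix}": "sum"})

# Windows of each type = 1: the last 6 months, plus the all-history window.
last_6m = window_features(events, 6, 0, "_6m")
all_hist = events.groupby("user_id")["price"].agg(count_all="size", sum_all="sum")

# Windows of each type = 2 would add the 6 months before that (6-12 months ago).
prev_6m = window_features(events, 12, 6, "_6_12m")
```

Each additional window multiplies the feature count, since every aggregate is recomputed per window.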

Events Aggregator recipe

Machine Learning

The output dataset now contains hundreds of features. We join this dataset with the user_label dataset, split the result into train and test sets, and can then use the new features directly to predict the label in the Visual Machine Learning interface. We deployed a Random Forest to the Flow and used it to score the test set.
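The same modeling step can be sketched outside DSS with scikit-learn. The data below is a synthetic stand-in for the joined features-and-label dataset, not the actual user_features output:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 users, 5 aggregated features, a churn-like label
# that depends (noisily) on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Split, fit a Random Forest, and score the held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

In DSS the split, training, and scoring are handled by the visual Split recipe, the Visual Machine Learning interface, and the Score recipe, respectively.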

Random forest model built using features automatically generated by Events Aggregator recipe

How to Generate Features by Event

In this mode, for each event, all prior events in the same group are aggregated together. The output will contain one row per event.

For example, we may have a dataset recording sensor readings for a fleet of engines. Each row corresponds to an event at a specific timestamp. At any given time, we want to predict how long it will be until the engine fails. So we need to group the input dataset by engine and aggregate by event. The output dataset will then have one row per engine per timestamped event.
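For the engine case, per-event aggregation amounts to an expanding window over each engine's earlier readings. Here is a hedged pandas sketch with hypothetical column names (the real NASA sets contain many sensor and setting columns):

```python
import pandas as pd

# Toy observations: one row per (engine, time) reading.
obs = pd.DataFrame({
    "engine_no": [1, 1, 1, 2, 2],
    "time": [1, 2, 3, 1, 2],
    "sensor_1": [520.0, 522.0, 526.0, 518.0, 519.0],
}).sort_values(["engine_no", "time"])

g = obs.groupby("engine_no")["sensor_1"]

# Expanding statistics over prior events only: shift() excludes the
# current reading, so each row sees only its engine's past.
obs["sensor_1_prev_mean"] = g.transform(lambda s: s.shift().expanding().mean())
obs["sensor_1_prev_max"] = g.transform(lambda s: s.shift().expanding().max())
```

The first reading of each engine has no history, so its rolling features are missing, which mirrors how by-event features behave at the start of every group.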

For reference, the final flow will look like the following. Note: in the flow below, we loaded the data into PostgreSQL tables, so the flow begins with PostgreSQL datasets, rather than Uploaded Files datasets.

Final flow showing use of Events Aggregator plugin

Preparing for Events Aggregation

The flow begins with four datasets: train observations, test observations, and their corresponding label datasets. Both the train and test observation datasets contain sensor measurements of the engines as well as operational settings. The label datasets record how long it will be before a given engine at a given timestamp requires maintenance.

These datasets are based on data simulated by NASA to better understand engine degradation under different operational conditions. We have modified the FD002 sets to include a time column. You can download the archive containing our modified files, extract them from the archive, and use them to create four new Uploaded Files datasets.

Next, we Sync these datasets to SQL datasets, since the plugin requires SQL datasets as input. Now the train_obs and test_obs SQL datasets are ready for the Events Aggregator.

Applying the Events Aggregator

In the Flow, select +Recipe > Feature factory: events-aggregator to open the plugin recipe. Set train_obs as the input dataset, create the output dataset, and create the recipe. Repeat these steps with test_obs as the input.

We want to build rolling features for each engine, ending with a dataset that has one row per event per engine, so we define the groups by engine_no and select By event as the level of aggregation. We also select time as the column that defines when the events occurred.

Events Aggregator recipe

We’ll allow the recipe to automatically select the raw features, and we won’t define any temporal windows.

Machine Learning

The output datasets now contain over 100 features each. We join the observation datasets with the label datasets, and can then use the new features directly to predict the label in the Visual Machine Learning interface. We deployed a Random Forest to the Flow and used it to score the test set.

Random forest model built using features automatically generated by Events Aggregator recipe