The EventsAggregator plugin generates aggregated features on a SQL dataset that contains events (i.e. with a date column and some additional features). The generated features can be used in order to train machine learning algorithms.
For example, if you run an e-commerce website and want to predict churn, the plugin will generate features per customer from a dataset of past orders (example columns: timestamp, customer id, product category, product price), such as the frequency of past orders or the percentage of purchases per category. Another example is fraud detection, where the plugin can generate, for each event, features per customer based on that customer's past events.
The plugin can generate more than 20 different types of aggregated features, such as the frequency of the events, recency of the events, the distribution of the features, etc., over several time windows and populations that you can define.
(Note: this plugin requires a SQL dataset as input.)
Author: Dataiku (Du Phan, Joachim Zentici)
How To Use
A step-by-step tutorial is available here.
Each row of your input dataset should correspond to an event. This means it:
- has a timestamp (i.e. has a date field)
- belongs to a group, defined by one or several aggregation keys (e.g. user_id, shop_id, …)
- has a list of input features
Let’s take an example. If you have two aggregation keys, user_id and shop_id, a group is one distinct combination of their values, i.e. one (user_id, shop_id) pair.
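To make the notion of a group concrete, here is a minimal Python sketch of how rows map to groups. The plugin itself runs on SQL; the sample rows, the `price` column and the `split_into_groups` helper are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical events: a timestamp, two aggregation keys and one input feature.
events = [
    {"timestamp": "2023-01-05", "user_id": "u1", "shop_id": "s1", "price": 10.0},
    {"timestamp": "2023-01-09", "user_id": "u1", "shop_id": "s2", "price": 25.5},
    {"timestamp": "2023-02-01", "user_id": "u1", "shop_id": "s1", "price": 7.0},
    {"timestamp": "2023-02-03", "user_id": "u2", "shop_id": "s1", "price": 12.0},
]

aggregation_keys = ["user_id", "shop_id"]


def split_into_groups(rows, keys):
    """Map each distinct combination of key values to its list of events."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return dict(groups)


groups = split_into_groups(events, aggregation_keys)
for key, rows in sorted(groups.items()):
    print(key, len(rows))
```

Here user u1 ends up in two groups, one per shop, because the group is the pair of key values and not the user alone.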
The recipe has two aggregation levels to create the features:
- By group. The features are aggregated by group, and the output has one row per group.
- By event. For each event, all events before it are aggregated together, and the output has one row per event.
You can also filter events after a given date by setting a reference date. All events after this date will be removed.
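The difference between the two aggregation levels, and the effect of a reference date, can be sketched in plain Python as follows. This is an illustration only, not the plugin's SQL: the function names, the sample data and the choice of count and sum as features are assumptions, as is the decision to keep events that fall exactly on the reference date.

```python
from datetime import date

# Hypothetical events for a single group, as (event date, price).
events = [
    (date(2023, 1, 5), 10.0),
    (date(2023, 1, 20), 30.0),
    (date(2023, 2, 10), 20.0),
    (date(2023, 3, 1), 40.0),
]


def aggregate_by_group(rows, reference_date=None):
    """One output row for the whole group; events after reference_date are dropped."""
    kept = [r for r in rows if reference_date is None or r[0] <= reference_date]
    return {"count": len(kept), "sum_price": sum(price for _, price in kept)}


def aggregate_by_event(rows):
    """One output row per event, built from the events that precede it in date order."""
    rows = sorted(rows)
    output = []
    for i, (day, _) in enumerate(rows):
        past = rows[:i]
        output.append(
            {"date": day, "count": len(past), "sum_price": sum(p for _, p in past)}
        )
    return output


by_group = aggregate_by_group(events, reference_date=date(2023, 2, 15))
by_event = aggregate_by_event(events)
print(by_group)
for row in by_event:
    print(row)
```

With the reference date set to 2023-02-15, the event of 2023-03-01 is ignored in the group-level result, while the event-level result has four rows, the first of which has no history at all.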
For the input features used to generate the aggregated features, you can either manually select the categorical and numerical columns or let the plugin automatically infer the feature types.
Additionally, you can choose one or several temporal windows for which the features will be generated among the following options:
- all history of events (default setting)
- if the aggregation mode is “related to a fixed reference date”:
  - Past N weeks
  - Past N months
  - Past N years
- if the aggregation mode is “for all events dates”:
  - Past N rows
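As a rough illustration of the two kinds of windows, the sketch below selects events in the past N weeks before a reference date, and the N rows preceding a given event. The helper names and sample dates are hypothetical, and the exact boundary handling in the plugin (inclusive or exclusive) is not stated here, so the choices below are assumptions. Month and year windows would need calendar arithmetic and are left out.

```python
from datetime import date, timedelta

# Hypothetical event dates for one group, already sorted.
event_dates = [
    date(2023, 1, 1),
    date(2023, 2, 20),
    date(2023, 3, 10),
    date(2023, 3, 28),
]


def past_n_weeks(dates, reference_date, n):
    """Events in the n weeks up to and including the reference date."""
    start = reference_date - timedelta(weeks=n)
    return [d for d in dates if start < d <= reference_date]


def past_n_rows(dates, index, n):
    """The (at most) n events that come just before the event at position index."""
    return dates[max(0, index - n):index]


last_4_weeks = past_n_weeks(event_dates, date(2023, 3, 31), 4)
previous_2_rows = past_n_rows(event_dates, 3, 2)
print(last_4_weeks)
print(previous_2_rows)
```

A time-based window can be empty or arbitrarily full depending on activity, whereas a row-based window always holds at most N events, which is why the latter fits the per-event mode.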
Optionally, while in the “events” aggregation mode, you can choose to use a buffer. In that case, for each event, the aggregation jumps back N events before it starts. For example, in sales forecasting, where we want to predict a user’s transactions in the next 3 weeks, we cannot use the data of the 3 weeks prior to the transaction, so a buffer of 3 weeks is used.
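The buffer idea can be sketched as follows, using the sales forecasting example with a time-based gap of 3 weeks. Whether the plugin expresses the buffer in rows or in time units is not settled by the text above, so the `timedelta` buffer, the function name and the data are assumptions for illustration.

```python
from datetime import date, timedelta

# Hypothetical purchases of one user, as (purchase date, amount).
purchases = [
    (date(2023, 1, 2), 15.0),
    (date(2023, 1, 30), 20.0),
    (date(2023, 2, 20), 5.0),
    (date(2023, 3, 1), 50.0),
]


def history_with_buffer(rows, event_date, buffer):
    """Events usable for event_date: everything older than event_date - buffer."""
    cutoff = event_date - buffer
    return [r for r in rows if r[0] < cutoff]


usable = history_with_buffer(purchases, date(2023, 3, 1), timedelta(weeks=3))
print(usable)
```

For the purchase of 2023-03-01, the purchase of 2023-02-20 falls inside the 3-week buffer and is excluded, so only the two January purchases feed the aggregated features.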
The generated aggregated features are organized into different families as explained below. You will be able to activate each family individually in order to save computation effort.
The recipe takes the following input and outputs:
- An input dataset for which we wish to create events aggregated features.
- An output dataset with the newly created features.
- An output folder containing the feature_aggregator.json file, which holds the recipe parameters as well as distribution information.
Family of Generated Features
Given X an event, X_N its numerical features and X_C its categorical features, we have implemented the following families of features.
- Frequency of event X
- Mean frequency of event X
- Number of days since first event X.
- Recency of event X.
- Time interval of event X:
  - Mean time interval between events X in days.
  - Standard deviation of the time interval between events X in days.
- Delta time interval of event X:
  - Mean delta time interval between events X in days.
  - Standard deviation of the delta time interval between events X in days.
- Number of distinct values of the categorical feature X_C.
- Sum of numerical feature X_N.
- Average of numerical feature X_N.
- Standard deviation of numerical feature X_N.
- Distribution (warning: expensive to calculate, and generates a lot of features: one per distinct value per feature per temporal window):
  - Distribution of the values in a categorical feature X_C:
    - Percentage of value c_1 of the column X_C.
    - Percentage of value c_2 of the column X_C.
- Is it an amount with cents?
- Is it a round_number (like 30000 or 400)?
- Rough_size: what is the size of the amount, as a power of 10 (10, 100, 1000, …)?
- Is it the max/min price?
- Ratio with the max/min price.
- Cross reference count (only in “aggregation by subject and timestamp” mode):
  - Number of distinct rows having the same value. For example: number of users having the same IP_address.
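To give a feel for what some of these features look like, here is a minimal Python sketch that computes a handful of them for one group. It is not the plugin's SQL, and the exact definitions used by the plugin may differ: the feature names, the use of a raw event count in place of a frequency, the population standard deviation, and the interpretation of round_number (a single leading digit followed by zeros) and rough_size (number of digits minus one) are all assumptions.

```python
from collections import Counter
from datetime import date
from statistics import mean, pstdev

# Hypothetical events for one group.
events = [
    {"date": date(2023, 1, 1), "category": "books", "price": 20.0},
    {"date": date(2023, 1, 11), "category": "toys", "price": 19.99},
    {"date": date(2023, 1, 31), "category": "books", "price": 300.0},
    {"date": date(2023, 2, 10), "category": "books", "price": 45.5},
]
reference_date = date(2023, 2, 20)


def group_features(rows, ref):
    """Aggregated features for one group (needs at least two events)."""
    rows = sorted(rows, key=lambda r: r["date"])
    dates = [r["date"] for r in rows]
    # Days between consecutive events.
    intervals = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    prices = [r["price"] for r in rows]
    categories = Counter(r["category"] for r in rows)
    features = {
        "event_count": len(rows),
        "days_since_first_event": (ref - dates[0]).days,
        "recency_days": (ref - dates[-1]).days,
        "mean_interval_days": mean(intervals),
        "std_interval_days": pstdev(intervals),
        "distinct_category": len(categories),
        "sum_price": sum(prices),
        "avg_price": mean(prices),
    }
    # One distribution feature per distinct categorical value.
    for value, n in categories.items():
        features[f"pct_category_{value}"] = n / len(rows)
    return features


def amount_features(amount):
    """Per-amount flags; assumes amount >= 1 with at most two decimals."""
    whole = round(amount * 100) % 100 == 0
    digits = str(int(amount))
    return {
        "has_cents": not whole,
        "is_round_number": whole and len(digits.rstrip("0")) == 1,
        "rough_size": len(digits) - 1,
    }


features = group_features(events, reference_date)
for name, value in features.items():
    print(name, value)
print(amount_features(19.99))
print(amount_features(300.0))
```

The distribution loop shows why that family is flagged as expensive: every distinct categorical value adds one output column, and this is repeated for each temporal window.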