The EventsAggregator plugin generates aggregated features on a dataset that contains events (i.e., rows with a date column and some additional features). The generated features can then be used to train machine learning models.
For example, if you run an e-commerce website and want to predict churn, the plugin will generate per-customer features from a dataset of past orders (example columns: timestamp, customer id, product category, product price), such as the frequency of past orders or the percentage of purchases per category. Another example is fraud detection, where the plugin can generate features per customer for each event, based on past events.
The plugin can generate more than 20 different types of aggregated features, such as the frequency of the events, recency of the events, the distribution of the features, etc., over several time windows and populations that you can define.
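As a rough illustration of what such aggregated features look like, here is a minimal sketch in plain Python. It is not the plugin's implementation: the function name `basic_features`, the column names and the exact definitions (event count, days since the last event, share of events per category) are assumptions chosen for the example.

```python
from collections import Counter
from datetime import datetime

def basic_features(rows, reference_date):
    """Compute a few illustrative aggregates for one customer's events.

    Hypothetical helper: the feature names and definitions are assumptions,
    not the plugin's actual output columns.
    """
    if not rows:
        return None  # nothing to aggregate
    total = len(rows)
    last_event = max(row["timestamp"] for row in rows)
    counts = Counter(row["category"] for row in rows)
    return {
        "event_count": total,
        # Recency: whole days between the last event and the reference date.
        "recency_days": (reference_date - last_event).days,
        # Distribution: share of events falling in each category.
        "category_share": {cat: n / total for cat, n in counts.items()},
    }

orders = [
    {"timestamp": datetime(2024, 1, 5), "category": "books"},
    {"timestamp": datetime(2024, 1, 12), "category": "books"},
    {"timestamp": datetime(2024, 1, 20), "category": "toys"},
]
features = basic_features(orders, reference_date=datetime(2024, 1, 30))
print(features)
```

In practice the same aggregates would be computed once per group and per time window.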
|Author||Dataiku (Du Phan, Joachim Zentici)|
Each row of your input dataset should correspond to an event. This means it:
Let's take an example. If you have two aggregation keys, user_id and shop_id, a group consists of all events that share the same combination of values for these keys (for example, one given user in one given shop).
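The notion of a group can be sketched in a few lines of Python. This is only an illustration with made-up rows; `group_events` is a hypothetical helper, and the `price` column is an assumption added to give the events some content.

```python
from collections import defaultdict

# Hypothetical events; only user_id and shop_id come from the example above.
events = [
    {"user_id": "u1", "shop_id": "s1", "price": 10.0},
    {"user_id": "u1", "shop_id": "s2", "price": 5.0},
    {"user_id": "u1", "shop_id": "s1", "price": 7.5},
    {"user_id": "u2", "shop_id": "s1", "price": 3.0},
]

def group_events(rows, keys):
    """Bucket rows by the tuple of their aggregation-key values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return dict(groups)

groups = group_events(events, ["user_id", "shop_id"])
for key, rows in sorted(groups.items()):
    print(key, len(rows))
```

Here the four events fall into three groups: (u1, s1) with two events, and (u1, s2) and (u2, s1) with one event each.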
The recipe has two aggregation levels to create the features:
You can also filter out recent events by setting a reference date: all events after this date will be removed.
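The effect of a reference date can be illustrated as follows. This is a sketch under assumptions: `filter_by_reference_date` is a hypothetical helper, and keeping events that fall exactly on the reference date is a choice made for the example, not documented plugin behavior.

```python
from datetime import datetime

def filter_by_reference_date(rows, reference_date, date_key="timestamp"):
    """Keep only events at or before the reference date.

    Assumption: the boundary is inclusive; the plugin may treat it differently.
    """
    return [row for row in rows if row[date_key] <= reference_date]

events = [
    {"timestamp": datetime(2024, 1, 10), "amount": 20},
    {"timestamp": datetime(2024, 2, 1), "amount": 35},
    {"timestamp": datetime(2024, 3, 15), "amount": 12},
]
kept = filter_by_reference_date(events, datetime(2024, 2, 1))
print([row["timestamp"].date().isoformat() for row in kept])
```

Only the first two events are kept; the March event is later than the reference date and is dropped.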
For the input features used to generate the aggregated features, you can either manually select the categorical and numerical columns or let the recipe infer the feature types automatically.
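To give an idea of what automatic type inference can mean, here is a simple heuristic sketched in Python. It is an assumption made for illustration and not the plugin's actual logic: a column is treated as numerical when all of its non-missing values are numbers, and as categorical otherwise. The helper `infer_feature_types` and the column names are hypothetical.

```python
from numbers import Real

def infer_feature_types(rows, exclude=()):
    """Label each column 'numerical' or 'categorical' with a naive heuristic."""
    columns = {col for row in rows for col in row} - set(exclude)
    types = {}
    for col in sorted(columns):
        values = [row[col] for row in rows if row.get(col) is not None]
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_numeric = bool(values) and all(
            isinstance(v, Real) and not isinstance(v, bool) for v in values
        )
        types[col] = "numerical" if is_numeric else "categorical"
    return types

rows = [
    {"customer_id": "c1", "category": "books", "price": 12.5, "quantity": 1},
    {"customer_id": "c2", "category": "toys", "price": None, "quantity": 3},
]
feature_types = infer_feature_types(rows, exclude=["customer_id"])
print(feature_types)
```

Selecting the columns manually remains the safer option when a numeric-looking column is really an identifier or a code.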
Additionally, you can choose one or several temporal windows for which the features will be generated among the following options:
Optionally, while in the “events” aggregation mode, you can choose to use a
In that case, for each event, the aggregation jumps back N events before it starts. For example, in sales forecasting, where we want to predict a user’s transactions over the next 3 weeks, we cannot use the data from the 3 weeks prior to each transaction. A buffer of 3 weeks would then be used.
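The 3-week example can be sketched as a time window shifted back by the buffer. This is an illustration under assumptions, not the plugin's implementation: `window_with_buffer` is a hypothetical helper, the buffer is expressed as a duration (as in the sales forecasting example) rather than as a number of events, and the window bounds are chosen as half-open.

```python
from datetime import datetime, timedelta

def window_with_buffer(timestamps, event_time, window, buffer):
    """Return past timestamps inside a window that ends `buffer` before the event.

    The usable interval is [event_time - buffer - window, event_time - buffer).
    """
    end = event_time - buffer
    start = end - window
    return [t for t in timestamps if start <= t < end]

history = [
    datetime(2024, 1, 1),
    datetime(2024, 1, 15),
    datetime(2024, 2, 5),
    datetime(2024, 2, 20),  # falls inside the 3-week buffer
    datetime(2024, 2, 28),  # falls inside the 3-week buffer
]
usable = window_with_buffer(
    history,
    event_time=datetime(2024, 3, 1),
    window=timedelta(weeks=4),
    buffer=timedelta(weeks=3),
)
print([t.date().isoformat() for t in usable])
```

With a 4-week window and a 3-week buffer, only the events of January 15 and February 5 are used: the two most recent events are too close to the current event, and the first one is older than the window.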
The generated aggregated features are organized into different families, as explained below. You can activate each family individually to save computation effort.
feature_aggregator.json with the information of the different recipe parameters as well as distribution information.
X an event,
X_N its numerical features and
X_C its categorical features, we have implemented the following families of features.