Let’s get started!
In this tutorial, you will create your first machine learning model by analyzing the historical customer records and order logs from Haiku T-Shirts.
This is a two-part tutorial:
- First, we’ll create and improve your first model.
- Then, we’ll deploy this predictive model to score new records, like in a real application.
The goal of this tutorial is to predict whether a new customer will become a high-value customer, based on the information gathered during their first purchase.
This tutorial assumes that you have completed Tutorial: From Lab to Flow prior to beginning this one!
From the Dataiku DSS home page, click on the DSS Tutorials button in the left pane, and select Tutorial: Machine Learning. Click on Go to Flow. In the flow, you see the steps used in the previous tutorials to create, prepare, and join the customers and orders datasets.
Additionally, there is a dataset of “unlabeled” customers representing the new customers that we want to predict. These customers have been joined with the orders log and prepared in much the same way as the historical customer data.
Based upon the joined customer and order data, our goal is to predict (i.e. guess) whether the customer will become a “high revenue” customer. If we can predict this correctly, we could assess the quality of the cohorts of new users or more effectively drive acquisition campaigns and channels.
In the flow, select the customers_labeled dataset and click on the LAB button to create a new visual analysis. Give the analysis the more descriptive name High revenue analysis.
Our labeled dataset contains personal information about each customer, along with their device and location. The last column, high_revenue, is a flag for customers generating a lot of revenue based on their purchase history. It will be used as the target variable of our modeling task.
Now let’s build our first model!
Click on the Models tab in the visual analysis. A modal appears to choose the type of modeling you want to perform.
Different kinds of modeling tasks
Prediction models are supervised learning algorithms, i.e. they are trained on past examples for which the actual values (the target column) are known. The nature of the target variable determines the kind of prediction task.
- Regression is used to predict a real-valued quantity (e.g. a duration, a quantity, an amount spent...).
- Two-class classification is used to predict a boolean quantity (e.g. presence / absence, yes / no...).
- Multiclass classification is used to predict a variable with a finite set of values (red/blue/green, small/medium/big...).
Clustering models infer a function that describes hidden structure in "unlabeled" data. These unsupervised learning algorithms group similar rows based on their features.
Here, we want to predict high_revenue. Let us choose the Prediction option, and select the high_revenue variable. Dataiku DSS provides various templates to create models depending on what you want to achieve (for example, either using machine learning to get some insights on your data or creating a highly performant model). Let us keep the default Balanced template on the Python backend and click Create in the popup. Click Train on the next screen.
Dataiku guesses the best preprocessing to apply to the features of your dataset before applying the machine learning algorithms.
A few seconds later, Dataiku presents a summary of the results of this modeling session. By default, 2 classes of algorithms are used on the data:
- a simple generalized linear model (logistic regression)
- a more complex ensemble model (random forest)
The model summaries contain some important information:
- the type of model
- a performance measure; here the Area Under the ROC Curve or AUC is displayed
- a summary of the most important variables in predicting your target
The AUC measure is handy: the closer to 1, the better the model. Here the Random forest model seems to be the most accurate. Click on it, and you will be taken to the main Results page for this specific model.
The Summary tab showed an AUC value of about 0.762, which is pretty good for this type of application. (Your actual figure might vary slightly, due to differences in how rows are randomly assigned to training and testing samples.)
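To make the AUC figure less abstract, here is a minimal sketch of one common way to compute it: the share of (positive, negative) row pairs in which the positive row gets the higher score, with ties counted as one half. The function name `roc_auc` and the sample data are made up for illustration; this is not how DSS computes the value internally.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney formulation).

    Returns the probability that a randomly chosen positive row receives a
    higher score than a randomly chosen negative row, counting ties as 0.5.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative row")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = high revenue) and predicted probabilities
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

A value of 0.5 corresponds to random scoring, and 1.0 to a model that ranks every positive row above every negative one.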
To get a better understanding of your model results, Dataiku DSS also offers several different outputs.
Going down the list in the left panel, you will find a first section called Interpretation, showing information about the contribution of the different variables in the model. Keep in mind that the values here are algorithm dependent (e.g. for a linear model you’ll find the model’s coefficients, while for tree-based methods the values are related to the number of splits on a variable, weighted by the depth of each split in the tree), but this provides very useful information.
We notice that some variables seem to have a strong relationship with being a high-value customer. Notably, the age at the time of first purchase seems to be a good indicator.
Following the Interpretation section, you will find a Performance section.
The Confusion Matrix compares the actual values of the target variable with the predicted values (hence values such as false positives, false negatives…) and shows some associated metrics: precision, recall, f1-score. A machine learning model usually outputs a probability of belonging to one of the two groups, and the actual predicted value depends on which cut-off threshold we decide to use on this probability (i.e. at which probability we decide to classify a customer as a high-value one).
The Confusion Matrix is computed for a single threshold, which you can change using the slider at the top.
The Decision Chart represents precision, recall, and f1 score for all possible cut-offs:
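As a rough illustration of how the cut-off threshold drives these numbers, the sketch below counts the four confusion-matrix cells at a given threshold and derives precision, recall and f1-score from them. The helper names and data are hypothetical, and the convention that a probability equal to the threshold counts as positive is an assumption, not something taken from DSS.

```python
def confusion_counts(labels, probabilities, threshold):
    """Return (TP, FP, FN, TN), predicting positive when probability >= threshold."""
    tp = fp = fn = tn = 0
    for y, p in zip(labels, probabilities):
        predicted = 1 if p >= threshold else 0  # assumed tie-breaking convention
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Derive the three metrics, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Made-up labels (1 = high revenue) and predicted probabilities
labels = [1, 0, 1, 0, 0, 1, 1, 0]
probabilities = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8, 0.3, 0.4]

# Sweeping a few thresholds mimics reading points off the Decision Chart
for threshold in (0.25, 0.5, 0.75):
    tp, fp, fn, tn = confusion_counts(labels, probabilities, threshold)
    precision, recall, f1 = precision_recall_f1(tp, fp, fn)
    print(f"threshold={threshold:.2f} TP={tp} FP={fp} FN={fn} TN={tn} "
          f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Raising the threshold generally trades recall for precision, which is why the right cut-off depends on the relative cost of false positives and false negatives.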
The next two tabs, Lift charts and ROC curve, are visual aids, perhaps the most useful ones, to assess the performance of your model. Their construction and interpretation deserve a longer explanation, but for now you can remember that in both cases, the steeper the curves are at the beginning of the graphs, the better the model.
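One way to see why a steep start is a good sign is to compute the cumulative gain and lift by hand. The sketch below ranks rows by predicted score and measures what share of all positives falls in the top fraction of rows. The function names and data are invented for illustration, and the exact construction of the DSS charts may differ.

```python
def cumulative_gain(labels, scores, fraction):
    """Share of all positive rows found in the top `fraction` of rows by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(fraction * len(ranked)))
    positives_in_top = sum(label for _, label in ranked[:top_n])
    return positives_in_top / sum(labels)


def lift(labels, scores, fraction):
    """Ratio of the gain to what picking the same number of rows at random would yield."""
    top_n = max(1, round(fraction * len(labels)))
    return cumulative_gain(labels, scores, fraction) / (top_n / len(labels))


# Made-up labels (1 = high revenue) and predicted scores for ten customers
labels = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.40, 0.85, 0.70, 0.10, 0.80, 0.30, 0.20, 0.35, 0.05]

for fraction in (0.2, 0.5, 1.0):
    print(fraction, cumulative_gain(labels, scores, fraction), lift(labels, scores, fraction))
```

In this toy data, the top 20% of rows already contain half of the positives, so the curve rises quickly at the start before flattening out.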
In our example again, the results look pretty good:
Finally, the Density chart shows the distribution of the predicted probability of being a high-value customer, compared across the two actual groups. A good model separates the 2 curves as much as possible, as we can see here:
The last section, Model Information, is a recap of how the model has been built. If you go to the Features tab, you will notice some interesting things:
By default, all the available variables except customerID have been used to predict our target. Dataiku DSS has rejected customerID because this feature was detected as a unique identifier and was not helpful to predict high-value customers. A feature like the geopoint is probably not really useful in a predictive model, because it will not allow the model to generalize well on new records. We may want to refine the settings of the model.
To change the way models are built, go back to the models list page by clicking on the Models link, and going to the Design page.
To address the issue about how we use the variables, proceed directly to the Features Handling tab. Here DSS will let you tune different settings.
The Role of a variable (or feature) determines whether it is used (Input) or not used (Reject) in the model. Here, we want to remove ip_address_geopoint from the model. Click on ip_address_geopoint and hit the Reject button:
The Type of the variable is very important to define how it should be preprocessed before it is fed to the machine learning algorithm:
- Numerical variables are real-valued ones. They can be integer or numerical with decimals.
- Categorical variables are the ones storing nominal values: red/blue/green, a zip code, a gender… You will also often encounter variables that look Numerical but are in fact Categorical. This is the case, for instance, when an id is used in lieu of the actual value.
- Text is meant for raw blocks of textual data, such as a Tweet, or customer review. Dataiku DSS is able to handle raw text features with specific preprocessing.
Each type can be handled differently. For instance, the numerical variables age_first_order and pages_visited_avg have been automatically normalized using a standard rescaling (this means that the values are rescaled to have a mean of 0 and a variance of 1). You can disable this behavior by selecting both names in the list again and clicking the No rescaling button.
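For reference, standard rescaling can be sketched in a few lines: subtract the mean and divide by the standard deviation. The function name and the sample ages are made up, and using the population standard deviation is an assumption here, since the tutorial does not say which variant DSS applies.

```python
from statistics import mean, pstdev


def standard_rescale(values):
    """Rescale values to mean 0 and variance 1 (population standard deviation)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map every value to 0
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical values for age_first_order
ages = [22, 35, 47, 51, 30]
scaled = standard_rescale(ages)
print([round(v, 3) for v in scaled])
```

Rescaling mostly matters for algorithms that are sensitive to feature magnitude, such as regularized linear models. Tree-based models like random forests are largely unaffected by it.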
After altering these settings, you can now click on Train and build new models:
The performance of the random forest model slightly increases:
Under Design, click the Feature generation tab. We can automatically generate new numeric features using Pairwise linear combinations and Polynomial combinations of existing numeric features. Sometimes these generated features can reveal unexpected relationships between the inputs and target.
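To give an idea of what such generated features look like, the sketch below builds the sum, difference and product of every pair of numeric columns in a row. Treating "linear combinations" as sums and differences and "polynomial combinations" as products is an assumption made for illustration, as are the function name and the generated column names. The features DSS actually generates may differ.

```python
from itertools import combinations


def generate_pairwise_features(row, numeric_columns):
    """Build illustrative pairwise features from the numeric columns of one row."""
    generated = {}
    for a, b in combinations(numeric_columns, 2):
        generated[f"{a}+{b}"] = row[a] + row[b]  # assumed linear combination
        generated[f"{a}-{b}"] = row[a] - row[b]  # assumed linear combination
        generated[f"{a}*{b}"] = row[a] * row[b]  # assumed polynomial combination
    return generated


# Hypothetical values for one customer
row = {"age_first_order": 30.0, "pages_visited_avg": 4.5}
features = generate_pairwise_features(row, ["age_first_order", "pages_visited_avg"])
for name, value in features.items():
    print(name, value)
```

Note that the number of generated features grows quadratically with the number of numeric columns, so this kind of generation is best reserved for a handful of inputs.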
When done, you can train your model again by clicking on the Train button:
The resulting Random Forest beats the previous one (the AUC value is now higher than in either of the first two models), possibly because of the changes we made to the handling of features. Looking at the variable importance chart for the latest model, the importance is spread across campaign and the features automatically generated from age_first_order and pages_visited_avg, so the generated features may have uncovered some previously hidden relationships. On the other hand, the increase in AUC isn’t huge, so it may be best to be grateful for the boost without reading too much into it.
Now that you have trained several models, all the results probably do not fit your screen anymore. To see all your models at a glance, you can switch to the Table view, which can be sorted on any column. Here we have sorted on ROC AUC.
Congratulations, you just built your first predictive model using DSS!
If you want to see how far you can go with your model, or how you can actually use it in an automated way to score new data as they arrive, follow part 2.