DSS 103: Your first Machine Learning model

As of DSS 4.0.8, this tutorial is outdated. Please head over to the new Machine Learning tutorial.

In this tutorial, you will create your first machine learning model. We will analyze the historical orders data of a fictional T-shirt-making company called “Haiku T-Shirt”.

This is a two-part tutorial:

  • First, we’ll create and improve your first model
  • Then, we’ll deploy this predictive model to score new records, like in a real application

We will learn step-by-step how to predict whether a new customer will become a high-value customer after one year, based on the information gathered during their registration.

On our way through this hands-on scenario, we will go through the following concepts of Dataiku DSS:

  • how to build a predictive model
  • how data enrichment can improve the prediction accuracy

Let’s get started!


You will be more comfortable if you have completed Tutorial 101 and Tutorial 102 before beginning this one!

If you have not already done so, start the 103 tutorial in DSS by clicking on the “DSS Tutorials” button and selecting Tutorial 103.

Creating the variable of interest

Our project comes preloaded with a dataset called interactions_history containing past users’ interactions and the yearly income they generated.

The dataset has the following columns:

  • user_id
  • birth
  • country
  • page_visited (number of visited pages on the website during the first visit)
  • first_item (price of the first purchased item)
  • gender
  • campaign (whether the user came as part of a marketing campaign)
  • high_revenue (after one year, was this a high-revenue customer?)

Each line corresponds to a customer.

The interactions_history dataset with a row highlighted.

Our goal is to predict (i.e. guess) whether the customer will become a “high revenue” customer, based on the money spent during their very first year of activity. If we could predict this correctly, we could for instance assess the quality of cohorts of new users, or effectively drive acquisition campaigns and channels.

Let’s go! Click on the LAB button from the interactions_history dataset view and create a new visual analysis:

The interactions_history dataset with the Lab button indicated.

The “target” variable, i.e. our variable of interest, has already been computed: it is the high_revenue variable.

If you switch to the Charts tab, you can easily have a look at the distribution of your target variable. Drag the high_revenue column into the X axis area and Count of records into the Y axis area, and a bar chart is displayed.

The Charts tab of the interactions_history visual analysis, with a bar chart of high_revenue customers.

As you can see, there are fewer high revenue customers. Now let’s try to predict if a customer has a high revenue or not.

Different kinds of modeling tasks

Having a "target" variable, i.e. a variable you are trying to predict, means you will use supervised learning algorithms. The nature of the target variable determines the exact kind of task.

Regression is used to predict a real-valued quantity (e.g. a duration, a quantity, an amount spent…).

Binary classification is used to predict a binary variable (e.g. presence/absence, yes/no…).

Multiclass classification is used to predict a variable with a finite set of values (red/blue/green, small/medium/big…).

You may also have no target variable at all; in this case, unsupervised learning techniques are used.

Dataiku DSS lets you address both supervised and unsupervised problems.
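As a rough illustration of how the target's nature maps to a task, here is a small sketch (the heuristic is made up for illustration; it is not how DSS actually decides):

```python
import pandas as pd

def guess_task(target: pd.Series) -> str:
    """Rough heuristic mapping a target column to a modeling task.
    Illustrative only."""
    if target.nunique() == 2:
        return "binary classification"    # e.g. high_revenue: yes/no
    if pd.api.types.is_numeric_dtype(target):
        return "regression"               # real-valued quantity
    return "multiclass classification"    # finite set of labels

print(guess_task(pd.Series(["yes", "no", "no", "yes"])))  # binary classification
print(guess_task(pd.Series([12.5, 30.0, 7.2])))           # regression
print(guess_task(pd.Series(["red", "blue", "green"])))    # multiclass classification
```

Since high_revenue takes only two values, our problem is a binary classification task.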

Predicting whether a customer will be of high value

In the Script tab, click on the high_revenue column header, then click on “Create Prediction model…”:

Creating a prediction model in interactions_history visual analysis.

Click CREATE in the popup that appears, then click TRAIN.

Creating a prediction model in interactions_history visual analysis.

DSS guesses the best preprocessing to apply to your dataset before applying the machine learning algorithms.

A few seconds later, you will be presented with a new screen, summarizing the results of different algorithms that were tested against your dataset. By default, DSS will compare 2 classes of algorithms:

  • a simple linear model (logistic regression)
  • a more complex ensemble model (random forest)

Training overview for logistic regression and random forest models

A few important pieces of information are shown here:

  • the type of model
  • a synthetic performance measure; here the “Area Under the ROC Curve”, or AUC, is displayed
  • a summary of the importance of the variables vs. your target variable

The AUC measure is handy: the closer to 1, the better the model. Here the Random Forest model seems to be the most accurate. Click on it, and you will be taken to the main Results page for this specific model.

Random forest model output summary
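To make the AUC more concrete, here is a minimal sketch of its rank-based interpretation: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The labels and probabilities below are made up for illustration:

```python
def auc_by_pairs(y_true, y_prob):
    """AUC as the fraction of (positive, negative) pairs where the
    positive gets the higher score; ties count half. Illustrative only --
    real libraries compute this from the ROC curve."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1, 0, 1]            # 1 = high-revenue customer
y_prob = [0.1, 0.6, 0.8, 0.4, 0.3, 0.9]
print(auc_by_pairs(y_true, y_prob))    # 8 of 9 pairs ranked correctly: ~0.889
```

A model that ranks every positive above every negative would reach an AUC of 1; random guessing sits around 0.5.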

Understanding prediction quality and model results

The Summary tab showed an AUC value of about 0.952, which is already very good. (Note: the actual figure may vary slightly.)

To get a better understanding of your model results, DSS also offers several different outputs.

Going down the list in the left panel, you will find a first section called Interpretation, showing information about the contribution of the different variables to the model. Keep in mind that the values here are algorithm-dependent (for a linear model you’ll find the model’s coefficients; for tree-based methods this is related to the number of splits on a variable, weighted by the depth of each split in the tree), but they provide very useful information:

Random forest variable importance chart

We notice that some variables seem to have a strong relationship with being a high-value customer. Notably, the value of the first item purchased seems to be a good indicator.
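This kind of variable-importance chart can be reproduced on toy data with scikit-learn's random forest (the feature names and the data-generating rule below are made up; this is only a sketch of the mechanism, not the tutorial's actual model):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
# Toy data loosely mimicking the tutorial: first_item drives the target,
# pages_visited is pure noise (both names and the rule are assumptions)
first_item = rng.uniform(5, 100, n)
pages_visited = rng.integers(1, 20, n).astype(float)
high_revenue = (first_item > 60).astype(int)

X = np.column_stack([first_item, pages_visited])
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, high_revenue)

# The informative feature dominates the importance scores
print(dict(zip(["first_item", "pages_visited"], model.feature_importances_)))
```

The noise feature gets near-zero importance, mirroring how the chart in DSS singles out the variables that actually carry signal.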

Following the Interpretation section, you will find a Performance section.

The first two tabs are related. The Confusion Matrix compares the actual values of the target variable with the predicted values (hence counts such as false positives and false negatives…) along with some associated metrics: precision, recall, f1-score. A machine learning model usually outputs a probability of belonging to one of the two groups, and the actual predicted value depends on which cut-off threshold we decide to use on this probability (i.e. at which probability we decide to classify a customer as high-value).

The Confusion Matrix is shown for a given threshold, which you can change using the slider at the top, while the Decision Chart represents precision, recall, and f1 score for many different cut-offs:

Random forest confusion matrix

Random forest decision chart
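The effect of moving the threshold slider can be sketched in a few lines (the labels and probabilities are made up; this is not DSS's actual computation):

```python
def confusion_at(y_true, y_prob, threshold):
    """Confusion-matrix counts and metrics for a probability cut-off."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(t == 1 and y == 1 for t, y in zip(y_true, y_pred))
    fp = sum(t == 0 and y == 1 for t, y in zip(y_true, y_pred))
    fn = sum(t == 1 and y == 0 for t, y in zip(y_true, y_pred))
    tn = sum(t == 0 and y == 0 for t, y in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}

y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.6, 0.8, 0.4, 0.3, 0.9]
# Raising the threshold trades recall for precision, like the DSS slider
print(confusion_at(y_true, y_prob, 0.5))
```

At a 0.5 cut-off here, one positive is missed (a false negative) and one negative is wrongly flagged (a false positive), giving precision and recall of 2/3 each.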

The next two tabs, Lift charts and ROC curve, are perhaps the most useful visual aids for assessing the performance of your model. While there is of course a longer story about their construction and interpretation, for now remember that in both cases, the steeper the curve at the beginning of the graph, the better the model.

In our example again, the results look pretty good:

Random forest lift charts

Random forest ROC curve

Finally, the Density chart is a bit more specific: it shows the distribution of the predicted probability of being a high-value customer, compared across the two actual groups. A good model separates the two curves as much as possible, as we can see here:

Random forest density chart

The last section, Model Information, is a recap of how the model was built. If you go to the Features tab, you will notice some interesting things:

Information about feature handling in random forest model

By default, all the available variables except “user_id” have been used to predict our target variable. DSS rejected “user_id” because it was detected as a unique identifier, which is not helpful for predicting high-revenue customers. Similarly, a feature like the raw birth date is not really useful in a predictive model: it will not allow the model to generalize well to new records. We may want to refine the settings of the model.

Tuning the settings of a model

To change the way models are built, go back to the models list page by clicking on the back Models link, and hit the Settings button:

Prediction model training overview with Settings button highlighted

To address the issue about how we use the variables, proceed directly to the Features tab. Here DSS will let you tune different settings.

The Role of a variable (or feature) determines whether it is used (Input) or not used (Reject) in the model. Here, we want to remove the birth date from the model. Click on birth and hit the Reject button:

Prediction model features, manually rejecting a feature

The Type of the variable is very important:

  • Numerical variables are real-valued ones, i.e. ones for which statistics like averages and standard deviations make sense
  • Categorical variables are the ones storing nominal values: red/blue/green, a zip code, a gender… You will also often encounter variables that look numerical but are in fact categorical, for instance when an id is used in lieu of the actual value.
  • Lastly, Text is meant for raw blocks of textual data, such as a Tweet or a customer review. DSS lets you use raw text directly in your predictive model.

Each type can be handled differently. For instance, the numerical variables page_visited and first_item have been automatically normalized using rescaling (meaning the values are transformed to lie mostly around 0). You can disable this behavior by selecting both names in the list and clicking the No rescaling button:

Prediction model features, changing the rescaling on multiple features at once
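As a sketch of what such a rescaling does to a numerical column (the exact method DSS applies is configurable; z-score standardization shown here is one common choice):

```python
def rescale(values):
    """Standard (z-score) rescaling: subtract the mean, divide by the
    standard deviation, so values end up centred around 0. Sketch only."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

first_item = [10.0, 20.0, 30.0, 40.0]
scaled = rescale(first_item)
print(scaled)  # mean ~0 and standard deviation ~1 after rescaling
```

Rescaling matters mainly for linear models, where features on very different scales would otherwise get incomparable coefficients; tree-based models are largely insensitive to it.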

Categorical values are a bit trickier to manage. Since the underlying engine expects all values to be numeric, these variables need to be transformed first. One way to do that is to dummy-encode them (a.k.a. creating indicator variables, or vectorization): a new column is created for each value of the initial column. This is the default behavior in DSS.
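Dummy-encoding can be sketched with pandas (a stand-in for what the DSS engine does internally; the country codes are made up):

```python
import pandas as pd

# One indicator column per value of the original categorical column
df = pd.DataFrame({"country": ["FR", "US", "FR", "DE"]})
dummies = pd.get_dummies(df["country"], prefix="country")

print(dummies.columns.tolist())  # ['country_DE', 'country_FR', 'country_US']
print(dummies.sum().tolist())    # [1, 2, 1]: occurrences per country
```

Each row now has a 1 in exactly one of the indicator columns, which is why a column with many distinct values can inflate the feature count quickly.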

For example, look at the country variable: it has 132 possible values, meaning that the 32 least frequent values are combined into an “Others” category. You can make sure DSS includes all the values by increasing the Max. Nb. Categories setting to 132:

Prediction model features, changing the maximum number of features for a categorical feature

With these settings done, you can now click on Train to rebuild the models:

Prediction model, training two models

The performance decreases slightly, but this is the price of better robustness:

Training overview for logistic regression and random forest models

Increasing accuracy with feature engineering

A reliable way to improve a predictive model’s accuracy is to spend time on the “feature engineering” step, i.e. augmenting your dataset with new, derived variables. For instance, we dropped the birth date variable from our dataset, but we could have used it to derive the age of the customer. To do so, go back to the Script tab:

Training overview for logistic regression and random forest models

To compute the age of the customer, we first need to parse the date stored in the birth column into a proper format. From the column header, click on Parse date and select the most suitable format:

Visual analysis script, parse birthdate

From the newly created birth_parsed column header, you can now use the Compute time since processor to get the age of the customer. In the step that appears in the left panel, select Years instead of Days and name the output column age:

Visual analysis script, compute time since birth
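The two processors amount to something like the following pandas sketch (the date format, the example birth dates, and the fixed reference date are all assumptions, chosen so the output is reproducible):

```python
import pandas as pd

df = pd.DataFrame({"birth": ["1985-03-14", "1990-11-02", "1978-07-30"]})

# "Parse date": turn the raw strings into proper datetimes
df["birth_parsed"] = pd.to_datetime(df["birth"], format="%Y-%m-%d")

# "Compute time since", in years: integer division by 365 is a rough
# approximation that ignores leap days
now = pd.Timestamp("2017-01-01")  # fixed "now" for reproducibility
df["age"] = (now - df["birth_parsed"]).dt.days // 365

print(df["age"].tolist())  # [31, 26, 38]
```

Unlike the raw birth date, an age in years is a quantity the model can compare across customers, which is what lets it generalize to new records.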

Drop the birth_parsed column, and switch back to the Models tab:

Visual analysis script, Models tab highlighted

Under Settings, click on the Features tab again and make sure the new age column is there. Set its Rescaling strategy to No rescaling to make it consistent with the other variables:

Prediction model settings, manually changing rescaling for age feature

When done, you can train your model again using this new variable by clicking on the Train button:

Train prediction model

The resulting new Random Forest largely beats the previous one, the AUC value is now 0.961 (vs. 0.952 previously)!

Note that you have now trained several models, and all the results probably no longer fit on your screen. To see all your models at a glance, switch to the Table view:

Training overview for multiple prediction models, table view

You now have access to a summarised view of all your models. Isn’t it nice?

Wrap up

Congratulations, you just built your first predictive model using DSS!

If you want to see how far you can go with your model, or how you can actually use it in an automated way to score new data as it arrives, follow part 2.