DSS 103: Your first Machine Learning model, part 2

This is part 2 of DSS Tutorial 103: Your first Machine Learning model.

Please make sure that you have completed the first part before starting, since we'll be continuing where we left off in part 1.

In this part, we will learn how to actually apply a predictive model to score new records, thereby automating the use of this model as you would in a real application.

More specifically, we will go through the following steps:

  • deploying a model to the Flow
  • using this deployed model to score records from another dataset
  • understanding the different components used by DSS to automate this workflow

What are we going to do?

In the first part, we trained a model to predict the "high revenue potential" of customers whose previous long-term behaviour we had already observed (they are stored in the interactions_history dataset).

Now we have some new customers for whom we only know the first interactions, and we want to predict whether they'll turn out to be "high revenue customers". These records are stored in the interactions_to_score dataset, which does not contain the revenue.

Check the prerequisites

Start by going back to your Tutorial 103 project. Go to the Flow, click on the interactions_history dataset, and click on the LAB button.

You should see the visual analysis with the 6 models you trained. Click on it to open it.

The dataset should be as you left it at the end of part 1, with the corresponding Script. Go to Models, and click on your best model: the last random forest.

Deploy the model

We are now going to Deploy this model to the Flow, where we'll be able to use it to score another dataset.

Naming and describing models

Did you know that you can give a name and description to each individual model? This helps you keep track of your best models. You can also "star" a model to find it more easily.

Click on the Deploy button on the top right:

A new, important popup shows up. It lets you create a new Train recipe. In DSS, train recipes are the way to automatically deploy a model to the Flow, where you can then use it to produce predictions on new records.

Leave the default values for now, and click on the Create button:

You will now be taken back to the Flow view. Two new green items are displayed. The first one is the actual train recipe, and the second one is its output, the model.

Now click on the model icon and look at the right panel.

You have access to some interesting features here. If you choose Open, you will be redirected to a view similar to the Models screen of your previous analysis, but focused only on the model you chose to deploy (the random forest):

Without going into too much detail in this tutorial, you can notice that the model is marked as the Active version. If your data were to evolve over time (which is very likely in real life!), you would be able, from this screen, to retrain your model (by clicking on Action and then Retrain). In that case, new versions of the model would become available, and you would be able to select which version of the model you'd like to use.

Go back to the Flow (hey, did you know that you can type the letter "g" on your keyboard, followed by the letter "f"? g + f is a shortcut to get to the Flow!) and click again on the model output icon: you can see a Retrain button next to the Open one. This is a shortcut to the function described above: you can update the model with new training data and activate a new version.

Finally, the Apply icon is the one we are looking for in order to use the model:

Click on it, and a popup window shows up. This is where you set up a few things:

  • the dataset you want to score (i.e. apply the predictive model to it and get predictions, the "scoring" process). Here it is interactions_to_score
  • a name for the newly produced dataset
  • a place where you want to store the results (an important setting: this is how you would write the results directly into a third-party application database, for instance)

Fill in the values and hit the Create recipe button:

You are now in the scoring recipe.

The threshold is the optimal value computed to maximize a given metric (see part 1). In our case it was set to 0.65: records with a probability above 0.65 will be classified as high value, and those below as low value.
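To make that decision rule concrete, here is a minimal, purely illustrative Python sketch of the thresholding logic (this is not DSS code; only the 0.65 value comes from the tutorial above):

```python
# Illustrative sketch of the scoring recipe's decision rule (not actual DSS code).
THRESHOLD = 0.65  # optimal threshold computed in part 1

def predict_label(proba_true, threshold=THRESHOLD):
    """Classify a record as high value when its predicted probability exceeds the threshold."""
    return proba_true > threshold

print(predict_label(0.72))  # True: classified as a high revenue customer
print(predict_label(0.40))  # False: classified as low value
```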

You can now click on the Run button at the bottom left to score the second dataset.

A few seconds later, you should see Job succeeded.

Go back to the Flow screen to visualize your final workflow:

  • start from the "history data"
  • apply a training recipe
  • get a trained model
  • apply the model to get the scores on a dataset where the target is missing.
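If you are curious what this train-then-score pattern corresponds to in code, here is a minimal scikit-learn sketch of the same idea. It is only an analogy, not what DSS runs internally: the CSV file names, the high_revenue column name, and the model settings are hypothetical stand-ins for the tutorial's datasets and visual recipes.

```python
# Conceptual sketch of the Flow above, using scikit-learn (hypothetical names;
# DSS builds the equivalent pipeline for you through the visual recipes).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# "history data": past customers with an observed target column.
history = pd.read_csv("interactions_history.csv")      # hypothetical export
X_train = history.drop(columns=["high_revenue"])       # hypothetical target name
y_train = history["high_revenue"]

# Train recipe -> trained model.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Scoring recipe: apply the model to records where the target is missing.
to_score = pd.read_csv("interactions_to_score.csv")    # hypothetical export
proba_true = model.predict_proba(to_score)[:, 1]        # probability of the positive class
to_score["prediction"] = proba_true > 0.65              # threshold from part 1
```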

Get our results

We're almost done. Now, remember, we just added a new recipe. This recipe created a new dataset, interactions_to_score_scored. Double-click on it.

Three new columns have been added:

  • proba_false
  • proba_true
  • prediction

The two "proba" columns are the main outputs. Since we are trying to predict a binary outcome (the fact of becoming a high value customer), the model provides a probability, i.e a "mark" between 0 and 1, measuring the likelihood to not become a high value customer (proba_false), and the opposite likelihood to become a high value customer (proba_true). We can only focus on the latter since this is what is of interest to us (and basically because proba_true = 1 - proba_false).

The prediction column is the decision based on the probability and the threshold value of the scoring recipe.

Hence, whenever the proba_true column is above 0.65, DSS will make the prediction "true".
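If you'd like to double-check this relationship yourself, a minimal sketch run from a Python notebook inside DSS could look like the following (it assumes the default output dataset name used above):

```python
# Minimal sketch: load the scored dataset in a DSS Python notebook and check
# that the prediction column follows the 0.65 threshold rule.
import dataiku

df = dataiku.Dataset("interactions_to_score_scored").get_dataframe()

# The two probabilities are complementary: proba_true = 1 - proba_false.
print((df["proba_true"] + df["proba_false"]).describe())

# These two counts should line up: "true" predictions are exactly the rows
# where proba_true exceeds the 0.65 threshold.
print((df["proba_true"] > 0.65).value_counts())
print(df["prediction"].value_counts())
```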

Wow, that was a lot of information! Congrats!

Wrap up

That's it! You now know enough to build your first predictive model, analyze its results, and deploy it. These are the first steps towards a more complex application.

You can head to the User Guide for more specialized tutorials, tips and howtos.

Thank you.