This is part 2 of Tutorial: Machine Learning. Please make sure that you have completed the first part before starting, since we’ll be continuing where we left off.
In this part, we will learn how to use a predictive model to score new records, automating the use of this model as you would for a real application.
We will go through the following steps:
- deploying a model to the Flow
- using this deployed model to score records from another dataset
- understanding the different components used by DSS during this workflow
What are we going to do?
In the first part, we trained a model to predict the “high revenue potential” of customers whose long-term behavior we had already observed (they were stored in the customers_labeled dataset).
Now, we have some new customers, for whom we only have the first purchase, and we want to predict whether they’ll turn out to be “high revenue customers”. This is the customers_unlabeled_prepared dataset. In this dataset, we have no indication of whether they are high revenue.
Start by going back to your Tutorial: Machine Learning project. Go to the Flow, click on the customers_labeled dataset, and click on the LAB button.
The Visual Analysis Lab should be as you left it at the end of part 1, with the corresponding Script. Open the Models tab. You should see the 6 models you trained. Click on your best model: the last random forest.
Naming and describing models
Did you know that you can give a name and a description to each individual model? This helps you keep track of your best models. You can also "star" a model to find it more easily.
We are now going to deploy this model to the Flow, where we’ll be able to use it to score another dataset. Click on the Deploy button on the top right.
An important popup appears. It lets you create a new Train recipe. In Dataiku DSS, Train recipes are the way to deploy a model to the Flow, where you can then use it to produce predictions on new records.
We’re not going to deploy a lot of models, so let’s change the model name to a more manageable Random Forest, and click on the Create button:
You will now be taken back to the Flow. Two new green items are displayed. The first one is the actual train recipe, and the second one is its output, the model. Now click on the model icon and look at the right panel.
You have access to some interesting features here. If you choose Open, you will be redirected to a view, similar to the Models one from your previous analysis bench, but focusing only on the model you chose to deploy (the random forest):
Without going into too much detail in this tutorial, notice that the model is marked as the Active version. If your data were to evolve over time (which is very likely in real life!), you would be able to retrain your model from this screen (by clicking on Actions and then Retrain). In that case, new versions of the model would become available, and you would be able to select which version you’d like to use.
Go back to the Flow (hey, did you know that you can type the letter “g” on your keyboard, followed by the letter “f”? g + f is a shortcut to go to the Flow!). Click again on the model output icon: you can see a Retrain button close to the Open one. This is a shortcut to the function described above: you can update the model with new training data, and activate a new version.
Finally, the Score icon is the one we are looking for to use the model:
Click on it, and a popup window shows up. This is where you set up a few things:
- the dataset you want to score (i.e. apply a predictive model in order to get predictions, the “scoring” process). Here it is customers_unlabeled_prepared.
- the Prediction Model you want to use (already selected)
- a name for the output dataset
- the connection you want to store the results into
Fill in the values and hit the Create recipe button:
You are now in the scoring recipe.
The threshold is the optimal value computed in part 1 to maximize a given metric. In our case, it was set to 0.625. Rows with a probability above the threshold will be classified as high value, and rows below it as low value.
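To make the idea of an "optimal" threshold more concrete, here is a minimal sketch in plain Python of how a threshold could be picked by trying several candidates and keeping the one that maximizes a metric. It is only an illustration, not how Dataiku DSS computes its threshold: the choice of the F1 score as the metric, the validation data, the candidate grid, and the function names `f1_at_threshold` and `best_threshold` are all assumptions made for this example.

```python
def f1_at_threshold(probas, labels, threshold):
    """F1 score when rows with a probability above `threshold` are predicted True."""
    tp = fp = fn = 0
    for p, y in zip(probas, labels):
        pred = p > threshold
        if pred and y:
            tp += 1  # correctly predicted high value
        elif pred and not y:
            fp += 1  # predicted high value, actually low value
        elif not pred and y:
            fn += 1  # missed a high value customer
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(probas, labels, candidates):
    """Return the candidate threshold with the highest F1 (the first one wins ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(probas, labels, t))


# Hypothetical validation data: predicted probability of True, and the true label.
probas = [0.10, 0.30, 0.45, 0.60, 0.65, 0.80, 0.90]
labels = [False, False, False, False, True, True, True]

# Hypothetical grid of candidate thresholds: 0.025, 0.05, ..., 0.975.
candidates = [i / 40 for i in range(1, 40)]
threshold = best_threshold(probas, labels, candidates)
print(threshold, f1_at_threshold(probas, labels, threshold))
```

With a different metric (for instance accuracy, or a cost-based score), the same sweep would generally select a different threshold, which is why the threshold is tied to the metric chosen in part 1.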
You can now click on the Run button at the bottom left to score the second dataset.
A few seconds later, you should see Job succeeded.
Go back to the Flow screen, where you can see your final workflow:
- start from the “historical data”
- apply a training recipe
- get a trained model
- apply the model to get the scores on the unlabeled dataset.
We’re almost done! Double-click on the customers_unlabeled_scored dataset to see what the scored results look like.
Three new columns have been created at the right:
The two “proba” columns are of particular interest. The model provides a probability, i.e. a score between 0 and 1, measuring the likelihood of not becoming a high value customer (proba_False), and the opposite likelihood of becoming a high value customer (proba_True). We can focus only on the latter, since this is what interests us (and because proba_True = 1 - proba_False).
The prediction column is the decision based on the probability and the threshold value of the scoring recipe.
Hence, whenever proba_True is above 0.625, Dataiku DSS predicts “True”.
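The relationship between the three columns can be sketched in a few lines of plain Python. This is an illustration of the decision rule described above, not Dataiku DSS code: the helper name `score_row` and the sample probabilities are hypothetical, and treating a probability exactly equal to the threshold as “False” is an assumption, based on the word “above”.

```python
THRESHOLD = 0.625  # threshold value shown in the scoring recipe of this tutorial


def score_row(proba_true, threshold=THRESHOLD):
    """Build the three scoring columns for one row from its probability of True."""
    return {
        "proba_False": 1.0 - proba_true,       # complement of proba_True
        "proba_True": proba_true,
        "prediction": proba_true > threshold,  # assumption: strictly above
    }


# Hypothetical probabilities for three new customers.
for p in (0.20, 0.625, 0.90):
    print(score_row(p))
```

Raising the threshold makes the “True” prediction more conservative (fewer customers flagged as high value), while lowering it flags more of them.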
That’s it! You now know enough to build your first predictive model, analyze its results, and deploy it. These are the first steps towards a more complex application.
You can head to the User Guide for more specialized tutorials, tips, and how-tos.