Using MLLib in the DSS interface

October 02, 2015

Apache Spark comes with a built-in module called MLLib, which aims at creating and training machine learning models at scale.

Dataiku DSS makes it easy to use MLLib without coding, by offering it as an optional backend engine for creating models directly from within its interface.


To follow along, you need access to DSS version 2.1 or later with Spark enabled, and a working installation of Spark, version 1.4 or later (1.5 may be preferable, as the MLLib API is evolving quickly).

We use the well-known Titanic dataset here, available for instance from the corresponding Kaggle competition. Start by downloading the files, then create the two datasets, train and test.
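Before uploading the files, it can be worth checking that only the training file carries the Survived column, since that is the column the model will learn to predict. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the file names in the closing comment assume the Kaggle defaults (train.csv and test.csv).

```python
import csv

TARGET = "Survived"  # the column to predict, as used later in this walkthrough


def read_header(path):
    """Return the list of column names from the first row of a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle))


def check_titanic_files(train_path, test_path):
    """Check that the target column is in the training file only.

    Returns the columns shared by both files, in training-file order.
    """
    train_cols = read_header(train_path)
    test_cols = read_header(test_path)
    if TARGET not in train_cols:
        raise ValueError("%s is missing from %s" % (TARGET, train_path))
    if TARGET in test_cols:
        raise ValueError("%s should not be present in %s" % (TARGET, test_path))
    return [col for col in train_cols if col != TARGET and col in test_cols]


# Example call, assuming the Kaggle default file names:
#   shared_columns = check_titanic_files("train.csv", "test.csv")
```

If the check passes, the returned list is the set of columns available as features in both datasets.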

Training an MLLib model

Double-click on your train dataset, and create a new Analysis using the green button at the top right. From the Survived column header, click on Create Prediction model…:

This is the important part. In the new modal window, in the ML Backend drop-down menu, select a Spark configuration (we’ll use the default here):

Create the model. You are taken to a screen indicating that the model is ready to be trained. Do not train the model yet; instead, click on Settings:

Under the Algorithms section, activate Random Forests:

Click on Train, and wait for your task to complete. Once done, the summary results screen appears:

Your models are now trained. They are ready to be deployed, so that they can be used to score new records automatically.
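For comparison, here is a rough sketch of what training a similar random forest by hand with PySpark might look like. It is not what DSS runs internally. It assumes a recent PySpark (2.x or later) with an existing SparkSession passed in as `spark`, and the feature list, the function name and the 80/20 hold-out split are illustrative choices rather than the settings DSS applies.

```python
# Hypothetical column choices for the Kaggle Titanic training file.
LABEL = "Survived"
NUMERIC_FEATURES = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
CATEGORICAL_FEATURE = "Sex"


def train_random_forest(spark, train_path, num_trees=50, seed=42):
    """Train a random forest on the Titanic training CSV.

    Returns the fitted pipeline model and its AUC on a 20% hold-out split.
    """
    # Imported here so the module can be loaded without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    df = spark.read.csv(train_path, header=True, inferSchema=True)
    # Drop rows with missing values in the columns we use (Age has gaps).
    df = df.na.drop(subset=[LABEL, CATEGORICAL_FEATURE] + NUMERIC_FEATURES)
    train, holdout = df.randomSplit([0.8, 0.2], seed=seed)

    # Encode Sex as a numeric index, then pack all features into one vector.
    indexer = StringIndexer(inputCol=CATEGORICAL_FEATURE, outputCol="SexIndex")
    assembler = VectorAssembler(
        inputCols=NUMERIC_FEATURES + ["SexIndex"], outputCol="features"
    )
    forest = RandomForestClassifier(
        labelCol=LABEL, featuresCol="features", numTrees=num_trees, seed=seed
    )
    model = Pipeline(stages=[indexer, assembler, forest]).fit(train)

    # Area under the ROC curve on the held-out rows.
    evaluator = BinaryClassificationEvaluator(labelCol=LABEL)
    auc = evaluator.evaluate(model.transform(holdout))
    return model, auc
```

Even this short version has to deal with missing values, categorical encoding, feature assembly and evaluation explicitly, all of which the interface handles through its settings screens.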

Using an MLLib model

The Random Forest model offers the best performance. Click on it, and from the top right, select DEPLOY:

You are taken to the Flow screen. From the last green Prediction icon, select Apply and create a scoring recipe that will be used to score the test set:

Your flow is now complete:

All that remains is to build the output dataset to get the predictions!

Using MLLib in DSS can now be done entirely from the interface, without having to write complex code. This opens up great opportunities as more and more people will be able to leverage Spark to analyse data.