Using MLLib in the Dataiku DSS interface

Applies to DSS 2.1 and above | October 02, 2015

Apache Spark comes with a built-in module called MLLib, which is designed for building and training machine learning models at scale.

Dataiku DSS makes it easy to use MLLib without coding, by offering it as an optional backend engine for creating models directly from within its interface.


This tutorial assumes that you have access to DSS version 2.1 or later with Spark enabled, and a working installation of Spark 1.4 or later (1.5 may be preferable, as the MLLib API is evolving quickly).

We use the well-known Titanic dataset here, available for instance from the corresponding Kaggle competition. Start by downloading the files, then create the two datasets, train and test.
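Before creating the datasets, it can help to confirm that the two downloaded files have the expected shape. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes that the target column is named `Survived` (as used later in this tutorial) and that the test file carries the same columns minus the target.

```python
import csv
import io


def read_header(csv_file):
    """Return the column names found on the first row of an open CSV file."""
    return next(csv.reader(csv_file))


def check_titanic_files(train_file, test_file, target="Survived"):
    """Check that the target is present in train only, and return both headers."""
    train_cols = read_header(train_file)
    test_cols = read_header(test_file)
    if target not in train_cols:
        raise ValueError("train file has no %r column" % target)
    if target in test_cols:
        raise ValueError("test file unexpectedly contains %r" % target)
    # Every column available at scoring time should also exist at training time.
    unknown = [c for c in test_cols if c not in train_cols]
    if unknown:
        raise ValueError("test-only columns: %s" % ", ".join(unknown))
    return train_cols, test_cols


# Tiny in-memory stand-ins for the real files (made-up rows, for illustration).
sample_train = io.StringIO("PassengerId,Survived,Pclass,Sex\n1,0,3,male\n")
sample_test = io.StringIO("PassengerId,Pclass,Sex\n2,1,female\n")
train_cols, test_cols = check_titanic_files(sample_train, sample_test)
print(train_cols)
print(test_cols)
```

With the real downloads, the same check would be run on files opened with `open(path, newline="")`, where `path` is wherever you saved them.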

Training an MLLib model

Double-click on your train dataset, and create a new Analysis using the green button at the top right. From the Survived column header, click on Create Prediction model….

Figure: Creating a new prediction model from within a Dataiku analysis

This is the important part. In the new modal window, in the ML Backend drop-down menu, select a Spark configuration (we’ll use the default here).

Figure: Choosing Spark as the machine learning backend

Create the model. You are taken to a screen telling you that the model is ready to be trained. Do not train the model yet; instead, click on Settings.

Figure: Where to find the Settings of a new prediction model

Under the Algorithms section, activate Random Forests.

Figure: Choosing the algorithms to run in a prediction model

Click on Train, and wait for the task to complete. Once it is done, the summary results screen appears.

Figure: Summary results screen for a prediction model

Your models are now trained. They are ready to be deployed, so that scoring new records can be automated.

Using an MLLib model

The Random Forests model offers the best performance. Click on it, and from the top right, select DEPLOY.

Figure: Deploying a prediction model

You are taken to the Flow screen. From the last green Prediction icon, select Apply and create a scoring recipe that will be used to score the test set.

Figure: Creating a Scoring recipe from a deployed model

Your flow is now complete.

Figure: Completed flow with deployed prediction model built using MLLib algorithms

All that remains is to build the dataset to get the predictions!
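For comparison, here is a rough sketch of what a hand-written version of this train-and-score flow might look like with Spark's Python API. It is not the code DSS generates or runs. It assumes PySpark 2.0 or later (for `SparkSession`, which is newer than the Spark 1.4/1.5 releases mentioned earlier), the Kaggle Titanic column names, and caller-supplied file paths. The choice of features, the number of trees, and the dropping of rows with missing values are arbitrary illustrative decisions.

```python
# Illustrative sketch only: NOT the code DSS generates or runs.
# Assumes PySpark 2.0+ and the Kaggle Titanic column names.

LABEL = "Survived"
NUMERIC_FEATURES = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
CATEGORICAL_FEATURES = ["Sex"]


def build_pipeline(num_trees=100):
    """Return an unfitted pipeline: index categories, assemble features, random forest."""
    # Imported inside the function so this file can be loaded without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # Turn each string column into a numeric index column.
    indexers = [
        StringIndexer(inputCol=c, outputCol=c + "_idx")
        for c in CATEGORICAL_FEATURES
    ]
    # Pack all feature columns into the single vector column Spark ML expects.
    assembler = VectorAssembler(
        inputCols=NUMERIC_FEATURES + [c + "_idx" for c in CATEGORICAL_FEATURES],
        outputCol="features",
    )
    forest = RandomForestClassifier(
        labelCol=LABEL, featuresCol="features", numTrees=num_trees, seed=42
    )
    return Pipeline(stages=indexers + [assembler, forest])


def train_and_score(train_path, test_path):
    """Fit the pipeline on the train file and return predictions for the test file."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("titanic-rf-sketch").getOrCreate()
    train = spark.read.csv(train_path, header=True, inferSchema=True)
    test = spark.read.csv(test_path, header=True, inferSchema=True)

    # VectorAssembler fails on null values, so drop incomplete rows (a crude choice).
    used = NUMERIC_FEATURES + CATEGORICAL_FEATURES
    train = train.dropna(subset=used)
    test = test.dropna(subset=used)

    model = build_pipeline().fit(train)
    return model.transform(test).select("PassengerId", "prediction")
```

On a machine with Spark available, calling `train_and_score` with the paths of the two downloaded files would return a DataFrame holding one predicted value per remaining test passenger. The interface covers the same ground (feature handling, training, and scoring) through the steps described above.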

Using MLLib in DSS can now be done entirely from the interface, without having to write complex code. This opens up great opportunities as more and more people will be able to leverage Spark to analyse data.